TinySLFL: A Flash-Endurance-Aware Federated Edge Learning Framework with Layer-Wise Delayed Aggregation for Resource-Constrained Microcontrollers

Tao, Yiru; Jia, Juncheng; Deng, Tao

doi:10.3390/electronics15102084

by Yiru Tao, Juncheng Jia * and Tao Deng

School of Computer Science and Technology, Soochow University, Suzhou 215006, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2084; https://doi.org/10.3390/electronics15102084

Submission received: 15 April 2026 / Revised: 7 May 2026 / Accepted: 11 May 2026 / Published: 13 May 2026

(This article belongs to the Special Issue Federated Edge Learning: Models, Mechanisms, Algorithms, and Applications)

Abstract

Federated edge learning on microcontrollers (MCUs) enables privacy-preserving adaptation, but on-device training faces a hardware tradeoff: fitting backpropagation into a limited static random-access memory (SRAM) often relies on on-chip flash as auxiliary storage, while repeated parameter persistence rapidly consumes finite program/erase (P/E) endurance. This paper proposes TinySLFL, a flash-endurance-aware federated learning framework for resource-constrained MCUs. On the client, layer-wise training bounds the peak SRAM usage to one layer, and delayed aggregation keeps intermediate updates in SRAM so that each communication round incurs only one flash persistence. On the server, dynamic aggregation combines loss-aware freezing with proxy-accuracy-guided filtering to improve the robustness under non-independently and identically distributed (Non-IID) data while suppressing unnecessary rounds. Experiments on CIFAR-10 and SVHN under a severe Dirichlet label skew and on a naturally heterogeneous FEMNIST showed, in a server-side simulation, that TinySLFL reduces the cumulative protocol-level erase-block operations (EOs) required to reach a common target accuracy by 97.8–98.6% relative to sequential layer training (SLT) and improves the mean Top-1 accuracy by up to 5.24 percentage points over the same ResNet-8 backbone in a five-seed evaluation. The power, latency, SRAM, and deployment feasibility were reported from actual ESP32-S3 measurements. These results demonstrate durable federated learning for extreme-edge MCUs.

Keywords:

federated learning; on-device training; microcontroller; flash endurance

1. Introduction

The frontier of tiny machine learning (TinyML) is rapidly shifting from inference-only deployment toward continuous on-device learning on commodity microcontrollers (MCUs) [1]. Rather than treating an MCU as a static inference endpoint that ships a frozen model, emerging applications—adaptive wearables, self-calibrating industrial sensors, and privacy-sensitive smart-home devices—require the device itself to refine its model using locally observed data. This shift promises a lower latency, reduced cloud dependency, and fundamentally better data privacy, but it also exposes a set of hardware constraints that had been largely hidden during the inference-only era.

Two such constraints dominate the MCU regime. The first is the well-known SRAM memory wall: with typically 128–512 KB of SRAM, an MCU cannot simultaneously hold the weights, activations, gradients, and optimizer state required by standard backpropagation, even for a modest convolutional network [2,3]. The second, and less appreciated, is the finite endurance of on-chip flash memory. Unlike DRAM, flash cells wear out after a bounded number of program/erase (P/E) cycles, typically $10^{4}$ to $10^{5}$ for commodity embedded flash [4,5]. During inference, this is a non-issue because the weights are written only once at deployment time. During training, however, every weight update is a candidate flash write, and the high-frequency, iterative nature of stochastic gradient descent turns the flash array into a consumable resource.

The naive strategy of treating flash as an SRAM extension—swapping layers in and out across the memory hierarchy to overcome the SRAM wall—is therefore not merely inefficient; it is physically destructive. Write amplification from log-structured file systems such as LittleFS, combined with the sheer volume of per-step updates, can exhaust the P/E budget of an ESP32-class device in a matter of days, long before the model has converged. On-device training that ignores flash endurance is a short-lived operation, not a sustainable capability.

Federated learning (FL) [6], the natural paradigm for privacy-preserving collaborative adaptation across distributed IoT devices, acts as a threat multiplier on top of this constraint. Standard FL protocols such as FedAvg require every participating device to persist the freshly received global model and to emit updated parameters every round. Each synchronization round therefore forces at least one full-model overwrite of flash. Under realistic non-independently and identically distributed (Non-IID) client data [7,8], hundreds of rounds are commonly needed for convergence; combined with layer-wise swapping, a single client can incur on the order of $10^{5}$–$10^{6}$ flash writes before reaching the target accuracy. For commodity MCUs, this is indistinguishable from scheduled hardware failure.

This conflict—the algorithmic need for frequent synchronization versus the physical need to minimize flash writes—is not addressable by tuning the hyperparameters or pruning weights alone. It calls for a co-design of the training protocol, the aggregation strategy, and the underlying storage access pattern. Prior on-device FL work has made important progress on individual aspects of this problem. Llisterri Giménez et al. [9] demonstrated real on-device FL for keyword spotting on MCUs, but restricted training to a small fully connected head, sidestepping both the SRAM wall and the endurance problem rather than solving them. Sha et al. [10] reduced the communication cost through asynchronous edge-assisted aggregation, but their setting assumed edge servers with abundant memory, not 512 KB SRAM endpoints. Sequential layer training (SLT) [11] tackles the SRAM wall by serializing layer updates, yet by persisting each layer in turn, it amplifies flash wear rather than relieving it. To the best of our knowledge, no prior work has treated flash P/E endurance as a first-class optimization target jointly with accuracy.

We propose TinySLFL, a federated learning framework for commodity MCUs in which hardware longevity and Non-IID robustness are optimized simultaneously. Our design rests on four coupled mechanisms: (i) layer-wise local training partitions the computation so that peak SRAM usage is bounded by the footprint of the largest single layer rather than the whole network, breaking the SRAM wall without architectural surgery; (ii) delayed aggregation decouples local iteration from flash persistence—intermediate layer updates accumulate in volatile SRAM and are streamed directly to the server, so that each global round incurs exactly one full-model flash overwrite, independent of the number of layers or local epochs; (iii) a dynamic server-side aggregation strategy, combining a per-layer learning-rate schedule, loss-sensing layer freezing, and accuracy-guided selective aggregation, accelerates convergence under Non-IID data, thereby reducing the total number of communication rounds and flash overwrites over the device lifetime; and (iv) a fault-tolerant streaming protocol based on round/client/layer sequence numbers with CRC-checked idempotent reception ensures that the above guarantees survive the packet loss and sudden power failures that are endemic to real MCU deployments.

We validated TinySLFL through a large-scale server-side simulation on CIFAR-10 and SVHN under a severe Dirichlet Non-IID split ($α = 0.1$), and on FEMNIST under its native writer-partitioned heterogeneity, together with real hardware experiments on an ESP32-S3 (Espressif Systems, Shanghai, China) testbed. Across all three datasets, TinySLFL attained the highest mean final accuracy across five seeds and reduced the cumulative protocol-level erase-block operations to a common target accuracy by 97.8–98.6% relative to SLT; on CIFAR-10, for example, the cumulative simulated protocol-level erase-block volume required to reach 65% dropped from 24,552 to 342. This value is not a real-device wear measurement; the hardware study separately reports the measured hottest-block erase increments, latency, SRAM, energy, and deployment feasibility on physical ESP32-S3 boards. An ablation study and a hyperparameter sensitivity analysis further support the contributions of the server-side design.

The principal contributions of this paper are as follows.

We identify and formalize flash P/E endurance as a first-class optimization target in federated learning on MCUs, and provide an end-to-end device-lifetime evaluation methodology that complements the single-round cost metrics common in prior work.
We design TinySLFL, a hardware-aware FL framework whose layer-wise training and delayed aggregation are provably SRAM-bounded and reduce the per-round flash writes from $O (K)$ to $O (1)$, where K is the network depth.
We propose a three-mechanism dynamic aggregation strategy that improves Non-IID accuracy and shortens the round budget, and complement it with a fault-tolerant streaming protocol for realistic lossy-network MCU deployments.
We provide experiments spanning three vision benchmarks, real ESP32-S3 measurements, full-module ablation, and a hyperparameter sensitivity analysis, establishing TinySLFL as a practical pathway for long-term, sustainable on-device federated adaptation.

The remainder of this paper is organized as follows. Section 2 reviews related work on TinyML on-device training, federated aggregation, and flash endurance. Section 3 formalizes the SRAM and flash-endurance constraints and states our optimization objective. Section 4 presents the TinySLFL framework in detail. Section 5 reports the simulation and on-device experiments. Section 6 discusses limitations and broader implications, and Section 7 concludes.

2. Related Work

TinySLFL relates to three research threads: on-device training on resource-constrained microcontrollers, federated learning under resource and data heterogeneity, and flash memory endurance in embedded systems.

2.1. On-Device Training on Microcontrollers

Early TinyML systems focused on inference, assuming that models were trained offline and deployed in frozen form [1,12]. More recent work has studied on-device training [2], where the main obstacle is the SRAM limit of commodity MCUs: backpropagation must store parameters, activations, gradients, and optimizer states within only 128–512 KB of SRAM. The existing solutions mainly reduce this footprint through quantization or mixed precision [13], checkpointing [14], or partial fine-tuning such as TinyTL [15]. Lin et al. [16] instead treated flash as virtual memory to page layer states between flash and SRAM during training. These methods mitigate SRAM pressure, but they do not explicitly optimize the flash cost of repeated persistence; swap-based schemes can substantially increase the write frequency.

2.2. Federated Learning Under Resource and Data Heterogeneity

FedAvg [6] is the standard FL baseline, but its convergence degrades under Non-IID data [7]. Methods such as FedProx [8] improve the robustness to client drift, while EEFL [10] reduces communication through hierarchical edge-assisted aggregation. However, these methods target optimization or topology design rather than MCU storage limits. EEFL in particular assumes memory-rich intermediate edge servers and repeated full-model synchronization, which is mismatched to commodity MCUs.

More relevant are FL systems designed for microcontrollers. Llisterri Giménez et al. [9] demonstrated end-to-end on-device FL on real MCUs, but restricted training to a small fully connected head. TinyFedTL [17] and TinyMetaFed [18] follow similar lightweight adaptation strategies. SLT [11] is closer to our setting because it serializes layer updates to fit memory budgets, yet it persists each layer in turn, making per-round flash writes grow with the network depth.

2.3. Flash Memory Endurance and Wear-Aware Systems

Embedded NOR flash offers a high density, but a limited endurance, typically $10^{4}$–$10^{5}$ P/E cycles per block [4,5]. File systems such as LittleFS can further amplify physical writes through journaling and wear leveling [19]. Prior systems work has studied wear from the storage perspective, and hardware proposals such as FlipBit [20] aim to improve the endurance at the controller level. However, already-deployed MCUs must operate within fixed hardware limits. The existing FL methods for MCUs rarely treat the protocol-induced flash persistence and endurance cost as an explicit objective. TinySLFL addresses this gap by decoupling local training from intermediate persistence, reducing the per-round writes from $O (K)$ to $O (1)$ while combining this with server-side aggregation for better Non-IID robustness.

3. Preliminaries and Problem Formulation

We formalize the federated learning setting on commodity MCUs and the two coupled hardware constraints that motivate TinySLFL.

3.1. System Model and Federated Objective

We considered one server and N resource-constrained clients indexed by $C = \{1, 2, \dots, N\}$ that collaboratively train a global model $W = \{W^{1}, \dots, W^{K}\}$. At round t, the server broadcasts $W^{(t - 1)}$, client i updates it on private data $D_{i}$, and the server aggregates the uploads. The global objective is the weighted empirical risk

$$\min_{W} F (W) ≜ \sum_{i = 1}^{N} \frac{| D_{i} |}{| D |} L_{i} (W), \tag{1}$$

where $| D_{i} |$ is the number of local training samples held by client i, $| D | = \sum_{i} | D_{i} |$ is the total number of samples over all clients, and $L_{i} (W)$ denotes the local empirical loss. We used $W^{(t)}$ for the global model after communication round t, $W^{(t), k}$ for its k-th layer, and $\hat{W}_{i}^{(t), k}$ for the layer-k update uploaded by client i in round t. Thus, k always indexes layers, whereas t indexes communication rounds. We considered the Non-IID setting used later in Section 5. On commodity MCUs, solving (1) is constrained by the SRAM capacity and flash endurance.

3.2. Constraint I: The SRAM Memory Wall

Commodity MCUs typically provide only 128–512 KB of SRAM, which is insufficient for whole-network backpropagation. Training must hold the parameters, activations, gradients, and optimizer state simultaneously; to avoid confusing the layer index k with an optimizer constant, we denote the optimizer-state multiplier by $c_{opt}$, with $c_{opt} = 1$ for SGD with momentum and $c_{opt} = 2$ for Adam. We therefore considered layer-wise execution, where only one layer is active in SRAM and the rest remain in flash. Let $M_{k}$ denote the peak SRAM footprint required to train layer k:

$$M_{k} \approx (1 + c_{opt}) | W^{k} | + A_{k}^{max} + B_{ws}, \tag{2}$$

where $A_{k}^{max}$ is the peak activation footprint of layer k and $B_{ws}$ is a small workspace. Feasibility requires

$$\max_{k \in \{1, \dots, K\}} M_{k} \leq S_{max}, \tag{3}$$

so the model width, depth, batch size, and optimizer choice are directly constrained by the device SRAM limit $S_{max}$.
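The feasibility test in (2) and (3) can be sketched in a few lines of Python. This is only an illustration: the `Layer` type, the layer sizes, the activation peaks, and the 8 KB workspace are hypothetical placeholders rather than values measured for the ResNet-8 backbone.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    weight_bytes: int    # |W^k|
    peak_act_bytes: int  # A_k^max


def layer_footprint(layer: Layer, c_opt: int, workspace_bytes: int) -> int:
    """Peak SRAM estimate M_k following Eq. (2)."""
    return (1 + c_opt) * layer.weight_bytes + layer.peak_act_bytes + workspace_bytes


def is_feasible(layers, c_opt, workspace_bytes, s_max_bytes):
    """Eq. (3): the largest single-layer footprint must fit within S_max."""
    peak = max(layer_footprint(l, c_opt, workspace_bytes) for l in layers)
    return peak <= s_max_bytes, peak


# Hypothetical three-layer model (all sizes in bytes, chosen for illustration).
layers = [Layer(4_000, 65_536), Layer(18_000, 32_768), Layer(36_000, 16_384)]

# SGD with momentum (c_opt = 1) against a 512 KB SRAM budget.
ok_sgd, peak_sgd = is_feasible(layers, c_opt=1, workspace_bytes=8_192,
                               s_max_bytes=512 * 1024)
```

Because the check takes a maximum over layers instead of a sum, adding depth does not raise the bound; only the widest layer, the batch-dependent activation peak, and the optimizer multiplier do.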

3.3. Constraint II: Flash Endurance

On-chip flash has a limited program/erase (P/E) budget, typically $10^{4}$–$10^{5}$ cycles per block for commodity NOR flash [4]. Let B denote the erase-block size ($B = 4$ KB in our ESP32-S3 deployment) and define

$$ϕ (X) = ⌈X / B⌉ \tag{4}$$

as the protocol-level erase-block volume required to persist X bytes under block-aligned accounting. An EO is a storage-protocol surrogate rather than a direct physical-wear count. The exact device lifetime is governed by the most heavily worn flash block, and therefore also depends on file-system allocation, journaling, the wear-leveling behavior, and the storage layout. To isolate the protocol contribution, we analyzed the block-operation volume induced by the learning schedule and deferred the hottest-block wear to the empirical evaluation in Section 5. Under a naive layer-wise scheme, each completed layer is written back to flash before the next layer is loaded, so the per-round block-operation volume of client i is

$$\mathrm{EO}_{round}^{(i)} = \underbrace{ϕ (| W |)}_{\text{global-sync write}} + \underbrace{\sum_{k = 1}^{K} ϕ (| W^{k} |)}_{\text{layer-wise write amplification}} . \tag{5}$$

The first term is the full-model persistence used by protocols that keep a local recovery snapshot of the synchronized model; the second is depth-dependent write amplification caused by layer-wise write-back. Because MCU file systems such as LittleFS may introduce additional journaling and wear-leveling overhead [19], (5) should be read as a protocol-level accounting lower bound for this snapshot-persisting protocol class, not as a complete model of physical flash wear.
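To make the block-aligned accounting in (4) and (5) concrete, the following Python sketch counts erase blocks for a naive layer-wise round. The five layer sizes are hypothetical and were chosen only so that they sum to 70,000 bytes, close to the roughly 70 KB backbone used later; the function names are illustrative.

```python
BLOCK_BYTES = 4 * 1024  # erase-block size B (4 KB, as stated for the ESP32-S3)


def phi(num_bytes: int, block_bytes: int = BLOCK_BYTES) -> int:
    """Eq. (4): erase blocks needed to persist num_bytes (integer ceiling)."""
    return -(-num_bytes // block_bytes)


def eo_round_naive(layer_bytes) -> int:
    """Eq. (5): one full-model snapshot plus one write-back per layer."""
    return phi(sum(layer_bytes)) + sum(phi(b) for b in layer_bytes)


# Hypothetical per-layer sizes in bytes (not taken from the paper).
layer_bytes = [2_000, 6_000, 10_000, 20_000, 32_000]

snapshot_blocks = phi(sum(layer_bytes))        # first term of Eq. (5)
naive_blocks = eo_round_naive(layer_bytes)     # both terms of Eq. (5)
```

Note that the second term rounds up once per layer, so many small layers cost more blocks than one contiguous write of the same total size.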

3.4. Co-Optimization Objective

The two constraints are coupled: using flash to relieve the SRAM pressure increases the write amplification, whereas keeping more state in SRAM may violate (3). We therefore optimized a protocol $π$ that includes the local execution schedule, persistence policy, and server aggregation rule. Because exact hottest-block wear depends on the storage layout, we minimized the protocol-induced erase-block operations as a surrogate objective and then validated the hottest-block lifetime directly on the hardware. For a target accuracy $τ$, we sought

$$\min_{π} \mathrm{EO} (T; π) \quad \text{s.t.} \quad \mathrm{acc} (W^{(T)}) \geq τ, \quad \max_{k} M_{k} \leq S_{max}, \tag{6}$$

where $\mathrm{EO} (T; π) = \sum_{t = 1}^{T} \mathrm{EO}_{round, t}^{(i)}$ is the cumulative protocol-level erase-block volume over the T rounds required to reach $τ$. Here, $T (π, τ)$ denotes the number of communication rounds required by protocol $π$ to reach the target accuracy $τ$, and $\overline{\mathrm{EO}}_{round} (π)$ denotes the average per-round protocol-level erase-block operation volume. These EO terms are used for algorithmic co-optimization; the physical lifetime is later tied to the measured hottest-block wear. Equation (6) naturally factorizes as

$$\mathrm{EO} (T; π) = \underbrace{T (π, τ)}_{\text{rounds to reach } τ} \cdot \underbrace{\overline{\mathrm{EO}}_{round} (π)}_{\text{per-round block operations}}, \tag{7}$$

so a practical solution must reduce both the per-round protocol-induced flash persistence and the number of rounds needed to reach the target accuracy. TinySLFL addresses the first via delayed aggregation (Section 4.2) and the second via dynamic aggregation (Section 4.3).
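As a quick arithmetic check on this objective, the CIFAR-10 totals quoted in the Introduction (24,552 cumulative protocol-level erase-block operations for SLT versus 342 for TinySLFL at the 65% target) give the relative reduction below. Only the two cumulative totals are used; how each total splits between the round count and the per-round volume in (7) is not assumed here.

```latex
\[
1 - \frac{\mathrm{EO}(T;\pi_{\mathrm{TinySLFL}})}{\mathrm{EO}(T;\pi_{\mathrm{SLT}})}
  = 1 - \frac{342}{24{,}552}
  \approx 1 - 0.0139
  = 0.986 ,
\]
```

That is, a reduction of about 98.6%, which matches the upper end of the 97.8–98.6% range reported across the three datasets.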

4. The TinySLFL Framework

We propose TinySLFL, a client–server co-design for federated learning on commodity MCUs, as illustrated in Figure 1. On the client, layer-wise training bounds the SRAM footprint to that of one layer, and delayed aggregation streams completed updates to the server without intermediate flash write-back. On the server, dynamic aggregation combines layer-wise learning-rate scheduling, loss-aware freezing, and accuracy-guided filtering to reduce the round budget T and thus the cumulative protocol-level flash persistence.

4.1. Problem Formulation

TinySLFL addresses the co-optimization problem defined in Section 3. The client protocol minimizes the per-round block-operation term $\overline{\mathrm{EO}}_{round} (π)$ by avoiding intermediate flash persistence, while the server controller reduces the round budget $T (π, τ)$ under Non-IID data. Together, these two components target both factors in (7).

4.2. Client Side: Layer-Wise Training with Delayed Aggregation

4.2.1. Layer-Wise Training Scheduler

To satisfy the SRAM constraint in (3), TinySLFL decomposes local training into K serial single-layer updates. At any instant, only layer k, its gradients, and the optimizer state reside in SRAM; the remaining layers stay in flash. The local update of client i for layer k in round t is

$$\hat{W}_{i}^{(t), k} \leftarrow \mathrm{LocalSGD} (W^{(t - 1), k}, D_{i}, η_{t}^{k}, E), \tag{8}$$

where $η_{t}^{k}$ is the server-broadcast layer-wise learning rate and E is the local epoch count (the symbol $τ$ is reserved for the target accuracy in $T (π, τ)$). Hence, the peak SRAM footprint is bounded by the largest single-layer footprint rather than the full model size.

4.2.2. Delayed-Aggregation Protocol

A naive layer-wise implementation writes each finished layer back to flash before loading the next one, causing depth-dependent write amplification. TinySLFL instead performs only one flash persistence at the round start: the client stores $W^{(t - 1)}$ as a recovery snapshot, updates each layer in SRAM, streams $(t, i, k, \hat{W}_{i}^{(t), k}, Δ L_{i, t}^{k})$ to the server with a CRC check, and releases the layer buffers immediately after successful upload. Retransmission is layer-granular, and the server enforces idempotent reception keyed by $(t, i, k)$. The resulting per-round block-operation volume becomes

$$\mathrm{EO}_{round}^{TinySLFL} = ϕ (| W |), \tag{9}$$

which is independent of the network depth K and the local epoch count E. Within the class of protocols that persist a full local recovery snapshot every round, (9) reaches the per-round lower bound on protocol-induced flash persistence; this does not cover stateless or streaming-only designs. The complete client procedure is Algorithm 1, where $W_{i, tmp}^{k}$ is the temporary in-SRAM copy of layer k trained by client i.

Algorithm 1 Client side: layer-wise training with delayed streaming upload

Require: Round t; global model $W^{(t - 1)}$; local data $D_{i}$; layer count K; per-layer learning rates $\{η_{t}^{k}\}$; local epochs E
Ensure: Layer-wise updates $\{\hat{W}_{i}^{(t), k}\}$ and loss statistics $\{Δ L_{i, t}^{k}\}$ uploaded to the server

Receive $W^{(t - 1)}$; persist to Flash ▹ Phase 1: single write
for $k = 1$ to K do ▹ Phase 2: streaming layer-wise upload
    if $η_{t}^{k} = 0$ then
        continue ▹ layer frozen by server
    end if
    Load $W^{(t - 1), k}$ into SRAM; allocate gradient/optimizer buffers
    $W_{i, tmp}^{k} \leftarrow W^{(t - 1), k}$
    for $e = 1$ to E do
        $W_{i, tmp}^{k} \leftarrow \mathrm{LocalSGD} (W_{i, tmp}^{k}, D_{i}, η_{t}^{k})$
    end for
    $\hat{W}_{i}^{(t), k} \leftarrow W_{i, tmp}^{k}$
    Compute layer-wise loss reduction $Δ L_{i, t}^{k}$
    Send $(t, i, k, \hat{W}_{i}^{(t), k}, Δ L_{i, t}^{k})$ with CRC
    Free SRAM of layer k ▹ Phase 3: zero Flash write-back
end for
return round completion after all non-frozen layers have been uploaded or skipped
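The control flow of Algorithm 1 can be mimicked on a host machine with the Python sketch below. It is not the device firmware: `FlashCounter`, `make_packet`, `client_round`, the packet layout, and the `train_layer` callback (which stands in for the E local epochs) are all hypothetical, and layers are plain byte strings indexed from 0. The only point it illustrates is that one round triggers a single snapshot write regardless of how many layers are trained.

```python
import struct
import zlib


class FlashCounter:
    """Toy flash model (hypothetical): keeps one snapshot and counts erase blocks."""

    def __init__(self, block_bytes=4096):
        self.block_bytes = block_bytes
        self.blocks_written = 0
        self.snapshot = []

    def persist(self, layers):
        total = sum(len(layer) for layer in layers)
        self.blocks_written += -(-total // self.block_bytes)  # ceiling division
        self.snapshot = [bytes(layer) for layer in layers]


def make_packet(t, i, k, payload, delta_loss):
    """Frame (t, i, k, delta_loss, payload) and append a CRC-32 of the frame."""
    body = struct.pack("<IIIf", t, i, k, delta_loss) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def client_round(t, i, global_layers, lrs, train_layer, send, flash):
    flash.persist(global_layers)              # Phase 1: the only flash write
    for k, lr in enumerate(lrs):              # Phase 2: one layer at a time
        if lr == 0:
            continue                          # layer frozen by the server
        tmp = bytearray(flash.snapshot[k])    # in-SRAM working copy of layer k
        updated, delta_loss = train_layer(k, tmp, lr)
        send(make_packet(t, i, k, bytes(updated), delta_loss))
        del tmp, updated                      # Phase 3: release, no write-back


# Toy run: three layers, the middle one frozen (learning rate 0).
def fake_train(k, buf, lr):
    buf[0] = (buf[0] + 1) % 256               # stand-in for a real update
    return buf, 0.05


sent = []
flash = FlashCounter()
client_round(1, 7, [bytes(3000), bytes(5000), bytes(9000)],
             [0.01, 0.0, 0.01], fake_train, sent.append, flash)
```

In this toy run two packets are emitted and the flash counter advances once, by the block count of the 17,000-byte snapshot.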

4.2.3. Fault Tolerance, State Alignment, and Straggler Control

TinySLFL also handles common deployment failures. Each client records the round index together with the round-start snapshot; after reboot, the server returns the set of layers already received so that the client resumes from the first missing layer without extra flash writes. CRC-protected, idempotent reception rejects duplicates and limits retransmission to the failed layer only. To bound the latency, the server uses a soft deadline $T_{round}$ and excludes stragglers from round t while allowing them to rejoin in round $t + 1$ through the same recovery path.
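A minimal server-side sketch of the idempotent, CRC-checked reception described above is given here. The class name `LayerReceiver`, its return strings, and the in-memory dictionary are assumptions made for illustration; the actual wire format and acknowledgment scheme are not specified in the text.

```python
import zlib


class LayerReceiver:
    """Accepts each (round, client, layer) key at most once, after a CRC check."""

    def __init__(self):
        self.received = {}  # (t, i, k) -> payload bytes

    def accept(self, t, i, k, payload: bytes, crc: int) -> str:
        if zlib.crc32(payload) != crc:
            return "retransmit"   # corrupted: request this layer only
        key = (t, i, k)
        if key in self.received:
            return "duplicate"    # already stored: acknowledge, do not re-apply
        self.received[key] = payload
        return "stored"

    def missing_layers(self, t, i, active_layers):
        """Layers that a rebooted client still has to upload in round t."""
        return [k for k in active_layers if (t, i, k) not in self.received]


rx = LayerReceiver()
payload = b"layer-0-weights"
first = rx.accept(1, 7, 0, payload, zlib.crc32(payload))
again = rx.accept(1, 7, 0, payload, zlib.crc32(payload))
todo = rx.missing_layers(1, 7, [0, 1, 2])
```

Keying on the full triple means a retransmitted layer can never be aggregated twice, and `missing_layers` is the information a client needs after a reboot to resume from the first missing layer.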

4.3. Server Side: Dynamic Aggregation

Even with per-round protocol-induced flash persistence minimized, severe Non-IID data can still increase the round budget T. The server therefore applies three layer-resolved mechanisms to improve convergence and reduce the cumulative protocol-level persistence.

4.3.1. Layer-Wise Learning-Rate Scheduling

Different layers exhibit different gradient sensitivities, so TinySLFL uses

$$η_{t}^{k} = η_{0} s_{k} \frac{1}{1 + α t}, \tag{10}$$

where $η_{0}$ is the initial learning rate, $s_{k}$ is a per-layer scaling factor, and $α$ controls the temporal decay. This gives larger early steps for fast progress and smaller later steps for stability.
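The schedule in (10) is an inverse-time decay scaled per layer. The short Python sketch below evaluates it; the decay constant 0.1 and the scaling factors in `scales` are hypothetical, since the text fixes only the initial rate 0.01 in Table 1.

```python
def layer_lr(eta0: float, s_k: float, alpha: float, t: int) -> float:
    """Eq. (10): inverse-time decay with a per-layer scaling factor s_k."""
    return eta0 * s_k / (1.0 + alpha * t)


# Hypothetical per-layer scales: earlier layers move more slowly than the head.
scales = [0.5, 1.0, 2.0]
rates_round_0 = [layer_lr(0.01, s, alpha=0.1, t=0) for s in scales]
rates_round_10 = [layer_lr(0.01, s, alpha=0.1, t=10) for s in scales]
```

With this decay constant every rate is halved by round 10, while the ratio between layers stays fixed by their scaling factors.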

4.3.2. Loss-Aware Freezing

To suppress redundant computation and communication on converged layers, the server tracks the relative loss reduction that each client reports alongside its layer update (Algorithm 1):

$$Δ L_{i, t}^{k} ≜ \frac{L_{i, t}^{k, 0} - L_{i, t}^{k, E}}{L_{i, t}^{k, 0} + 10^{- 8}}, \tag{11}$$

where $L_{i, t}^{k, 0}$ and $L_{i, t}^{k, E}$ are the pre- and post-local-training losses for layer k, and $10^{- 8}$ is a numerical stabilizer. The server averages this quantity over participating clients $Π_{t}$ and applies an exponential moving average (EMA):

$$\overline{Δ L}_{t}^{k} = β \overline{Δ L}_{t - 1}^{k} + (1 - β) \frac{1}{| Π_{t} |} \sum_{i \in Π_{t}} Δ L_{i, t}^{k}, \tag{12}$$

with smoothing factor $β$. After a warm-up window, layer k is frozen by setting $η_{t + 1}^{k} \leftarrow 0$ once $\overline{Δ L}_{t}^{k} \leq ϵ$, so later rounds skip it while it remains frozen. Freezing is reversible in two ways. First, when layer k is newly frozen and layer k + 1 is already in the frozen set, the scheduler may unfreeze layer k + 1 by removing it from the frozen set and assigning it a positive learning rate (Algorithm 2). Second, if the round-end proxy accuracy drops by more than $γ$ below the best value observed so far, recently frozen or drift-sensitive layers are removed from the frozen set and assigned $\max (η_{min}, η_{0} s_{k} / (1 + α t))$. This reactivation rule is heuristic and proxy-dependent.
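The freezing test built from (11) and (12) can be sketched as follows. The defaults 0.9 and 0.01 match the smoothing factor and freezing threshold listed in the experimental setup; the warm-up length of five rounds is a hypothetical choice, because the text does not state the window size, and the function names are illustrative.

```python
STABILIZER = 1e-8


def relative_loss_reduction(loss_before: float, loss_after: float) -> float:
    """Eq. (11): relative drop in the layer loss over local training."""
    return (loss_before - loss_after) / (loss_before + STABILIZER)


def update_ema(prev_ema: float, client_reductions, beta: float = 0.9) -> float:
    """Eq. (12): EMA of the client-averaged relative loss reduction."""
    mean = sum(client_reductions) / len(client_reductions)
    return beta * prev_ema + (1.0 - beta) * mean


def should_freeze(ema: float, t: int, eps: float = 0.01,
                  warmup_rounds: int = 5) -> bool:
    """Freeze once the smoothed reduction falls to eps, but only after warm-up."""
    return t > warmup_rounds and ema <= eps


# One layer whose clients stop improving: the EMA decays toward zero.
ema = 0.10
history = []
for t in range(1, 31):
    ema = update_ema(ema, [0.0, 0.0, 0.0])
    history.append(should_freeze(ema, t))
first_frozen_round = history.index(True) + 1
```

Starting from an EMA of 0.10 with no further improvement, the smoothed value needs 0.9^t ≤ 0.1, so the layer is first frozen at round 22. This illustrates how a smoothing factor of 0.9 delays freezing well beyond any short warm-up.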

4.3.3. Accuracy-Guided Selective Aggregation

Under severe Non-IID heterogeneity, some client updates may harm the global model. TinySLFL therefore validates each candidate layer update before committing it. Let $W_{cand}^{k} = \mathrm{Aggregate} (\{\hat{W}_{i}^{(t), k}\}_{i \in Π_{t}})$ denote the candidate aggregate for layer k, and let $W'$ be the temporary model formed by replacing layer k of $\tilde{W}^{(t, k - 1)}$ with $W_{cand}^{k}$, where $\tilde{W}^{(t, k - 1)}$ is the round-t model after layers 1 through k − 1 have been processed and $\tilde{W}^{(t, 0)} = W^{(t - 1)}$. The proxy-set accuracy gain is

$$Δ \mathrm{acc}_{t}^{k} ≜ \mathrm{acc} (W') - \mathrm{acc} (\tilde{W}^{(t, k - 1)}) . \tag{13}$$

The candidate is committed only if $Δ \mathrm{acc}_{t}^{k} > δ$; otherwise, the previous weights are retained, making the proxy-accuracy trajectory non-decreasing within each round. This is a proxy-evaluated property only and does not imply monotonic true test accuracy. If an explicit proxy is unavailable, the filter can be disabled and the server reduces to the freezing-only variant.
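The accept/reject rule around (13) is sketched below in Python. Two things are assumptions: the text does not define the Aggregate operator, so a sample-count-weighted average is used as a stand-in, and `proxy_acc` is a hypothetical callable that scores a model on the proxy set. Layers are short lists of floats for readability.

```python
def weighted_average(updates, weights):
    """Stand-in for Aggregate: sample-count-weighted mean of equal-length vectors."""
    total = float(sum(weights))
    dim = len(updates[0])
    return [sum(w * u[j] for u, w in zip(updates, weights)) / total
            for j in range(dim)]


def filter_layer(model, k, candidate, proxy_acc, delta=0.01):
    """Commit the candidate for layer k only if the proxy gain in Eq. (13) exceeds delta."""
    trial = list(model)
    trial[k] = candidate
    gain = proxy_acc(trial) - proxy_acc(model)
    if gain > delta:
        return trial, True      # commit
    return model, False         # reject: keep the previous weights


# Toy proxy: accuracy is highest when the single weight of layer 0 equals 1.0.
def proxy_acc(model):
    return 1.0 - abs(model[0][0] - 1.0) / 2.0


model = [[0.0], [0.0]]
good = weighted_average([[1.0], [0.0]], weights=[3, 1])   # moves toward 1.0
model_after_good, committed = filter_layer(model, 0, good, proxy_acc)
model_after_bad, committed_bad = filter_layer(model_after_good, 0, [-1.0], proxy_acc)
```

Because a rejected candidate leaves the model untouched, the proxy score can only stay level or rise as layers are processed. This property holds on the proxy set only.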

4.4. Complexity, Overhead, and Formal Guarantees

Under Algorithm 1, the peak SRAM footprint is bounded by $\max_{k} M_{k}$, independent of depth K and epoch count E, because only one layer and its auxiliary buffers are resident at any instant. Flash usage remains $O (| W |)$ (one snapshot, no swap region). For protocols that persist a full local recovery snapshot each round, the class-specific minimum protocol-level erase-block volume is $ϕ (| W |)$, which TinySLFL matches; this is not a lower bound for stateless or streaming-only clients. The cumulative protocol-level erase-block volume over the training horizon is

$$\mathrm{EO} (T) = T ϕ (| W |), \tag{14}$$

and the cumulative communication is $\sum_{t = 1}^{T} (1 - ρ (t)) | W |$, where $ρ (t)$ is the fraction of frozen layers at round t. Under a stable proxy set and $δ$ above proxy noise, Algorithm 2 is non-decreasing only on the proxy sequence; proxy–test mismatch can still reduce the true task accuracy. Overall, TinySLFL satisfies the SRAM constraint, minimizes protocol-induced persistence for snapshot-persisting clients, and empirically reduces the round budget in the evaluated Non-IID settings.
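The two cumulative costs above are simple to evaluate numerically. In the sketch below, the round count of 10, the 70 KB model size, and the frozen-fraction sequence are hypothetical inputs; the communication formula additionally treats the frozen fraction as a fraction of bytes, which is exact only when layers are of equal size.

```python
def cumulative_eo(rounds: int, model_bytes: int, block_bytes: int = 4096) -> int:
    """Eq. (14): one block-aligned full-model snapshot per round."""
    blocks_per_round = -(-model_bytes // block_bytes)  # ceiling division
    return rounds * blocks_per_round


def cumulative_comm_bytes(frozen_fraction_per_round, model_bytes: int) -> float:
    """Upload volume when a fraction rho(t) of the model is frozen in round t."""
    return sum((1.0 - rho) * model_bytes for rho in frozen_fraction_per_round)


# Hypothetical horizon: 10 rounds with a 70 KB model and 4 KB erase blocks.
eo_total = cumulative_eo(rounds=10, model_bytes=70 * 1024)

# Hypothetical freezing trajectory: nothing frozen, then half the model frozen.
comm_total = cumulative_comm_bytes([0.0, 0.5, 0.5], model_bytes=1000)
```

Freezing lowers the communication term but leaves the erase-block term unchanged, since the round-start snapshot is still written in full; only a smaller round budget T reduces (14).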

Algorithm 2 Server side: dynamic aggregation

Require: Round t; participating clients $Π_{t}$; global model $W^{(t - 1)}$; freezing threshold $ϵ$; accuracy threshold $δ$; frozen set $F_{t - 1}$; reactivation margin $γ$; minimum unfreeze rate $η_{min}$; best proxy accuracy $a_{t - 1}^{★}$
Ensure: Updated global model $W^{(t)}$; next-round learning rates $\{η_{t + 1}^{k}\}$; frozen set $F_{t}$

$W^{(t)} \leftarrow W^{(t - 1)}$
$F_{t} \leftarrow F_{t - 1}$
for $k = 1$ to K do
    if $k \in F_{t}$ then
        $η_{t + 1}^{k} \leftarrow 0$; continue ▹ skip frozen layer
    end if
    Compute $\overline{Δ L}_{t}^{k}$ via (11)–(12) ▹ Step 1: loss-aware freezing
    if $\overline{Δ L}_{t}^{k} \leq ϵ$ then
        $F_{t} \leftarrow F_{t} \cup \{k\}$; $η_{t + 1}^{k} \leftarrow 0$
        $W^{(t), k} \leftarrow W^{(t - 1), k}$
        if $k < K$ and $k + 1 \in F_{t}$ then
            $F_{t} \leftarrow F_{t} ∖ \{k + 1\}$; $η_{t + 1}^{k + 1} \leftarrow \max (η_{min}, η_{0} s_{k + 1} / (1 + α t))$ ▹ unfreeze next layer
        end if
    else
        Update $η_{t + 1}^{k}$ via (10)
        $W_{cand}^{k} \leftarrow \mathrm{Aggregate} (\{\hat{W}_{i}^{(t), k}\}_{i \in Π_{t}})$
        Form $W'$ by replacing layer k of current $W^{(t)}$; compute $Δ \mathrm{acc}_{t}^{k}$ via (13) ▹ Step 2: accuracy-guided filtering
        if $Δ \mathrm{acc}_{t}^{k} > δ$ then
            $W^{(t), k} \leftarrow W_{cand}^{k}$ ▹ commit
        else
            $W^{(t), k} \leftarrow W^{(t - 1), k}$ ▹ reject
        end if
    end if
end for
Compute round-end proxy accuracy $a_{t} = \mathrm{acc} (W^{(t)})$
if $a_{t} < a_{t - 1}^{★} - γ$ then
    Select $R_{t} \subseteq F_{t}$; $F_{t} \leftarrow F_{t} ∖ R_{t}$ ▹ recent or drift-sensitive layers
    For $r \in R_{t}$, set $η_{t + 1}^{r} \leftarrow \max (η_{min}, η_{0} s_{r} / (1 + α t))$ ▹ reactivate/unfreeze
end if
$a_{t}^{★} \leftarrow \max (a_{t - 1}^{★}, a_{t})$
if all layers remain frozen then
    return $W^{(t)}$ and terminate global rounds
end if
Broadcast $W^{(t)}$ and $\{η_{t + 1}^{k}\}$ to all clients

5. Results

5.1. Experimental Setup

Table 1 summarizes the full configuration. We evaluated TinySLFL on CIFAR-10 [21], SVHN [22], and FEMNIST [23]. CIFAR-10 and SVHN use Dirichlet splits [24] with $α = 0.1$ over $N = 20$ clients; FEMNIST uses writer partitions truncated to the same client count, with roughly 200–300 samples per client. Each client uses an 8:2 train/proxy split. The accuracy is reported on the standard CIFAR-10/SVHN test sets and on the held-out union of the selected FEMNIST writers.
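For readers unfamiliar with Dirichlet label skew, the sketch below shows one common construction: for every class, client shares are drawn from a symmetric Dirichlet distribution and the class samples are cut accordingly. Whether the experiments use exactly this per-class variant is an assumption; the function name, the seed handling, and the toy labels are illustrative. Only the standard library is used, with Dirichlet draws obtained by normalizing Gamma variates.

```python
import random


def dirichlet_label_split(labels, num_clients=20, alpha=0.1, seed=0):
    """Assign sample indices to clients with per-class Dirichlet(alpha) shares."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws) or 1.0
        # Convert cumulative shares into cut points over this class's samples.
        cuts, acc = [], 0.0
        for share in draws[:-1]:
            acc += share / total
            cuts.append(min(len(idxs), int(acc * len(idxs))))
        start = 0
        for c, end in enumerate(cuts + [len(idxs)]):
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Toy data: 1000 samples over 10 balanced classes.
toy_labels = [i % 10 for i in range(1000)]
parts = dirichlet_label_split(toy_labels, num_clients=20, alpha=0.1, seed=0)
```

With a concentration as small as 0.1, most of each class typically lands on a handful of clients, so many clients see only a few labels; larger values approach a uniform split.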

All methods use the same ResNet-8 backbone [25] ($\sim 70$ KB), pre-trained on Tiny-ImageNet [26], and only the federated fine-tuning stage was compared. The server-side proxy pool contained approximately 10,000 samples for CIFAR-10, 14,651 for SVHN, and 800–1200 for FEMNIST, and was used only by the accept/reject rule in Algorithm 2.

We compared against FedAvg [6], FedProx [8], and SLT [11]. Because full-network backpropagation exceeded the native 512 KB SRAM budget, FedAvg and FedProx were evaluated in a simulation under the TinyOps flash-swap model [16]; in the hardware study, FedAvg served only as a feasibility reference and failed with OOM. Accordingly, the experimental design had two levels: a simulation-level algorithm comparison among FedAvg, FedProx, SLT, and TinySLFL, and a real-device feasibility/wear/latency/energy comparison only among methods that can be deployed natively on the ESP32-S3. SLT is the direct layer-wise baseline that writes each layer back to flash after every step. Unless stated otherwise, all methods used identical client sampling, $E = 5$, an effective batch size of 32 (hardware batch 1 with 32-step accumulation), and the same initialization. The simulations were run on two NVIDIA (Santa Clara, CA, USA) V100 GPUs.

We report the final Top-1 accuracy, cumulative protocol-level erase-block operations (EOs) to a common target accuracy, and the projected lifetime from the measured hottest-block wear. The hardware study additionally reports the peak SRAM, per-round flash wear (FW/Round), latency, and energy. FedAvg and FedProx were therefore not interpreted as hardware fairness baselines; they remained simulation baselines, and FedAvg is shown in the hardware table only as a representative full-model OOM case on the target MCU. The simulation results used five seeds; the hardware resource metrics in Table 2 average three boards over five rounds, whereas the on-device accuracy in Table 3 was evaluated after ten FL rounds. The server thresholds were fixed to $ϵ = 0.01$, $δ = 0.01$, and $β = 0.9$.

Table 1. Summary of datasets, model, hyperparameters, and hardware configuration.

Item	Configuration
Datasets	CIFAR-10, SVHN, FEMNIST
Non-IID partition	CIFAR-10/SVHN: Dirichlet $α = 0.1$ ; FEMNIST: writer-identity partition; $N = 20$ clients
Backbone	ResNet-8 (∼70 KB), pre-trained on Tiny-ImageNet
Local epochs/batch size	$E = 5$ /effective batch 32 (hardware batch 1, accumulation 32 in both simulation and device runs)
Optimizer	SGD, momentum $0.9$ , $η_{0} = 0.01$ , layer-wise decay
Server thresholds	$ϵ = 0.01$ , $δ = 0.01$ , $β = 0.9$
Server proxy set	Held-out validation pool (20% of partitioned data): CIFAR-10 10,000; SVHN∼14,651; FEMNIST∼800–1200
MCU model	Espressif ESP32-S3; 512 KB SRAM; 8 MB NOR flash
Flash erase block	4 KB
Wear estimation	Server-side simulation with inferred erase-block tracing
Simulation seeds	5 random seeds

Table 2. Real-device ESP32-S3 results on CIFAR-10 under $α = 0.1$, averaged over three boards and five rounds. “OOM” denotes native deployment failure; FW/Round is the erase-count increment of the most heavily worn 4 KB block. The 18 value is the measured per-round hottest-block increment, not the cumulative simulated EO value 342 in Table 4. Energy/Round covers snapshot persistence, layer-wise training, flash updates, Wi-Fi upload, and idle gaps.

Method	Peak SRAM (KB)	FW/Round	Latency/Round (s)	Energy/Round (J)
FedAvg	OOM	—	—	—
SLT	310	558	235	$70.5$
TinySLFL	320	18	192	$57.6$

Table 3. On-device vs. simulation Top-1 accuracy (%) on CIFAR-10 under $α = 0.1$ after 10 FL rounds. The on-device value was averaged over three ESP32-S3 boards; the simulation value is the five-seed mean at the same round count.

Method	Simulation Acc. (%)	On-Device Acc. (%)
SLT	$36.52 \pm 0.73$	$36.38 \pm 0.85$
TinySLFL	$62.45 \pm 0.56$	$62.28 \pm 0.65$

Table 4. Estimated cumulative protocol-level erase-block operations (EOs) to a common target accuracy in server-side simulation. These values are cumulative simulation estimates to the target accuracy, not real-device measurements and not per-round wear counts. The targets were 65% for CIFAR-10, 85% for SVHN, and 70% for FEMNIST. ^† FedAvg and FedProx followed the TinyOps flash-swap model; the values are simulation estimates rather than on-device measurements. The best results are in bold.

Method	CIFAR-10	SVHN	FEMNIST
FedAvg ^†	90,031	93,782	90,031
FedProx ^†	82,529	61,896	31,886
SLT	24,552	23,436	12,834
TinySLFL	342	486	288

5.2. Accuracy and Wear-to-Target Comparison

Figure 2 shows the mean accuracy trajectories; Table 5 reports the final accuracy and Table 4 the EO to target. The target for each dataset was set as a conservative common accuracy level reached by all methods within the 100-round horizon. TinySLFL achieved the best final accuracy on all three benchmarks and cut the EO to target relative to SLT from 24,552 to 342 on CIFAR-10, from 23,436 to 486 on SVHN, and from 12,834 to 288 on FEMNIST (97.8–98.6% reduction). These are cumulative simulated protocol-level volumes and are not comparable with the per-round hottest-block increments in the hardware tables. Concretely, the simulated EO of 342 for CIFAR-10 is the total protocol-level block-operation count accumulated across all $T$ rounds required to reach the 65% accuracy target, whereas the measured FW/Round of 18 reported in Section 5.3 is the per-round hottest-block erase increment observed on real hardware. The two metrics operate at different granularities (cumulative vs. per-round) and different abstraction levels (protocol-level surrogate vs. physical erase count) and should not be compared directly.
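The 97.8–98.6% range can be checked directly from the Table 4 entries. The following is a small sketch of that arithmetic; the dictionary layout and the helper name `reduction_pct` are illustrative choices, not part of the paper's tooling.

```python
# Cumulative simulated EO to target, copied from Table 4.
EO_TO_TARGET = {
    "CIFAR-10": {"SLT": 24_552, "TinySLFL": 342},
    "SVHN": {"SLT": 23_436, "TinySLFL": 486},
    "FEMNIST": {"SLT": 12_834, "TinySLFL": 288},
}


def reduction_pct(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - proposed / baseline)


reductions = {
    name: round(reduction_pct(v["SLT"], v["TinySLFL"]), 1)
    for name, v in EO_TO_TARGET.items()
}
print(reductions)  # {'CIFAR-10': 98.6, 'SVHN': 97.9, 'FEMNIST': 97.8}
```

The smallest reduction comes from FEMNIST and the largest from CIFAR-10, which gives the reported range.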

Device Lifetime Projection

Table 6 converts the measured FW/Round into the projected lifetime under light (1 round/day) and frequent (20 rounds/day) deployment. Here, the peak FW/Round denotes the per-round erase-count increment of the most heavily worn 4 KB flash block on the ESP32-S3: SLT incurred 558 erases per round, whereas TinySLFL incurred only 18. These are physical erase counts measured on real hardware, distinct from the cumulative simulated protocol-level EO in Table 4. Under the nominal $10^{5}$-cycle budget, the projected TinySLFL lifetime exceeded 15 years under light deployment and 9 months under frequent deployment, whereas SLT lasted about 179 days and 9 days, respectively. This projection is tied to the measured hottest-block wear rather than inferred from the EO.
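The lifetime figures follow from dividing the nominal cycle budget by the daily hottest-block wear. The sketch below reproduces that arithmetic under simplifying assumptions that are ours: a constant number of rounds per day, a constant FW/Round, and a 365.25-day year. The function name `lifetime_days` is hypothetical.

```python
ENDURANCE_CYCLES = 100_000  # nominal 10^5 P/E budget for one block

FW_PER_ROUND = {"SLT": 558, "TinySLFL": 18}    # measured hottest-block increments
ROUNDS_PER_DAY = {"light": 1, "frequent": 20}  # deployment regimes from Table 6


def lifetime_days(fw_per_round, rounds_per_day, budget=ENDURANCE_CYCLES):
    """Days until the hottest block reaches the nominal cycle budget."""
    return budget / (fw_per_round * rounds_per_day)


for method, fw in FW_PER_ROUND.items():
    for regime, rate in ROUNDS_PER_DAY.items():
        days = lifetime_days(fw, rate)
        print(f"{method:9s} {regime:9s} {days:8.1f} days ({days / 365.25:.2f} years)")
```

Under these assumptions SLT lasts roughly 179 days (light) and 9 days (frequent), while TinySLFL lasts roughly 5556 days (about 15.2 years) and 278 days (about 9.1 months).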

5.3. Real-Device Validation on ESP32-S3

We validated TinySLFL on three ESP32-S3 [27] boards (Figure 3) running the client firmware with LittleFS [19] on 8 MB of SPI NOR flash and communicating with a Python 3.10 aggregation server over Wi-Fi. The per-round energy was measured from the supply rail with a Nordic PPK2 [28] at 100 kHz. The hardware experiment validated the native deployability, measured storage wear, latency, energy, and on-device model quality. No training steps, memory-management mechanisms, or protocol functionality were omitted in the firmware implementation; the 10 KB SRAM increase from SLT to TinySLFL comes from the delayed-aggregation staging buffer. After completing ten FL rounds on each of the three ESP32-S3 boards, the trained model weights were exported from flash and evaluated on the standard CIFAR-10 test set at the server side. Table 3 compares the resulting on-device accuracy with the five-seed simulation mean at the same round count. Because the simulation and on-device runs share identical hyperparameters (batch-1 with 32-step accumulation, SGD momentum 0.9, $η_{0} = 0.01$, same layer-wise decay) and the layer-wise delayed-aggregation protocol follows a deterministic execution order, the two accuracy values were expected to be close. TinySLFL achieved 62.28% on-device versus 62.45% in the simulation, a gap of only 0.17 percentage points; SLT showed a similarly small gap of 0.14 percentage points (36.38% vs. 36.52%). These differences are attributable to minor numerical discrepancies between the single-precision floating-point implementation on the Xtensa LX7 core and the server-side GPU, together with board-to-board run variance. The close agreement supports consistency between the server-side simulation and on-device training under the evaluated ten-round setting.

Flash management uses a snapshot-and-stream layout: one LittleFS global-model snapshot is written at the round start; per-layer updates are trained in SRAM and uploaded without intermediate checkpoints. Consequently, model-data blocks are erased and written only once per round under TinySLFL’s delayed-aggregation protocol; LittleFS metadata blocks (directory entries, allocation bitmaps) may incur additional erases, which are captured in the FW/Round measurement. A firmware tracer records 4 KB erase counts; FW/Round is the hottest-block increment, so repeated metadata/journaling erases are captured empirically. LittleFS provides block-level wear leveling, but does not guarantee page-level uniformity, so the hottest-block metric is a conservative characterization and the lifetime projections in Table 6 are lower-bound estimates.
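As an illustration of the hottest-block metric, the sketch below derives FW/Round from two snapshots of per-block erase counters taken before and after a round. This is an offline stand-in written in Python, not the firmware tracer itself; the function name `fw_per_round`, the block indices, and the counter values are all hypothetical.

```python
def fw_per_round(before, after):
    """Return (block, increment) for the largest per-block erase-count increase.

    `before` and `after` map a 4 KB block index to its cumulative erase count.
    Blocks missing from `before` are treated as never erased.
    """
    deltas = {blk: cnt - before.get(blk, 0) for blk, cnt in after.items()}
    hottest = max(deltas, key=deltas.get)
    return hottest, deltas[hottest]


# Hypothetical counters: block 2 stands in for a frequently rewritten metadata
# block, blocks 10-12 for model-data blocks erased once in the round.
start = {2: 40, 10: 5, 11: 5, 12: 5}
end = {2: 47, 10: 6, 11: 6, 12: 6}
block, increment = fw_per_round(start, end)
print(block, increment)  # 2 7
```

Taking the maximum over blocks rather than the mean is what makes the metric conservative: a single hot block determines the projected lifetime even when most blocks are erased only once.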

Table 2 confirms native deployability: TinySLFL peaked at 320 KB SRAM (the 10 KB increase over SLT came from the delayed-aggregation staging buffer), whereas FedAvg failed with OOM. Relative to SLT, TinySLFL cut the hottest-block wear from 558 to 18 erase counts per round (96.8%), the latency from ∼235 s to ∼192 s, and the energy from ∼70.5 J to ∼57.6 J. In watt-hours, this corresponds to 0.0160 Wh per round for TinySLFL versus 0.0196 Wh for SLT.

To understand the energy budget in more detail, we segmented the PPK2 current trace into four phases based on distinct current profiles: (i) snapshot persistence (flash write of the global model at the round start, characterized by short high-current pulses of ∼120 mA), (ii) local training (sustained computation at ∼85 mA), (iii) Wi-Fi upload (radio-active bursts at ∼180 mA), and (iv) idle/protocol gaps (∼20 mA). Table 7 reports the per-phase energy for both SLT and TinySLFL. Local training dominated the energy budget at 72.2% (41.6 J out of 57.6 J) for TinySLFL, followed by Wi-Fi communication at 21.7% (12.5 J). The snapshot-persistence phase accounted for the largest difference between the two methods: SLT consumed 12.1 J on flash writes per round due to its layer-wise write-back after every layer, whereas TinySLFL required only 2.8 J (a single full-model snapshot), a 76.9% reduction. The remaining phases were comparable: local training (41.6 J for TinySLFL vs. 45.2 J for SLT), Wi-Fi upload (12.5 J each), and idle (0.7 J each). This confirms that TinySLFL’s end-to-end energy saving of 18.3% (57.6 J vs. 70.5 J) is driven primarily by a reduced flash-write overhead. We additionally measured the inference-only energy by loading the trained model and performing a forward pass on a 1000-sample held-out test subset: both methods consumed 48.5 mJ per sample, confirming that inference energy is identical across protocols and negligible (∼0.084%) relative to one FL training round.
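The percentages quoted for the energy breakdown can be recomputed from the Table 7 entries. The sketch below does so; the variable names are illustrative, and the only inputs are the published per-phase values.

```python
# Per-phase energy in joules, copied from Table 7.
PHASES = {
    "SLT": {"snapshot": 12.1, "training": 45.2, "wifi": 12.5, "idle": 0.7},
    "TinySLFL": {"snapshot": 2.8, "training": 41.6, "wifi": 12.5, "idle": 0.7},
}
INFERENCE_MJ_PER_SAMPLE = 48.5

# Per-round totals and TinySLFL's per-phase share of its own total.
totals = {m: round(sum(p.values()), 1) for m, p in PHASES.items()}
shares = {
    phase: round(100 * joules / totals["TinySLFL"], 1)
    for phase, joules in PHASES["TinySLFL"].items()
}

# Relative savings of TinySLFL over SLT.
snapshot_cut = round(
    100 * (1 - PHASES["TinySLFL"]["snapshot"] / PHASES["SLT"]["snapshot"]), 1
)
total_cut = round(100 * (1 - totals["TinySLFL"] / totals["SLT"]), 1)

# One inference (mJ converted to J) as a percentage of one TinySLFL round.
inference_share = round(
    100 * (INFERENCE_MJ_PER_SAMPLE / 1000) / totals["TinySLFL"], 3
)

print(totals, shares, snapshot_cut, total_cut, inference_share)
```

Of the 12.9 J saved per round, 9.3 J comes from the snapshot phase and 3.6 J from local training, which is consistent with flash writes being the primary driver.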

5.4. Ablation Study and Sensitivity Analysis

5.4.1. Ablation Study

We studied three CIFAR-10 variants under $α = 0.1$: TinySLFL without loss-sensing freezing, TinySLFL without accuracy-guided filtering, and the full method.

Figure 4 shows that loss-sensing freezing mainly improves the convergence speed, whereas accuracy-guided filtering mainly improves the late-stage stability and final accuracy. The two components are therefore complementary.

5.4.2. Freeze/Unfreeze Dynamics

To verify that the reactivation mechanism in Algorithm 2 (Lines 26–29) is not merely theoretical, we recorded the freeze and unfreeze events across 100 CIFAR-10 rounds ($α = 0.1$) for the full TinySLFL configuration. Over the eight ResNet-8 layers, a total of 26 freeze events and 19 unfreeze (reactivation) events were triggered. Early layers (conv1–conv3) were frozen first and accounted for 46% of the freeze events; 14 of the 19 unfreeze events targeted these same early layers when the proxy accuracy dropped by more than $γ$, suggesting that the mechanism responds to Non-IID drift rather than remaining dormant. On average, each layer experienced 2.3 freeze–unfreeze cycles over the 100-round horizon. The unfreeze events clustered around rounds 65–80, coinciding with the late-phase accuracy recovery visible in Figure 4, where the full method pulls ahead of the “w/o Filtering” variant.
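The freeze and reactivation events counted above can be pictured as a small per-layer state machine. The sketch below is a simplified illustration and not a transcription of Algorithm 2: it assumes that a layer is frozen when its loss reduction falls below a threshold and is reactivated when the proxy accuracy drops by more than a margin below the value recorded at freeze time. The names `LayerState`, `update_layer`, `freeze_eps`, and `gamma`, as well as the threshold values and the trace, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerState:
    frozen: bool = False
    acc_at_freeze: float = 0.0  # proxy accuracy recorded when the layer froze


def update_layer(state, loss_reduction, proxy_acc, freeze_eps, gamma):
    """Return 'freeze', 'unfreeze', or 'keep' and update `state` in place."""
    if not state.frozen:
        if loss_reduction < freeze_eps:  # layer has stopped improving
            state.frozen = True
            state.acc_at_freeze = proxy_acc
            return "freeze"
        return "keep"
    if proxy_acc < state.acc_at_freeze - gamma:  # proxy accuracy dropped
        state.frozen = False
        return "unfreeze"
    return "keep"


# Toy trace for one layer: (loss reduction, proxy accuracy) per round.
trace = [(0.050, 0.60), (0.004, 0.64), (0.000, 0.65), (0.000, 0.61), (0.030, 0.63)]
layer = LayerState()
events = [
    update_layer(layer, dl, acc, freeze_eps=0.01, gamma=0.02) for dl, acc in trace
]
print(events)  # ['keep', 'freeze', 'keep', 'unfreeze', 'keep']
```

In this toy trace the layer freezes once its loss reduction stalls and is reactivated two rounds later when the proxy accuracy falls from 0.64 to 0.61, mirroring the qualitative behavior described for the early layers.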

5.4.3. Sensitivity Analysis

Table 8 reports the five-seed mean CIFAR-10 accuracy after 50 rounds over a $3 \times 3$ grid of the server hyperparameters $δ$ and $β$. The setting $(δ, β) = (0.01, 0.90)$ performed best and was used in the main experiments; nearby settings degraded smoothly rather than collapsing.

6. Discussion

TinySLFL changes the operating point of deployable on-device FL on MCUs by treating flash endurance as a first-class design constraint alongside accuracy and memory. The strongest hardware claim was comparative among natively deployable ESP32-S3 methods (SLT and TinySLFL); FedAvg and FedProx remain simulation baselines, and FedAvg was shown only as a representative full-model OOM case.

Four limitations should be noted. First, selective aggregation depends on the proxy set; severe proxy–target mismatch can reduce the acceptance rule’s fidelity. Second, the current design assumes that a full global model can be persisted once per round; very large models may require model compression or partitioned storage. Third, the evaluation focused on image classification; audio and sensor-stream tasks may exhibit different freezing dynamics. Fourth, broader seed sweeps would tighten confidence for the smallest inter-method accuracy gaps. EOs remain a protocol-level surrogate; the physical lifetime also depends on file-system allocation, journaling, wear leveling, and storage layout. Reactivation is proxy-triggered and may miss unseen drift; a formal finite-sample variance bound on proxy estimation error remains an open problem and is left to future work. Hardware validation was limited to ESP32-S3 with LittleFS. The on-device accuracy reported in Table 3 covers only ten rounds under a single Non-IID setting; longer-horizon on-device accuracy curves would further strengthen the validation. Nevertheless, the ablation in Figure 4 provides indirect evidence that the reactivation mechanism contributes to robustness: the full configuration showed a more stable late-phase trajectory under $α = 0.1$ Non-IID drift, whereas the “w/o Filtering” variant showed oscillations and a lower plateau, suggesting that frozen layers can become stale without a recovery path. The per-phase energy breakdown in Table 7 shows that local training dominated the round budget (72.2%) and that TinySLFL’s energy advantage over SLT stemmed almost entirely from the 76.9% reduction in snapshot-persistence energy (2.8 J vs. 12.1 J). TinySLFL’s 0.0160 Wh round cost gives an idealized budget of about 230 rounds from a 3.7 V, 1000 mAh cell before conversion losses, while the inference energy (48.5 mJ per sample) is negligible relative to training. Higher round rates or energy harvesting therefore require duty-cycled training. The actual deployment energy would also depend on the Wi-Fi link quality, payload size, and retransmission overhead, none of which were captured by the single-AP laboratory setup used here.

These limitations define the correct scope: TinySLFL is a practical training protocol for commodity MCU deployments in which whole-network backpropagation is SRAM-infeasible and repeated layer write-back is lifetime-prohibitive.
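The idealized battery budget quoted in this section can be reproduced with a short calculation. The sketch below assumes a lossless 3.7 V, 1000 mAh cell and uses the per-round totals from Table 7; the function name `rounds_per_charge` is hypothetical, and the SLT figure is derived here by the same arithmetic rather than reported in the paper.

```python
JOULES_PER_WH = 3600.0


def rounds_per_charge(energy_per_round_j, cell_voltage_v=3.7, capacity_mah=1000.0):
    """Idealized number of FL rounds per full cell, ignoring conversion losses."""
    cell_wh = cell_voltage_v * capacity_mah / 1000.0
    round_wh = energy_per_round_j / JOULES_PER_WH
    return cell_wh / round_wh


tiny_rounds = rounds_per_charge(57.6)  # TinySLFL total per round (Table 7)
slt_rounds = rounds_per_charge(70.5)   # SLT total per round (Table 7)
print(f"TinySLFL: {tiny_rounds:.0f} rounds, SLT: {slt_rounds:.0f} rounds")
```

This gives about 231 rounds for TinySLFL and about 189 for SLT; regulator losses, battery aging, and Wi-Fi retransmissions would lower both figures in practice.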

7. Conclusions

This paper presented TinySLFL, a hardware-aware federated learning framework that jointly optimizes model accuracy, SRAM feasibility, and flash endurance for resource-constrained MCUs through client-side layer-wise delayed aggregation and server-side dynamic aggregation. Under a Dirichlet $α = 0.1$ label skew (CIFAR-10, SVHN) and native writer heterogeneity (FEMNIST), TinySLFL attained the highest mean final accuracy in a five-seed server-side simulation, reduced the cumulative protocol-level EO to target by 97.8–98.6% relative to SLT, cut the measured hottest-block wear from 558 to 18 erase counts per round on the ESP32-S3, and achieved a lower latency and energy than SLT.

The method also has clear limitations: EOs are a protocol-level surrogate rather than a complete physical-wear model, proxy-set mismatch can affect selective aggregation and reactivation decisions, the unfreezing rule is heuristic rather than globally optimal, the hardware validation covers only one MCU family and storage stack, and the on-device accuracy evaluation was limited to a short training horizon. Future work will extend the framework in four directions: integrating model compression with delayed aggregation for larger backbones, replacing the static proxy set with adaptive validation streams, learning the reactivation margin and unfreeze schedule automatically under distribution drift, and broadening the hardware study to additional MCU families and storage stacks.

Author Contributions

Conceptualization, Y.T. and J.J.; methodology, Y.T. and J.J.; software, Y.T.; validation, Y.T., J.J. and T.D.; formal analysis, Y.T.; investigation, Y.T.; resources, J.J. and T.D.; data curation, Y.T.; writing—original draft preparation, Y.T.; writing—review and editing, J.J. and T.D.; visualization, Y.T.; supervision, J.J. and T.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Collaborative Innovation Center of Novel Software Technology and Industrialization, the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), and the Suzhou Frontier Science and Technology Program (project SYG202310).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code, configuration files, and processed experimental records supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lin, J.; Zhu, L.; Chen, W.-M.; Wang, W.-C.; Han, S. Tiny Machine Learning: Progress and Futures. IEEE Circuits Syst. Mag. 2023, 23, 8–34.
2. Zhu, S.; Voigt, T.; Rahimian, F.; Ko, J. On-Device Training: A First Overview on Existing Systems. ACM Trans. Sens. Netw. 2024, 20, 1–39.
3. Lin, J.; Chen, W.-M.; Cai, H.; Gan, C.; Han, S. MCUNetV2: Memory-Efficient Patch-Based Inference for Tiny Deep Learning. Adv. Neural Inf. Process. Syst. 2021, 34, 2346–2358.
4. Boboila, S.; Desnoyers, P. Write Endurance in Flash Drives: Measurements and Analysis. In Proceedings of the 8th USENIX Conference on File and Storage Technologies, San Jose, CA, USA, 23–26 February 2010; pp. 115–128.
5. Boukhobza, J.; Olivier, P.; Lim, W.S.; Chen, L.-C.; Hsieh, Y.-S.; Wu, S.-T.; Ho, C.-C.; Huang, P.-C.; Chang, Y.-H. A Survey on Flash-Memory Storage Systems: A Host-Side Perspective. ACM Trans. Storage 2025, 21, 1–59.
6. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Artificial Intelligence and Statistics; PMLR: Cambridge, MA, USA, 2017; pp. 1273–1282.
7. Li, Q.; Diao, Y.; Chen, Q.; He, B. Federated Learning on Non-IID Data Silos: An Experimental Study. In Proceedings of the IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 965–978.
8. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
9. Llisterri Giménez, N.; Monfort Grau, M.; Pueyo Centelles, R.; Freitag, F. On-Device Training of Machine Learning Models on Microcontrollers with Federated Learning. Electronics 2022, 11, 573.
10. Sha, X.; Sun, W.; Liu, X.; Luo, Y.; Luo, C. Enhancing Edge-Assisted Federated Learning with Asynchronous Aggregation and Cluster Pairing. Electronics 2024, 13, 2135.
11. Pfeiffer, K.; Khalili, R.; Henkel, J. Aggregating Capacity in FL through Successive Layer Training for Computationally-Constrained Devices. Adv. Neural Inf. Process. Syst. 2023, 36, 35386–35402.
12. Banbury, C.; Reddi, V.J.; Torelli, P.; Holleman, J.; Jeffries, N.; Király, C.; Montino, P.; Kanter, D.; Ahmed, S.; Pau, D.; et al. MLPerf Tiny Benchmark. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 1.
13. Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P.I.-J.; Srinivasan, V.; Gopalakrishnan, K. PACT: Parameterized Clipping Activation for Quantized Neural Networks. In Proceedings of the Sixth International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
14. Gruslys, A.; Munos, R.; Danihelka, I.; Lanctot, M.; Graves, A. Memory-Efficient Backpropagation Through Time. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016); Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 4125–4133.
15. Cai, H.; Gan, C.; Zhu, L.; Han, S. TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 11285–11297.
16. Lin, J.; Zhu, L.; Chen, W.-M.; Wang, W.-C.; Cai, H.; Shi, L.; Han, S. On-Device Training Under 256KB Memory. Adv. Neural Inf. Process. Syst. 2022, 35, 22169–22183.
17. Kopparapu, K.; Lin, E.; Breslin, J.G.; Sudharsan, B. TinyFedTL: Federated Transfer Learning on Ubiquitous Tiny IoT Devices. In 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and Other Affiliated Events (PerCom Workshops); IEEE: New York, NY, USA, 2022; pp. 79–81.
18. Ren, H.; Li, X.; Anicic, D.; Runkler, T.A. TinyMetaFed: Efficient Federated Meta-Learning for TinyML. In Machine Learning and Principles and Practice of Knowledge Discovery in Databases, International Workshops of ECML PKDD 2023; Communications in Computer and Information Science, Volume 2136; Springer: Cham, Switzerland, 2024.
19. Geerts, C. LittleFS—A Little Fail-Safe Filesystem Designed for Microcontrollers. GitHub Repository. 2017. Available online: https://github.com/littlefs-project/littlefs (accessed on 14 April 2026).
20. Buck, A.; Ganesan, K.; Enright Jerger, N. FlipBit: Approximate Flash Memory for IoT Devices. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA); IEEE: New York, NY, USA, 2024; pp. 876–890.
21. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
22. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning; Curran Associates, Inc.: Red Hook, NY, USA; Granada, Spain, 2011.
23. Caldas, S.; Duddu, S.M.K.; Wu, P.; Li, T.; Konečný, J.; McMahan, H.B.; Smith, V.; Talwalkar, A. LEAF: A Benchmark for Federated Settings. In Workshop on Federated Learning for Data Privacy and Confidentiality (NeurIPS 2019); Curran Associates, Inc.: Red Hook, NY, USA; Vancouver, BC, Canada, 2019.
24. Hsu, T.-M.H.; Qi, H.; Brown, M. Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification. In Workshop on Federated Learning for Data Privacy and Confidentiality (NeurIPS 2019); Curran Associates, Inc.: Red Hook, NY, USA; Vancouver, BC, Canada, 2019.
25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
26. Le, Y.; Yang, X.S. Tiny ImageNet Visual Recognition Challenge; Stanford University CS231N Course Report; Stanford University: Stanford, CA, USA, 2015; Available online: https://cs231n.stanford.edu/2015/project.html (accessed on 15 April 2026).
27. Espressif Systems. ESP32-S3 Technical Reference Manual, version 1.4; Espressif Systems: Shanghai, China, 2023; Available online: https://www.espressif.com/sites/default/files/documentation/esp32-s3_technical_reference_manual_en.pdf (accessed on 14 April 2026).
28. Nordic Semiconductor. Power Profiler Kit II (PPK2) User Guide; Nordic Semiconductor: Trondheim, Norway, 2021; Available online: https://docs.nordicsemi.com/bundle/ug_ppk2/page/UG/ppk/PPK_user_guide_Intro.html (accessed on 14 April 2026).

Figure 1. Overview of TinySLFL. (Left) Clients perform layer-wise local training, keep intermediate updates in SRAM, and stream each completed layer to the server without intermediate flash writes. (Right) The server applies layer-wise learning-rate scheduling, loss-aware freezing, and accuracy-guided filtering before committing the aggregated layer to the global model. Δacc is the proxy-set accuracy gain for accept/reject decisions, and ΔF or ΔL is the layer-wise loss reduction for freezing. Rejected candidates keep previous weights; frozen layers are skipped until reactivated.

Figure 2. Mean Top-1 accuracy (%) over communication rounds, averaged over five simulation seeds. The y axes in all three panels report the percent.

Figure 3. ESP32-S3 hardware validation setup. Panel (a) shows the inline PPK2-to-ESP32-S3 power-measurement topology, including 3.3 V/current sensing and the measured FL round operations. Panel (b) shows the federated deployment topology with three ESP32-S3 clients connected through a Wi-Fi AP to a Python aggregation server.

Figure 4. Ablation of the server-side aggregation design on CIFAR-10 ($α = 0.1$), reported as the five-seed mean.

Table 5. Final Top-1 accuracy (%) from server-side simulation (mean ± standard deviation over five seeds). FedAvg and FedProx follow the flash-swap model of Lin et al. [16]. Best results are in bold.

Method	CIFAR-10	SVHN	FEMNIST
FedAvg	68.41 ± 0.23	90.83 ± 0.16	71.68 ± 0.23
FedProx	69.80 ± 0.31	91.22 ± 0.25	80.65 ± 0.26
SLT	71.31 ± 0.28	90.31 ± 0.17	80.54 ± 0.35
TinySLFL	76.55 ± 0.18	92.23 ± 0.21	81.82 ± 0.38

Table 6. Projected ESP32-S3 lifetime from measured FW/Round on CIFAR-10 under $α = 0.1$ with a nominal $10^{5}$-cycle budget. The peak FW/Round values 558 and 18 are the measured per-round erase-count increments of the most heavily worn 4 KB flash block on the ESP32-S3 and were used only for the device lifetime projection.

Method	Peak FW/Round	Lifetime (Light)	Lifetime (Frequent)
FedAvg	OOM	—	—
SLT	558	$\sim 179$ days	$\sim 9$ days
TinySLFL	18	>15 years	>9 months

Table 7. Per-phase energy breakdown (J) of a single FL round on CIFAR-10 ($α = 0.1$) measured via PPK2 trace segmentation, averaged over three ESP32-S3 boards. Inference energy is measured separately over a 1000-sample held-out test subset.

Method	Snapshot (J)	Training (J)	Wi-Fi (J)	Idle (J)	Total/Round (J)	Inference/Sample (mJ)
SLT	$12.1$	$45.2$	$12.5$	$0.7$	$70.5$	$48.5$
TinySLFL	$2.8$	$41.6$	$12.5$	$0.7$	$57.6$	$48.5$

Table 8. Sensitivity of CIFAR-10 Top-1 accuracy (%) to $(δ, β)$ under $α = 0.1$. The bold cell is the setting used in the main experiments. Values are five-seed means measured after 50 communication rounds.

$δ ∖ β$	0.70	0.90	0.95
$0.005$	72.86	74.09	72.99
$0.010$	74.66	76.55	73.17
$0.020$	73.78	75.11	74.82

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tao, Y.; Jia, J.; Deng, T. TinySLFL: A Flash-Endurance-Aware Federated Edge Learning Framework with Layer-Wise Delayed Aggregation for Resource-Constrained Microcontrollers. Electronics 2026, 15, 2084. https://doi.org/10.3390/electronics15102084
