1. Introduction
Battery-powered embedded systems are increasingly popular in edge applications such as wearables, sensor nodes, drones, and other IoT devices. However, their small form factor limits battery capacity, compute resources, and heat dissipation. At the same time, the demand for multimodal sensing, improved privacy, and lower latency pushes more processing to the edge [1,2,3,4], thereby increasing on-device compute demand and, in turn, energy consumption. Meanwhile, battery energy density has not improved as quickly as electronics, and energy harvesting is not yet sufficient to fully compensate for consumption [5,6]. As a result, many devices still require frequent charging or battery replacement, which increases maintenance cost and environmental impact [7,8]. Furthermore, the rapid evolution of edge applications requires platforms with reconfigurability and field-update capability [9], along with shorter development cycles. These growing and rapidly evolving requirements demand not only higher computational power and parallelism, but also system flexibility (reconfigurability, adaptability, and scalability), together with reduced time-to-market. Hence, the core challenge in power-sensitive designs is to deliver that computational performance and flexibility within tight energy budgets [9,10].
Application-specific integrated circuits (ASICs) and low-power microcontrollers (MCUs) have traditionally been used by developers to meet the energy constraints of power-sensitive devices. ASICs are developed and optimized for specific applications and therefore typically achieve the best performance and power efficiency. However, they cannot be reconfigured after deployment and usually come with higher initial development cost and longer development cycles [11]. Low-power MCUs offer low energy consumption and the advantage of re-programmability to accommodate evolving requirements, but they generally lack a high level of parallelism or high computational throughput. This can constrain their performance in edge applications that increasingly incorporate complex AI, multimodal sensors, and a need to execute computation locally for privacy and security reasons. Conversely, general-purpose graphics-processing units (GPUs) and general-purpose processors (GPPs) provide high compute capability, but their consumption often exceeds the power budget of small battery-powered devices. While MCUs and ASICs offer many advantages, including power efficiency, their above-mentioned limitations can make them a poor fit for edge devices that simultaneously require higher parallelism and computational power, reconfigurability, adaptability, and quicker time to market.
FPGAs offer a middle ground by providing the required computational power and reconfigurability, with higher performance than typical MCUs and digital signal processors, better energy efficiency than GPUs and GPPs, and greater flexibility than ASICs [12]. However, they generally lag low-power MCUs and ASICs in energy efficiency, so optimization techniques are essential when FPGAs are chosen for power-sensitive applications that need their compute capability and flexibility [13]. Their power consumption comprises static leakage, dynamic switching, and start-up and configuration overheads [14]. SRAM-based FPGAs store their configuration in large arrays of six-transistor cells that must remain powered, and the configuration is commonly reloaded at each boot, increasing standby and start-up consumption. By contrast, flash-based FPGAs embed the configuration in non-volatile floating-gate cells and retain it when power is removed [15]. This architecture enables near-instant-on behavior and reduces boot overheads and standby energy [16].
Microchip’s flash-based families exploit this configuration retention further to offer the Flash*Freeze (F*F) mode, in which the FPGA fabric is powered down while configuration, register contents, and I/O states are retained [17]. The SmartFusion2 SoC FPGA family (Microchip Technology Inc., Chandler, AZ, USA) leverages [18] this capability to achieve ultra-low standby power [11]. SmartFusion2 is a heterogeneous SoC that combines an ARM Cortex-M3-based microcontroller subsystem (MSS) with a flash-based FPGA fabric, along with vendor-supported low-power (LP) modes [19]. The fabric can enter F*F to shut down fabric logic and exit from F*F through a preconfigured exit event. In parallel, the Cortex-M3 can enter sleep mode, where the core clock is gated and execution resumes on an interrupt [19,20]. Furthermore, during F*F operation the MSS can run from a selectable standby clock (1 MHz or 50 MHz) [21], which can further reduce idle power at the cost of latency.
The above features can provide orders-of-magnitude savings in reactive or periodic applications [22]. Such low-duty-cycle applications, often dominated by idle energy, can benefit greatly from these LP features, which can reduce standby power from tens of milliwatts to as low as 1.92 mW, with LP mode entry and exit transitions on the order of 100 µs [23]. Beyond LP modes, the heterogeneous SoC also enables multiple power-aware task partitioning options, ranging from MSS-only software to HW/SW co-design to fully hardware implementations in the fabric [18]. In this work, the SW design on the Cortex-M3 processor serves as a within-platform reference for the other partitioning choices available on the SmartFusion2 SoC; it is not intended to represent other, superior processor cores or MCU classes. The main clock can also be scaled to trade current against latency. Together, these options create a multi-dimensional design space that includes (i) HW/SW partitioning options, (ii) LP mode configuration choices, (iii) standby clock-frequency options, and (iv) the main clock frequency. These choices interact in non-trivial ways and involve energy-latency trade-offs.
In this work, we systematically explore this design space on a SmartFusion2 flash-based SoC FPGA using a sensor-driven heart-rate (HR) monitoring workload. We measured current consumption and latency across combinations of these configurations and analyzed how they affect energy consumption and latency, including their trade-offs. However, the HR-monitoring workload used in this study is lightweight and very-low-duty-cycle. In practical systems, heavier computation or different sensor characteristics (event rate, frame size, number of channels) can increase the effective duty cycle and may shift the energy-optimal operating point. To address this, we include a duty-cycle (activity-rate) scaling analysis. Based on these measurements and analyses, we derive practical guidelines for duty-cycle-aware selection of an energy-efficient operating point on a flash-based SoC FPGA under low-duty-cycle workloads.
The main contributions of this work, in the context of energy-efficient implementation of low-duty-cycle workloads on a flash-based SmartFusion2 SoC FPGA, are listed below:
A measurement-based analysis of various LP mode configurations across different implementations (SW, Co-design, and HW) and their impact on active and idle energy consumption.
An experimental evaluation of the energy and latency trade-offs of the best LP mode configurations under main clock-frequency scaling.
An event-rate (duty-cycle) scaling analysis that highlights how repeated LP transitions can shift the energy-optimal operating point across LP modes and clock settings.
Practical guidelines for selecting an energy-efficient operating point as a function of best LP mode, event rate and main clock frequency.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the equipment and methods, including platforms, LP mode configurations, architectural approaches for task partitioning, the HR-monitoring algorithm and its task partitioning, and the experimental setup. Section 4 presents the results, Section 5 discusses the results and provides guidelines, and Section 6 concludes the paper.
2. Related Work
Energy-efficient execution on resource-constrained wearable and edge devices remains an active research topic. Prior work commonly targets (i) energy-aware algorithm design [24,25,26,27,28], (ii) communication cost reduction [29,30,31,32], and (iii) task scheduling [33,34,35,36] to maximize time in low-power states. However, measurement-based evaluations of vendor-provided FPGA low-power features are less common, particularly for flash-based FPGAs [13]. Since this work focuses on the low-power features of flash-based FPGAs, we review studies that explicitly exploit FPGA-supported low-power features, particularly in flash-based FPGAs, as part of their energy-reduction strategy.
Wulf et al. [37] studied how to exploit the F*F capability of flash-based FPGAs in systems that run multiple periodic hardware tasks along with some aperiodic activity. Since F*F can only be applied to the entire FPGA fabric, they proposed a cluster scheduling algorithm that schedules the tasks under real-time constraints to lengthen idle windows, thereby extending F*F intervals relative to a baseline policy that enters F*F whenever the FPGA is idle. In follow-up work [38], they integrated the same scheduling concept into FreeRTOS and provided an OS-level interface that hides device-specific details while still prolonging F*F phases for multi-task workloads.
Roukhaimi et al. [
39] implemented neural-network inference in hardware on low-power FPGAs (Lattice iCE40 and Microsemi IGLOO devices) using different architectural options and compared those with corresponding software execution on STM32 low-power MCUs. They exploited the F*F feature of IGLOO FPGAs to reduce power consumption. They also characterized the MCU-FPGA communication overhead and showed that, with sufficiently fast interfaces, FPGA offload can reduce inference latency and energy. Finally, they proposed the integration of an ultra-low-power FPGA with a low-power MCU and an evaluation under very-low-duty-cycle operation, which closely aligns with the motivation of our study. Similarly, the authors of [
40] implemented their design on the ultra-low-power Lattice iCE40 FPGA and reported a reduction in overall power. To fit the model within the device’s constraints, they used a binary convolutional neural network, which reduces both resource usage and power consumption.
Beyond flash-based FPGA-centric studies, several researchers reduced energy by exploiting other FPGA-specific features at the module level. One common direction is dynamic partial reconfiguration (DPR), not available in SmartFusion2 flash-based FPGAs, for loading or swapping hardware blocks only when needed. This approach has been used, for example, to swap feature extraction pipelines in biomedical detection systems [
41]. It has also been used to execute deep-learning operators in staged fashion, under real-time constraints [
42].
Another widely used approach is application-specific clock gating, which lowers dynamic power by reducing unnecessary switching activity. Prior work applied this idea in modular ECG pipelines to gate off modules that were not required to be active [43]. Similarly, the technique has been used to activate only the required number of instances, e.g., multipliers [44]. Some studies also explored application-aware power modes that keep complex processing blocks in a lower-power state until required. For example, systems may switch between simple and complex processing paths based on context or event probability [45].
Overall, prior work shows that FPGA energy consumption can be reduced by leveraging low-power mechanisms such as F*F, DPR, and clock gating. However, most studies vary only one major dimension at a time, such as scheduling policies that extend low-power intervals or a specific accelerator design on a given device. In contrast, there is limited measurement-driven guidance on how multiple interacting design choices (LP modes, HW/SW partitioning, and clock scaling) jointly affect system-level energy and latency on flash-based SoC FPGAs. Our work addresses this gap by experimentally characterizing these trade-offs and providing practical, duty-cycle-aware guidance for selecting an energy-efficient operating point for low-duty-cycle workloads on flash-based SoC FPGAs.
4. Evaluation and Results
This section first presents the correctness of the signal acquisition from the sensor and of the HR-monitoring algorithm. Then, the energy efficiency of the different LP configurations and implementations is analyzed and reported. The analysis also covers the impact of frequency scaling versus LP mode configurations on both latency and energy efficiency. Together, these results provide a comparison of the different design choices in terms of energy consumption, latency, and duty cycle.
4.1. Verification of Used HR-Monitoring Algorithm
Although the goal of this work is to explore power-optimization strategies and analyze their impact on energy and latency in low-duty-cycle applications rather than to prove algorithm robustness, we nevertheless performed two functional sanity checks to confirm that the case-study algorithm behaves as intended.
In the first check, the algorithm’s functional correctness was evaluated using the PhysioNet BIDMC PPG & Respiration dataset [
55]. Since we performed our power-measurement experiments in a lab environment while measuring HR in a resting state, the BIDMC dataset (collected in the resting state) was well-suited for functional-accuracy checks. Moreover, it provides PPG signals sampled at 125 Hz (close to our 100 Hz sampling rate) and includes reference HR values sampled once per second, enabling a direct comparison with our computed HR. Each of the 53 PPG records in the dataset (eight minutes, ≈60,001 samples each) was trimmed to 59,985 samples, a whole number of our 31-sample frames, and we also re-tuned the beat acceptance window of our algorithm for the 125 Hz sampling rate. Since our HR algorithm requires 15–18 s to produce stable estimates due to its long averaging window, the first 15 s of results were excluded to ensure a fair comparison. The remaining HR estimates were then linearly interpolated onto the 1 s time grid of the reference HR, since our algorithm outputs one estimate every 0.744 s (93 samples), whereas the reference dataset provides values at 1 s intervals.
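The record preparation described above can be sketched as follows. This is a minimal illustration rather than the processing code used in the study: the function names are hypothetical, and we assume that each HR estimate is time-stamped at the end of its 93-sample block.

```python
import bisect

FS_HZ = 125            # BIDMC PPG sampling rate (from the text)
FRAME = 31             # samples per processing frame (from the text)
SAMPLES_PER_EST = 93   # samples per HR estimate, i.e. 3 frames (from the text)
WARMUP_S = 15          # initial results discarded (from the text)


def trimmed_length(n_samples, frame=FRAME):
    """Largest length <= n_samples that is a whole number of frames."""
    return (n_samples // frame) * frame


def estimate_times(n_samples):
    """Time stamps in seconds of the HR estimates.

    Assumption: each estimate is stamped at the end of its 93-sample block.
    """
    count = n_samples // SAMPLES_PER_EST
    return [(i + 1) * SAMPLES_PER_EST / FS_HZ for i in range(count)]


def interp_linear(x, xp, fp):
    """Linearly interpolate the points (xp, fp) at x; xp must be increasing.

    Outside the range of xp the nearest end value is returned.
    """
    if x <= xp[0]:
        return fp[0]
    if x >= xp[-1]:
        return fp[-1]
    hi = bisect.bisect_right(xp, x)  # first index whose xp value exceeds x
    lo = hi - 1
    weight = (x - xp[lo]) / (xp[hi] - xp[lo])
    return fp[lo] + weight * (fp[hi] - fp[lo])


def resample_to_1s_grid(times, hr):
    """Interpolate HR estimates onto whole seconds after the warm-up."""
    grid = list(range(WARMUP_S, int(times[-1]) + 1))
    return grid, [interp_linear(t, times, hr) for t in grid]


n = trimmed_length(60_001)
times = estimate_times(n)
print(n, len(times), times[0])
```

With these constants, a 60,001-sample record is trimmed to 59,985 samples (1935 frames), which yields 645 estimates spaced 0.744 s apart.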
Using all 53 BIDMC recordings, the algorithm achieved an overall mean absolute error (MAE) = 2.72 BPM, root mean square error (RMSE) = 6.74 BPM, and bias = +1.49 BPM, with 87% of estimates within ±3 BPM of the reference, as detailed in
Table 3. These numbers were inflated by a few outlier recordings that showed persistent offsets or irregular detections. After identifying and excluding five such recordings, while leaving the remaining 48 records unchanged, the overall performance improved to MAE = 1.61 BPM, RMSE = 2.52 BPM, and bias = +0.78 BPM, with 92% and 97% of estimates within ±3 BPM and ±5 BPM, respectively. These results demonstrate the functional correctness of the HR-monitoring case-study algorithm.
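For reference, the reported error metrics can be computed as in the sketch below. It is an illustration under stated assumptions: the function name is hypothetical, bias is taken as the mean of (estimate minus reference) so that a positive value means overestimation, and the sample values are invented for the example.

```python
import math


def hr_error_metrics(estimates, references, tolerances=(3.0, 5.0)):
    """Return MAE, RMSE, bias and the fraction of estimates within each tolerance.

    Assumption: error = estimate - reference, both in BPM on the same time grid.
    """
    errors = [e - r for e, r in zip(estimates, references)]
    if not errors:
        raise ValueError("at least one estimate/reference pair is required")
    n = len(errors)
    mae = sum(abs(d) for d in errors) / n
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    bias = sum(errors) / n
    within = {tol: sum(abs(d) <= tol for d in errors) / n for tol in tolerances}
    return {"mae": mae, "rmse": rmse, "bias": bias, "within": within}


# Invented values, for illustration only.
metrics = hr_error_metrics([61.0, 59.0, 64.0, 70.0], [60.0, 60.0, 60.0, 60.0])
print(metrics)
```

Because RMSE weights large errors quadratically, a few outlier recordings raise it much more than they raise MAE, which is consistent with the larger RMSE improvement after the five outliers are excluded.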
In the second check, we tested the functional correctness of our hardware with the HR algorithm deployed on it. The sensor was configured to acquire the PPG signal at 100 Hz (800 sps with averaging over 8 samples). After connecting the setup, a finger was placed on the sensor, and values were logged via a MATLAB (R2024b for Academic use) serial terminal over the UART interface. The logged signal, shown in
Figure 6a, shows that our DAQ system preserves the key PPG features needed for heartbeat detection. The system’s HR values, with immediate validation using a medical pulse oximeter, are shown in
Figure 6b,c. Additionally, for the FPGA-based accelerators, verification is shown via a ModelSim simulation in
Figure 6d for heartbeat detection. Together, these results confirm the functional correctness of our sensing and processing pipeline on hardware as well.
4.2. Annotating Current Waveform with Different Operating Phases
Before delving into the analysis of our results, we first explain how the measured current profile relates to different operating phases in one period.
Figure 7 presents a current profile (edited for illustrative purposes) from the SW-Impl, where both F*F and sleep modes are applied during the idle phase with a standby clock of 1 MHz. The HW-Impl and Co-Impl follow a similar current profile.
LP entry phase: This phase begins when the processor initiates an F*F request. The System-Controller first switches the MSS clock from the main clock (also called user clock) to the standby clock. It then powers down the fabric PLL/CCC and places the fabric into F*F. After F*F entry is completed, the processor enters sleep mode.
LP exit phase: This phase starts upon the sensor’s interrupt. The processor wakes and initially runs on the standby clock while the System-Controller powers up the clocking resources and waits for the fabric PLL and MPLL to lock. Once both are locked, the System-Controller switches the MSS clock from the standby clock back to the main clock. Until this switch is complete, the processor keeps waiting for the lock status.
HR processing: Once the clock has been switched, the processor executes the HR processing algorithm, which is the useful work of the cycle. These three phases are collectively referred to as the total-active phase (or period).
Idle/LP phase: The idle phase corresponds to the period where no activity is required to be performed, except for waiting for the next event. All evaluated LP configurations are applied during this phase. We also refer to the idle phase as the LP-phase when an LP configuration is active.
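Therefore, the total power over one cycle can be categorized into $P_{\mathrm{active}}$ (active), $P_{\mathrm{idle}}$ (idle), $P_{\mathrm{entry}}$ (low-power entry), and $P_{\mathrm{exit}}$ (low-power exit), while the total energy can be given as below, where $t_{x}$ denotes the duration of the corresponding phase.
$$E_{\mathrm{total}} = P_{\mathrm{active}}\,t_{\mathrm{active}} + P_{\mathrm{idle}}\,t_{\mathrm{idle}} + P_{\mathrm{entry}}\,t_{\mathrm{entry}} + P_{\mathrm{exit}}\,t_{\mathrm{exit}}$$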
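As a minimal sketch of this per-phase bookkeeping, the energy of one period can be computed by summing power times duration over the four phases. We assume here that each phase power is obtained as supply voltage times mean measured current; the class and function names, the supply voltage, and all current and duration values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    current_a: float   # mean current during the phase, in amperes
    duration_s: float  # phase duration, in seconds


def period_energy_j(phases, supply_v):
    """Sum of (supply voltage * mean current * duration) over all phases."""
    return sum(supply_v * p.current_a * p.duration_s for p in phases.values())


def energy_share(phases, supply_v):
    """Fraction of the period energy contributed by each phase."""
    total = period_energy_j(phases, supply_v)
    return {
        name: supply_v * p.current_a * p.duration_s / total
        for name, p in phases.items()
    }


# Hypothetical values, not measurements from this study.
example = {
    "active": Phase(current_a=0.050, duration_s=0.006),
    "idle": Phase(current_a=0.002, duration_s=0.300),
    "entry": Phase(current_a=0.030, duration_s=0.002),
    "exit": Phase(current_a=0.030, duration_s=0.002),
}
print(period_energy_j(example, supply_v=3.3))
print(energy_share(example, supply_v=3.3))
```

Even with a low idle current, the idle phase in this example contributes the largest share of the period energy because it lasts far longer than the other phases.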
4.3. Evaluation of LP Mode Configurations Across Implementations
The current consumption measured for the three implementations, operating at a main clock of 89.6 MHz under different LP mode configurations, is presented in
Figure 8. It compares the measured active- and idle-phase (LP-phase) current consumption of the three implementations under different LP configurations.
Idle-phase consumption: Across all implementations, the highest idle current is observed for the
_None configuration (
Figure 8a), in which no LP mode is enabled. In this case, the SW-Impl exhibits the lowest idle current because no logic is instantiated in the FPGA fabric, whereas the Co-Impl and HW-Impl suffer additional consumption due to the presence of fabric modules. Placing only the processor to sleep mode during the idle phase (
_S config) reduces the idle current in all implementations (
Figure 8b). The SW-Impl continues to show the lowest idle current, as it does not have any fabric module. Larger reductions are achieved in the
_FF1 config where the fabric is placed in F*F while the processor remains active and operates from the 1 MHz standby clock (
Figure 8d). Here, the idle current further decreases in all implementations as compared to the corresponding
_S config. This occurs due to the fabric logic, CCCs, and associated PLLs being powered down, leaving only the MSS active. However, when the fabric is placed in F*F and the processor operates from the 50 MHz standby clock (
_FF50 config), the idle current is higher than in the
_FF1 case (
Figure 8c). This is because, even when the processor is sleeping, the standby clock continues to drive always-on MSS logic (e.g., the interrupt controller and any enabled peripherals), so a 50 MHz standby clock consumes more than a 1 MHz one. The lowest idle current is observed when F*F is combined with processor sleep (
_FF1_S and
_FF50_S). In these configurations, the fabric is in F*F while the processor is in sleep mode (
Figure 8e,f). As expected,
_FF1_S achieves a lower idle current than
_FF50_S due to the lower standby clock frequency.
It is also worth noting that, for a given LP configuration, all three implementations exhibit nearly identical idle current consumption except in the _None and _S configurations. Here, the fabric is not placed in F*F and therefore remains powered during the idle phase, leading to an implementation-dependent idle current.
Active-phase consumption: During the active phase, the current consumption trends differ from those observed in the idle phase and depend on task partitioning. For the
_None and
_S configurations, the SW-Impl consumes less active current than the Co-Impl and HW-Impl (Figure 8a,b), since computation is performed entirely on the processor and no fabric logic is present. In contrast, the HW-Impl exhibits a significantly lower active current when the processor is also placed in sleep mode during the active phase while the fabric performs the processing. This behavior of the HW-Impl is evident in its
HW89.6_SS config (
Figure 8b) and becomes more prominent in
HW89.6_FF1_SS (
Figure 8f) and
HW89.6_FF50_SS (
Figure 8e) configurations, where the processor remains in sleep mode not only during idle periods but also for most of the active phase. As a result, in these HW-only configurations, the active-phase current of the HW implementation is lower than that of the corresponding SW and Co-Impl cases.
Furthermore, the entry and exit latencies with a 50 MHz standby clock are significantly shorter than with a 1 MHz standby clock. This can be beneficial in very-high-duty-cycle applications where the idle intervals are very short. A 50 MHz standby clock is also advantageous in scenarios where only the fabric has idle periods, while the software on the processor must remain on and execute high-speed tasks that cannot be completed in time if the MSS is limited to 1 MHz before the next active period begins.
Figure 9 presents the energy consumption over a 20 s interval (corresponding to one stable HR estimate) for the three implementations under the different LP configurations. While the expected reduction in energy across LP configurations is noticeable, the energy difference between implementations for a given configuration is negligible, except for the
_None and
_S cases. This behavior is expected because of the very-low duty cycle (approximately 2%); hence, the total energy is dominated by the idle interval. The idle current, in those configurations, is nearly identical across implementations. Consequently, reductions achieved during the short active phase have a limited impact on the 20 s energy consumption. As the duty cycle increases, the active phase contribution grows and implementation-level differences become more noticeable. This trend is analyzed further in
Section 4.5. At a standby clock of 1 MHz, the best-performing LP configuration is _F1_S for the SW-Impl and Co-Impl and _F1_SS for the HW-Impl. Similarly, at a standby clock of 50 MHz, the best-performing configuration is _F50_S for the SW-Impl and Co-Impl and _F50_SS for the HW-Impl. In the following sections, we use only these best-performing configurations for further analysis. Finally, the resource consumption for each implementation is presented in Table 4.
4.4. Energy and Latency Trade-Offs Under Frequency Scaling
Since the LP configurations tested in the previous section focused on reducing the idle current, the lowest idle consumption was achieved when the FPGA fabric was placed in F*F and the processor was placed in sleep mode. In these configurations, the idle current is nearly identical across implementations because both the fabric and processor core clock are gated off during the idle phase. In such configurations, the remaining idle consumption is only due to a small set of always-on MSS blocks that operate from the standby clock and that are largely independent of the main clock. In contrast, during the active interval when the processor and/or fabric logic is running, the energy consumption depends on dynamic switching activity and therefore varies with the chosen main clock.
This section evaluates the impact of main clock-frequency downscaling on energy consumption and on the associated latencies. For each implementation, we selected the best-performing LP configurations identified in the previous section (one for each standby clock option). We also included _None as a baseline reference for comparison across frequencies.
Since dynamic power scales approximately as $P_{\mathrm{dyn}} \propto C V^{2} f$ (switched capacitance $C$, supply voltage $V$, clock frequency $f$), downscaling the main clock reduces the active-phase current consumption, as observed for the SW-Impl in Figure 10a. However, lowering the clock frequency increases the clock period and therefore extends the execution time of the software workload, leading to a longer active interval.
Figure 10b compares the energy per period for
_None,
_F1,
_F50,
_F1_S,
_F1_SS,
_F50_S and
_F50_SS for the three implementations at different main clocks. In the
_None config, the energy decreases as the operating frequency is reduced. This happens because the processor and fabric modules remain clocked during the long idle interval. Reducing the clock therefore lowers dynamic consumption in the idle phase, and the effect becomes visible when looking at total energy consumption. In contrast,
_F1_S,
_F1_SS,
_F50_S and
_F50_SS show negligible energy reduction with frequency downscaling. In these configurations, the fabric is placed in F*F and the processor is in sleep during the idle phase. As a result, the idle energy is largely independent of the main clock. Although frequency downscaling reduces the active phase energy, the active interval contributes little to the total energy due to the very-low duty cycle. The overall energy therefore changes only slightly because it is dominated by the long idle period, whose consumption remains nearly constant across the tested frequencies.
The trends in active time can be explained by the dominant communication-interface latencies. The sample acquisition is constrained by the data rate (400 kHz) of the I²C bus, and the result reporting is constrained by the baud rate (115,200) of the UART. Hence, frequency downscaling mainly affects the HR DSP-chain computation time. From Figure 11a, it can be observed that this change is noticeable in the SW-Impl because the DSP-chain is implemented sequentially on the processor. However, in the Co-Impl, the DSP-chain runs in the fabric and processes each sample in nine clock cycles. As a result, the increase in computation time due to a lower clock frequency is small over one period. On the other hand, in the HW-Impl, the DSP-chain is fully pipelined with the I²C acquisition, so its operation is completed within the acquisition latency. The I²C module also performs an additional read of the sensor interrupt register to clear the interrupt. Consequently, the DSP-chain finishes before the I²C transaction ends, and the effect of frequency scaling on latency is not visible over the active period. Thus, there is a trade-off between power consumption and active duty cycle: if the duty-cycle requirement is relaxed enough, a lower frequency can be used to save further power.
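A rough, back-of-the-envelope sketch illustrates why the fixed-rate interfaces dominate the active time. The bus rates, the 31-sample frame, and the nine cycles per sample come from the text; the bytes per sample, the report length, the 8N1 UART framing, and the function names are assumptions made only for this example, and I²C START/STOP conditions and clock stretching are ignored.

```python
I2C_CLOCK_HZ = 400_000   # I2C data rate (from the text)
UART_BAUD = 115_200      # UART baud rate (from the text)


def i2c_read_time_s(n_data_bytes, clock_hz=I2C_CLOCK_HZ):
    """Lower bound for one register-addressed burst read.

    Bytes on the bus: address+W, register, address+R, then the data bytes.
    Each byte costs 9 SCL cycles (8 data bits plus ACK/NACK).
    """
    return 9 * (3 + n_data_bytes) / clock_hz


def uart_tx_time_s(n_bytes, baud=UART_BAUD, bits_per_byte=10):
    """Transmit time assuming 8N1 framing (start + 8 data + stop bits)."""
    return n_bytes * bits_per_byte / baud


def compute_time_s(n_samples, cycles_per_sample, clock_hz):
    """Time for a fabric DSP-chain that needs a fixed cycle count per sample."""
    return n_samples * cycles_per_sample / clock_hz


FRAME = 31            # samples per frame (from the text)
BYTES_PER_SAMPLE = 3  # assumption, sensor-dependent
REPORT_BYTES = 16     # assumption, length of the UART report

acquisition = i2c_read_time_s(FRAME * BYTES_PER_SAMPLE)
reporting = uart_tx_time_s(REPORT_BYTES)
dsp_fast = compute_time_s(FRAME, 9, 89.6e6)
dsp_slow = compute_time_s(FRAME, 9, 24e6)
print(acquisition, reporting, dsp_fast, dsp_slow)
```

Under these assumptions the fabric DSP-chain takes on the order of microseconds per frame at either clock, while acquisition and reporting take milliseconds, so the main clock has little effect on the active time whenever the computation runs in the fabric.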
From
Figure 10a, it can be seen that the LP-mode entry and exit latencies also change with the operating frequency of the processor. Entering and exiting the F*F also introduces some latencies. During F*F entry, the processor’s firmware first performs the required register configuration and then requests the System-Controller to place the fabric into F*F. The System-Controller switches the MSS clock from the main clock to the standby clock, powers down PLLs/CCCs, and notifies the firmware when the F*F entry sequence is complete. After receiving the “F*F done” notification, the firmware’s F*F routine returns and the processor executes sleep instructions. In this work, the
LP entry interval is defined as the time from calling the firmware F*F routine to the execution of the sleep instructions. On an exit event (sensor interrupt), the System-Controller initiates the wake-up sequence by powering up the PLLs/CCCs and notifying the firmware via an interrupt. It then waits for the fabric PLL and MPLL to acquire a lock. Afterwards, it switches the MSS clock back to the main clock from the standby clock. Upon receiving the exit notification, the firmware exit routine concludes and returns. In this work, the
LP exit interval is defined as the time from the external interrupt assertion to the return from the firmware F*F routine.
Since both LP entry and exit sequences involve some steps executed on the main clock and some on the standby clock, we evaluated these intervals across different main clocks for both standby clock options.
Figure 11b shows that the F*F exit time increases slightly as the main clock decreases. The exit time is also consistently higher when using the 1 MHz standby clock than the 50 MHz standby clock. In contrast, the F*F entry time shows the opposite trend.
Figure 11c indicates that the entry interval is larger at higher main clocks and smaller at lower clocks. This dependence is more pronounced with the 1 MHz standby clock and is negligible with the 50 MHz standby clock. The cause of the entry-time trend is not fully clear, although the trend is consistent across repeated runs. One possible reason is that the System-Controller might need more time for a clean shutdown when the fabric is running at a higher frequency. We therefore treat it as an empirical observation.
In LP-optimized configurations, frequency downscaling mainly reduces active-phase energy. Its impact on total period energy is therefore small due to the very low duty cycle. In the next section, we evaluate how this behavior changes when the duty cycle or event rate increases and the active phase occupies a larger share of the total period.
4.5. Effect of Event-Rate/Duty-Cycle Scaling on Energy Consumption
The HR-monitoring workload used in this work is lightweight and has a very low duty cycle. In practice, edge workloads can be more compute-intensive, and sensor characteristics may vary. These factors can increase the active time and thereby the duty cycle, which can shift the energy-optimal operating point to a different LP configuration. A higher computational demand increases the per-event processing (active) time, thereby increasing the duty cycle for a given event period. Likewise, the duty cycle can increase with a different sensor type, a different event rate (interrupt or sampling rate), a larger data volume per event (sample width, frame size, or number of channels), or a different interface transfer rate. For example, for a fixed interface rate, larger frames or more channels increase acquisition time and therefore increase the duty cycle. Similarly, a higher event rate shortens the event period, thus reducing the available idle time and thereby increasing the duty cycle. This section analyzes the impact of a duty-cycle increase on energy consumption and operating-point selection using a scaling model.
In general, an increased duty cycle can be interpreted in two equivalent ways: (i) the per-event active time increases for a given event period, or (ii) the event rate increases, which shortens the event period. We adopt the second approach because it provides a simple way to study higher-duty-cycle operation without changing the algorithm or sensor type. Instead of modifying the system under study and re-measuring at higher duty cycles, we extrapolated from the measured single-event case of the previous section. We assume that k identical events occur within the same 310 ms window, and that each event triggers a full cycle of activity together with the F*F entry and exit phases. Under these assumptions, the measured currents and latencies from the previous section (k = 1 case) remain representative. Moreover, the total time spent in the active and LP-entry/exit phases scales with k and thereby reflects an increased duty cycle.
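The scaling assumption can be sketched as follows: the per-event time and charge are multiplied by k, and the remainder of the window is spent at the idle current. The class and function names, the supply voltage, and all numeric values are hypothetical; they are chosen only to show how repeated LP transitions can move the energy-optimal configuration as k grows, and they are not measurements from this study.

```python
from dataclasses import dataclass

WINDOW_S = 0.310  # event window (from the text)


@dataclass(frozen=True)
class Config:
    name: str
    event_time_s: float    # active + LP entry + LP exit time for one event
    event_charge_c: float  # charge drawn during that time (sum of I * t)
    idle_current_a: float  # current drawn for the rest of the window


def duty_cycle(cfg, k, window_s=WINDOW_S):
    """Fraction of the window spent outside the idle phase for k events."""
    return k * cfg.event_time_s / window_s


def window_energy_j(cfg, k, supply_v, window_s=WINDOW_S):
    """Energy of one window containing k identical events."""
    busy_s = k * cfg.event_time_s
    if busy_s > window_s:
        raise ValueError("k events do not fit into the window")
    idle_s = window_s - busy_s
    return supply_v * (k * cfg.event_charge_c + cfg.idle_current_a * idle_s)


# Hypothetical configurations: deep LP mode with transition overhead
# versus a shallow mode without it.
deep = Config("deep_lp", event_time_s=0.010, event_charge_c=3.0e-4,
              idle_current_a=0.001)
shallow = Config("shallow_lp", event_time_s=0.006, event_charge_c=1.8e-4,
                 idle_current_a=0.010)

for k in (1, 27):
    print(k, window_energy_j(deep, k, 3.3), window_energy_j(shallow, k, 3.3))
```

In this example the deep LP mode is clearly better at k = 1, where idle energy dominates, but the shallow mode draws less energy at k = 27, where the per-event transition overhead is paid many times.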
To determine the maximum feasible event rate
k, we evaluated the best LP configuration from each implementation under both standby-clock options and across main clock-frequency scaling, as shown in
Figure 12. We reserved at least 10% of the period for the idle phase. In our HR case study, each implementation has a different active time per event. Therefore, the same 310 ms window allows for different maximum
k values. The baseline (
k = 1) active time follows the trend SW-Impl > Co-Impl > HW-Impl. Therefore, SW-Impl supports a lower event rate than Co-Impl and HW-Impl for the same clocking and LP configuration. The main clock also affects higher
k feasibility. At lower main clocks, the active time per event is generally longer. This reduces the maximum feasible
k. This can be observed for SW and Co-Impl from
Figure 12a,b, where the maximum event rate is lower at 24 MHz than at 89.6 MHz. However, the HW implementation shows a different scaling trend across main clocks, i.e., a lower
k at 89.6 MHz than at 24 MHz. This follows the same I/O-bound versus compute-bound behavior discussed in the previous section. In HW-Impl, the HR DSP-chain is pipelined with the sample-reading process, so its latency is largely hidden under the I²C acquisition time. Therefore, the event interval is almost fixed and set by the fixed rates of the I²C and UART modules. Since the event interval in HW-Impl is almost independent of the main clock, the LP entry latency becomes more influential. As the measured LP entry time is higher at higher main clocks (
Figure 11c), it can result in an increased total active time at higher frequencies and thereby reduce the maximum feasible
k.
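The feasibility check described above can be sketched as follows. This is a minimal illustration, assuming that each event costs one active interval plus one LP entry and one LP exit, and that at least 10% of the 310 ms window is reserved for the idle phase. The function name, parameter names, and the example timings are hypothetical and are not measured values.

```python
import math


def max_feasible_k(t_active_ms, t_entry_ms, t_exit_ms,
                   period_ms=310.0, idle_reserve=0.10):
    """Largest number of identical events that fit in one window.

    Assumes every event needs active + LP-entry + LP-exit time and that
    a fraction `idle_reserve` of the window must remain idle.
    """
    per_event_ms = t_active_ms + t_entry_ms + t_exit_ms
    if per_event_ms <= 0:
        raise ValueError("per-event time must be positive")
    budget_ms = period_ms * (1.0 - idle_reserve)
    return math.floor(budget_ms / per_event_ms)


if __name__ == "__main__":
    # Hypothetical timings: a slower standby clock lengthens entry/exit.
    print(max_feasible_k(20.0, 8.0, 2.0))   # longer transitions
    print(max_feasible_k(20.0, 2.0, 1.0))   # shorter transitions
```

With a fixed active time, shortening the entry and exit times raises the returned value, which mirrors the effect of the standby clock on the maximum feasible k.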
Since the F*F entry time limits the maximum feasible event rate, we also evaluated the 50 MHz standby clock option. With a 50 MHz standby clock, the firmware overhead during the entry routine is shorter, so the F*F entry latency is reduced. As a result,
Figure 12c,d shows that a higher
k becomes feasible with a 50 MHz standby clock. It can also be observed from
Figure 12 that the difference in the maximum feasible k among the three implementations is larger at 24 MHz than at 89.6 MHz. At lower main clocks, the HR DSP-chain computation time is more visible in SW-Impl and Co-Impl and therefore differentiates the implementations more strongly. At higher main clocks, the DSP-chain latencies shrink and the fixed-rate I²C and UART latencies dominate. This reduces the difference in total active time and brings the feasible k values closer to each other. It can be inferred that when the application-processing interval is short and comparable to the F*F entry time, a 1 MHz standby clock can noticeably extend the total active time compared to a 50 MHz standby clock.
Next, we evaluated the energy consumption of the best LP configurations over event-rate scaling, for both standby clocks (50 MHz and 1 MHz) and two main clocks (89.6 MHz and 24 MHz), as shown in
Figure 13. The observed energy consumption difference between a 50 MHz and 1 MHz standby clock for the same configurations (e.g.,
HW24_F1_SS and HW24_F50_SS) arises from the standby clock’s impact on transition overhead and the idle current. During the active phase, both configurations run from the main clock, so the active phase clocks are the same. The difference appears during F*F entry, exit and during idle phases, where the standby clock is used. A 1 MHz standby clock increases entry and exit latency, while a 50 MHz standby clock increases the idle current.
It can also be noted that at low event rates, the idle period dominates. Therefore, the standby clock choice has a strong impact on average power. In this case, the energy difference between the 1 MHz and 50 MHz standby configurations is large. As the event rate increases, the active and F*F transition intervals occupy a larger fraction of the 310 ms window and the idle time shrinks. As a result, the energy difference between the two configurations is reduced, as shown in
Figure 13. At very high event rates, the idle time becomes comparable to the F*F entry/exit overhead. In this case, the energy benefit of a 1 MHz standby clock during an idle period becomes small. The 50 MHz standby clock can then be preferable because it reduces entry/exit latency and allows for a higher feasible event rate, with only a small penalty in average power.
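The trade-off between the two standby clocks can be illustrated with a small numerical sketch. It assumes a simple per-window energy model (k events, each with an active and a transition interval, plus the remaining idle time) and a 5 V board supply. All currents and timings below are hypothetical placeholders chosen only to show the trend; they are not the measured values.

```python
def energy_per_window_mj(k, t_act_ms, t_trans_ms, i_act_ma, i_trans_ma,
                         i_idle_ma, v_supply=5.0, period_ms=310.0):
    """Energy (mJ) drawn in one window containing k identical events."""
    busy_ms = k * (t_act_ms + t_trans_ms)
    idle_ms = period_ms - busy_ms
    if idle_ms < 0:
        raise ValueError("k events do not fit in the window")
    # mA * ms gives microcoulombs; multiplied by volts gives microjoules.
    charge_uc = (k * (i_act_ma * t_act_ms + i_trans_ma * t_trans_ms)
                 + i_idle_ma * idle_ms)
    return v_supply * charge_uc / 1000.0


# Hypothetical parameters: the 1 MHz standby clock has a lower idle current
# but a longer entry/exit time than the 50 MHz standby clock.
COMMON = dict(t_act_ms=15.0, i_act_ma=50.0, i_trans_ma=40.0)
F1 = dict(t_trans_ms=12.0, i_idle_ma=25.0, **COMMON)
F50 = dict(t_trans_ms=3.0, i_idle_ma=30.0, **COMMON)


def standby_gap_mj(k):
    """Extra energy of the 50 MHz option over the 1 MHz option."""
    return energy_per_window_mj(k, **F50) - energy_per_window_mj(k, **F1)


if __name__ == "__main__":
    for k in (1, 3, 6):
        print(k, round(standby_gap_mj(k), 3))
```

With these placeholder values, the energy penalty of the 50 MHz standby clock shrinks as k grows, because less of the window is spent idle and more of it is spent in transitions.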
Finally,
Figure 13 also provides an insight into the best operating points as the event rate increases. For
,
HW 24_F50_SS yields lower energy than
SW 89.6_F1_S and
Co 89.6_F1_S. For
,
SW 24_F50_S,
Co 24_F50_S, and
HW 89.6_F50_SS also outperform
SW 89.6_F1_S and
Co 89.6_F1_S. For
,
HW 24_F50_SS further outperforms
SW 24_F1_S,
Co 24_F1_S, and
HW 89.6_F1_SS. These results show that the optimal choice of implementation, standby clock, and the main clock depends on the event rate, and that selecting the appropriate operating point can reduce energy consumption.
5. Discussion and Guidelines
From
Section 4.3, it is observed that, for low-duty-cycle workloads, the idle-phase energy consumption dominates the active-phase consumption. Therefore, the best strategy is to focus on applying the fabric F*F and processor sleep, as shown in
Figure 8f. However, in practice, an edge application can involve heavier computation or different sensor characteristics (e.g., a larger frame size), which increases the effective duty cycle. The duty-cycle scaling analysis in
Section 4.4 shows that the active phase starts dominating with an increasing duty cycle; consequently, more effective optimization requires frequency scaling along with F*F and the sleep mode, as shown in
Figure 13. If the event rate increases due to higher sample rates or multimodal/multichannel sensors, the number of LP entry and exit transitions also increases, making LP transition energy and latency noticeable. In such cases, a configuration with a 50 MHz standby clock can help reduce LP entry and exit latencies as well as energy, as indicated by
Figure 11 and
Figure 13.
In LP configurations where the fabric is placed in F*F and the processor is placed in sleep, idle consumption becomes nearly constant and largely independent of the main clock, as shown in
Figure 8. As a result, frequency downscaling reduces the active-phase current, but its impact on period-level energy is small at a low duty cycle, as observed from
Figure 10a.
It is also observed from
Section 4.3 that active-phase consumption depends on task partitioning choices. The SW-implementation executes sample acquisition, DSP-chain, and result reporting in software, so its active time is the largest, as seen from
Figure 11a. The Co-implementation reduces the active time by moving the DSP-chain to the fabric for speedup, but the processor still performs sample acquisition and result reporting, which are transfer-rate-bound. The HW-implementation offloads sample acquisition, the DSP-chain and result reporting to the fabric. This allows the processor to remain in sleep mode during most of the active interval, which reduces the active current, as seen from
Figure 8. The Cortex-M3 processor on this platform is modest compared to more advanced cores or standalone MCUs, so the active-period latency and consumption may differ on platforms with other processor cores.
The fixed-rate communication interfaces can also limit active-time scaling with frequency in low-duty-cycle applications. The I²C interface operates at 400 kHz, and UART transmission occurs at a fixed baud rate of 115,200. These I/O phases set a lower bound on active time. Therefore, frequency downscaling does not lengthen the end-to-end active interval in proportion to the clock period. From
Figure 11a, it can be observed that this effect is more visible in HW-implementations where the DSP-chain is pipelined with I/O interfaces. The main effect of frequency downscaling is a reduction in the active current (see
Figure 10a), while the time impact depends on how much computation is serialized with the I/O transfers.
The standby clock affects both transition overhead (LP entry and exit latency) and idle current consumption. A 1 MHz standby clock reduces idle current under the LP mode. However, it increases LP entry and exit latency, as shown in
Figure 11b,c. A 50 MHz standby clock reduces entry/exit latency and supports higher event rates, but it increases the idle current under the LP mode, as seen from
Figure 8e,f. This occurs because some always-on MSS parts remain clocked at this higher standby clock during processor sleep mode [
20].
The energy-efficient operating point can shift with a higher event rate/duty-cycle. It can be observed from
Section 4.5 that when the event rate increases (larger
k), the idle time shrinks and the LP entry/exit overhead becomes more significant. Under these conditions, the minimum-energy operating point can shift to configurations that were not optimal at a low duty cycle. In particular, configurations using the 50 MHz standby clock can become preferable when entry/exit overhead limits feasibility or when wake-up latency is important, as seen from
Figure 13.
Although we measured the current consumption of the whole FPGA development board (thus including DC–DC conversion losses from 5 V to 1.2 V and 3.3 V, as well as other PCB components and status LEDs), the results still show meaningful energy savings. From
Figure 8f, it can be seen that when the fabric is put into F*F and the processor into sleep mode, the board current in the idle phase reduces from ≈55 mA to ≈25 mA. The remaining ≈20 mA is due to always-on blocks of the SoC, static leakage, and other PCB components. This indicates that, even with a fixed baseline consumption, applying these low-power modes removes a large portion of the avoidable idle overhead.
Furthermore, since the heart-rate workload is lightweight, it uses few FPGA fabric resources, so the energy wasted during the idle period is relatively low. However, with increasing computational demand, fabric resource utilization increases, raising not only the active consumption but also the idle consumption due to clocking and leakage of the additional logic. In such scenarios, applying low-power modes during the idle phase yields even larger savings in idle energy.
Moreover, from
Figure 13, it can also be noticed that as the event rate/duty cycle increases, the difference in energy consumption among different task-partitioning variants also increases. For example, the energy consumption difference between
SW_89.6_F1_S and
HW_24_F1_SS at an event rate of
increases by 4× (from 1 mJ to 4 mJ). Accounting for task partitioning can therefore save a significant amount of energy at higher duty cycles or event rates.
These savings justify the added complexity of applying LP modes; however, the savings themselves are highly dependent on computational intensity and duty cycle.
The above findings are based on a low-duty-cycle workload running on the SmartFusion2 SoC platform and can be generalized to similar platforms for low-duty-cycle workloads. Although the idea of duty-cycle-aware application of low-power modes can be applied to SRAM-based FPGAs, a direct comparison is not straightforward. In SRAM-based FPGAs, power gating typically clears the configuration memory. Therefore, recovering from a power-gated state requires reconfiguration, which adds extra latency and energy overheads during the low-power exit phase. As a result, (i) the effective duty cycle becomes more sensitive to the LP-exit phase due to a higher reconfiguration latency, and (ii) the net energy saving from power gating becomes more sensitive to the LP-exit phase due to reconfiguration energy, as part of the “saved” idle energy is wasted by reconfiguration energy. Furthermore, the reconfiguration overhead also depends on bitstream size. Nonetheless, duty-cycle-aware application of low-power modes can still be beneficial on SRAM-based FPGAs in applications where (i) the duty cycle is low enough to tolerate the reconfiguration latency, and (ii) the reconfiguration energy overhead is sufficiently lower than the savings achieved through power gating.
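The two conditions for SRAM-based FPGAs can be expressed as simple break-even inequalities. This is only a sketch; the symbols are assumptions introduced here for illustration. $P_{\mathrm{idle}}$ and $P_{\mathrm{pg}}$ denote the idle power without and with power gating, $t_{\mathrm{pg}}$ the time spent power-gated per period, $E_{\mathrm{rc}}$ and $t_{\mathrm{rc}}$ the reconfiguration energy and latency, $T$ the event period, and $t_{\mathrm{act}}$ the active time per period.

```latex
\begin{align}
t_{\mathrm{act}} + t_{\mathrm{rc}} &\le T
  && \text{(the reconfiguration latency fits within the period)}, \\
\bigl(P_{\mathrm{idle}} - P_{\mathrm{pg}}\bigr)\, t_{\mathrm{pg}} &> E_{\mathrm{rc}}
  && \text{(the gated idle saving exceeds the reconfiguration energy)}.
\end{align}
```

Because both $E_{\mathrm{rc}}$ and $t_{\mathrm{rc}}$ grow with bitstream size, larger designs need longer idle intervals before power gating pays off.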
On the other hand, the heart-rate monitoring workload is a very low-duty-cycle application. In practice, an edge application can involve heavier computation or different sensor characteristics (sampling/interrupt rate, sample width, frame size, interface transfer rate), which increases the effective duty cycle. We address this with a duty-cycle scaling analysis that provides insight into higher-duty-cycle applications. Furthermore, a higher computational demand can also increase the per-event active current by activating more resources or increasing switching activity. In such cases, the event-rate scaling model does not directly predict new energy values; however, the qualitative guidance from the duty-cycle scaling analysis still applies. Although the active-phase energy will be higher for compute-intensive algorithms, the same trend holds: when the active phase becomes a larger fraction of the period, the energy-optimal operating point can shift to a different configuration.
Guidelines
Finally, we propose the following guidelines, which can be applied as a decision process. The reader can start from the application duty cycle and latency requirements, and then select the most suitable operating point.
Minimize idle energy first for low-duty-cycle workloads: If the application spends most of its time being idle, prioritize configurations that place the fabric in F*F and the processor in sleep mode (e.g., _F1_S or _F50_S), as demonstrated in
Figure 8e,f. In this regime, changes in the main clock have a small effect on period-level energy, as seen in
Figure 10b, because the idle phase dominates.
Choose the standby clock based on wake-up latency and idle duration: If long idle intervals are expected and wake-up latency is not strict, use a 1 MHz standby clock to minimize idle current, as demonstrated in
Figure 10b. If wake-up latency is critical or idle intervals are short due to a higher event rate or duty cycle, use a 50 MHz standby clock to reduce LP entry and exit time, as observed from
Figure 11b,c. The event-rate scaling analysis from
Figure 13 shows that the energy penalty of switching from a 1 MHz standby clock to a 50 MHz standby clock becomes small at high event rates because the idle time is reduced.
Use frequency downscaling when active energy matters: Main-clock downscaling reduces the active current, as observed from
Figure 10a. It provides the most benefit when active time is a significant fraction of the period, as demonstrated in
Figure 13. This occurs at higher event rates, or when computation dominates the active interval. For very low duty cycles, frequency downscaling may not noticeably reduce energy per period even if the active current decreases.
Prefer Co-Impl or HW-Impl when compute-bound tasks dominate: If the workload includes significant computation, moving the DSP-chain to the fabric reduces the active time and can reduce energy. From
Figure 11a, it can be observed that Co-Impl provides a middle ground in terms of active time reduction while HW-Impl offers the lowest active time.
If the interface dominates the active period, optimize the LP and idle phases: When the workload is bounded by fixed-rate I/O interfaces, e.g., I²C or UART transfers, changing the main clock has a limited impact on end-to-end active time, as explained in
Section 4.4. In this case, focus on reducing the LP transition overhead and idle current. Use a 50 MHz standby clock if frequent transitions are required, and a 1 MHz standby clock if the idle period dominates, as demonstrated in
Figure 13.
Check feasibility at high event rates: For increased event-rate scenarios, verify that the schedule fits within the period budget, as observed from
Figure 12. If a configuration does not fit within the period, switch to an operating point with (i) a lower transition overhead, e.g., a 50 MHz standby clock (
Figure 11c), (ii) a faster main clock, or (iii) an implementation with more acceleration from FPGA fabric (
Figure 11a).
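The guidelines above can be read as a small decision procedure. The sketch below is one possible encoding, not a definitive rule set: the function name, the returned field names, and in particular the 10% duty-cycle threshold separating the low- and high-duty-cycle regimes are hypothetical choices made for illustration.

```python
def suggest_operating_point(duty_cycle, wakeup_critical=False,
                            compute_bound=False, low_duty_threshold=0.10):
    """Map workload traits to a coarse operating-point suggestion.

    duty_cycle is the non-idle fraction of the period (0..1). The
    threshold is an assumed placeholder, not a measured boundary.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be within [0, 1]")
    low_duty = duty_cycle < low_duty_threshold
    return {
        # Idle energy first: fabric F*F plus processor sleep in all cases.
        "lp_mode": "fabric F*F + processor sleep",
        # 1 MHz for long, latency-tolerant idle; 50 MHz otherwise.
        "standby_clock_mhz": 50 if (wakeup_critical or not low_duty) else 1,
        # Main-clock downscaling pays off once active time is significant.
        "consider_main_clock_downscaling": not low_duty,
        # Offload the DSP-chain to the fabric for compute-bound workloads.
        "implementation": "Co-Impl or HW-Impl" if compute_bound else "any",
    }


if __name__ == "__main__":
    print(suggest_operating_point(0.02))
    print(suggest_operating_point(0.40, compute_bound=True))
```

Any suggestion obtained this way would still need the feasibility check against the period budget before being adopted.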
6. Conclusions
Energy-constrained edge systems are increasingly required to offer both reconfigurability and higher computational capability to accommodate rapidly evolving and growing workloads. Flash-based SoC FPGAs offer an attractive middle ground by combining reconfigurable acceleration with LP modes. However, the most energy-efficient operating point depends on how these LP modes interact with HW/SW task partitioning and clock scaling. In this work, we comparatively evaluated the effect of these interactions on energy consumption and latency using a sensor-driven heart-rate monitoring application on SmartFusion2 SoC FPGAs. We evaluated the impact of F*F, processor sleep and standby-clock selection on SW, Co, and HW implementations in terms of energy consumption and latencies.
For the measured low-duty-cycle workload, total energy is dominated by idle consumption, and the lowest energy is achieved by combining fabric F*F with processor sleep.
Task partitioning mainly affects the active phase by changing active time; however, its effect on period-level energy is negligible under very low duty cycles. Similarly, although main-clock downscaling reduces the active current, it still yields a limited reduction in total energy per period for low-duty-cycle workloads. Moreover, the variation in active-phase time under frequency scaling is also limited when fixed-rate I/O dominates the active period. The standby clock presents a clear trade-off: 1 MHz minimizes the idle current, while 50 MHz reduces entry/exit latency and enables higher feasible event rates. Using a scaling model, we also observed that the energy-optimal operating point can shift as the event rate increases, and operating-point selection should therefore consider duty-cycle and latency constraints. The presented results provide practical guidelines for choosing an energy-efficient configuration across implementation styles, standby clocks, and main clocks on similar flash-based SoC FPGAs. As future work, we plan to validate these findings on other FPGA types with computationally intensive workloads.