1. Introduction
The implementation of algorithms in digital devices, such as microcontrollers, Field-Programmable Gate Arrays (FPGAs), and Systems-on-Chip (SoCs), is fundamental to the development of control systems, signal processing solutions, and intelligent applications. These systems must simultaneously meet requirements for high performance, low energy consumption, and short development cycles, all under strict resource constraints. A widely adopted strategy to balance these demands is hardware/software (HW/SW) partitioning, which involves deciding which parts of an algorithm are implemented in reconfigurable hardware and which are executed on an embedded processor, according to specific design metrics and constraints.
However, selecting an optimal partition is challenging, as it requires balancing multiple objectives such as execution time, hardware area, and power consumption. Although most current methodologies consider these parameters as primary objectives, memory usage, a critical resource in embedded systems, is rarely integrated explicitly into the decision-making process. When it is included, its estimation is often approximate or carried out only at late stages of development, limiting its value as an early design decision parameter.
In the state of the art, the HW/SW partitioning problem has been addressed using several approaches. One line of work relies on algorithm profiling [1], where blocks with the highest computational load are migrated to hardware. Other studies employ exact methods such as Integer Linear Programming [2] and Branch and Bound [3]. More recently, the dominant trend has been the use of heuristic algorithms, particularly evolutionary approaches [4]. For instance, ref. [5] presents two hybrid algorithms: the first combines Lagrangian Relaxation (LR) with the Subgradient method, while the second integrates LR with the 0–1 Knapsack problem and a Genetic Algorithm. In [6], a game-theory-based approach combining the GO game and the Minmax algorithm is introduced. Other heuristic techniques include an immune-algorithm-based partitioning method [7], as well as multi-objective extensions such as the fireworks-based algorithm proposed in [8]. Although these methods have demonstrated improvements in hardware area and execution time, they remain primarily theoretical and do not address memory as a design metric.
In recent years, several works have highlighted the increasing relevance of memory behavior and data-movement constraints in modern embedded and reconfigurable architectures. For instance, ref. [1] proposes a HW/SW partitioning strategy for real-time object detection on SoCs that explicitly models memory bandwidth limitations to improve performance. More recent studies extend this perspective: ref. [9] introduces MEDEA, a design-time multi-objective manager that incorporates “memory-aware” mechanisms, such as tiling, DVFS, and task scheduling, to optimize heterogeneous systems. Similarly, ref. [10] presents a holistic optimization framework for FPGA accelerators that jointly considers partitioning, scheduling, and data-movement costs, demonstrating that memory constraints increasingly drive architectural decisions. Other works focus on memory-specific optimizations in hardware design flows, such as the pattern-morphing-based memory partitioning technique proposed in [11] for reducing access conflicts in HLS-generated architectures. Recent co-design surveys, for example, ref. [12], emphasize that modern AI-oriented embedded systems critically depend on efficient memory utilization throughout the HW/SW co-design process.
However, none of these recent methodologies provide a detailed, module-level extraction of memory usage in C-based implementations, nor do they integrate this information into a multi-objective HW/SW partitioning flow. This gap highlights the need for methodologies that incorporate memory as a first-class design metric from the early stages of system development.
In this work, an HW/SW partitioning methodology is proposed that explicitly incorporates a memory usage metric, together with hardware area, within a multi-objective optimization framework. The novel contributions of this work are as follows:
An objective function adapted to SoCs with a hard-core processor, accounting for resources associated with HW/SW communication.
A procedure to extract memory metrics from detailed analysis of memory mapping in C-based software implementations.
An evaluation flow that considers aspects such as auxiliary data conversion functions and HW/SW synchronization, as well as correction factors to avoid overestimation due to shared libraries.
The proposed methodology is validated on a PD-type fuzzy controller for a DC motor implemented on a Xilinx Zynq® SoC (San Jose, CA, USA). This controller architecture, including its FPGA implementation, was introduced in [13]. A PD-type fuzzy controller is selected because it provides a good trade-off between robustness and implementation cost: the fuzzy rule base improves the handling of nonlinearities and uncertainties while using only the error and its derivative, as reported in recent motion-control applications such as bridge cranes, lane-keeping systems, and cable-driven robots [14,15,16]. At the same time, the controller remains simpler than more elaborate nonlinear schemes, and its limited rule base and Mamdani-type inference mechanism lead to moderate memory and hardware requirements, in line with recent sparse fuzzy PID implementations [17]. This makes the PD-type fuzzy controller a convenient benchmark for studying memory- and area-aware HW/SW partitioning on SoC platforms under the resource bounds adopted in this work (see Remark 1). The optimization problem is solved using the Non-dominated Sorting Genetic Algorithm II (NSGA-II). Thus, this methodology aims to guide design decisions for digital devices with limited memory resources.
Novelty and Contributions
While the introduction presents a broad discussion of related work, this subsection summarizes the specific contributions of the proposed methodology and clarifies how it differs from representative approaches in the literature.
Table 1 highlights key distinctions in partitioning strategies, memory-awareness, analysis granularity, and reported contributions.
This paper is organized as follows. Section 2 reviews the objective functions found in the literature that explicitly consider memory. Section 3 introduces the proposed methodology for extracting and integrating the memory metric into the partitioning process. Section 4 applies this methodology to a case study, performing the HW/SW partitioning of a fuzzy control algorithm with hardware resource consumption and memory usage as the main metrics. Section 5 reports the experimental results of the selected HW/SW configuration. Finally, Section 6 presents the conclusions.
2. Previous Work
The objective functions for memory usage reported in the literature are not very diverse; however, they play a crucial role in the optimization process, since the quality of the objective function has a direct impact on the quality of the obtained solutions. One representative example is the mono-objective function proposed in [18], where a single objective function combines multiple design metrics, as shown below:

F(x) = (T_SW(x) + T_HW(x) − T_HW^all) / (T_SW^all − T_HW^all) + M(x) / M_SW^all,    (1)

where T_SW^all and M_SW^all denote the execution time and the memory requirement, respectively, when all modules are implemented in software. T_HW^all is the execution time when all modules are implemented in hardware, while T_SW(x) and T_HW(x) represent the execution time of the solution when implemented in software and hardware, respectively. The last two are defined as follows:

T_SW(x) = Σ_{i=1}^{n} (1 − x_i) t_{s,i},    (2)

T_HW(x) = Σ_{i=1}^{n} x_i t_{h,i},    (3)

where x = (x_1, …, x_n) denotes the vector of binary decision variables, and n specifies its dimension, i.e., the total number of modules in the system. t_{s,i} and t_{h,i} represent the execution times of the i-th module in software and hardware, respectively. Each binary variable x_i indicates whether the i-th module is mapped to software (x_i = 0) or to hardware (x_i = 1). Finally, M denotes the memory requirements of the components assigned to the software architecture. The total memory consumption is obtained as:

M(x) = Σ_{i=1}^{n} (1 − x_i) m_i,    (4)

where m_i represents the software memory cost for the i-th module.
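As a sketch of how the per-module sums described above can be evaluated in code, the following C fragment computes the software execution time, hardware execution time, and software memory of a candidate partition. The convention x[i] = 1 for hardware and x[i] = 0 for software, as well as all identifiers, are illustrative assumptions, not taken from [18].

```c
#include <stddef.h>

/* Execution time of the software-mapped part of a partition x,
 * where x[i] = 1 maps module i to hardware (assumed convention). */
double t_sw(const int *x, const double *ts, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (1 - x[i]) * ts[i];
    return sum;
}

/* Execution time of the hardware-mapped part. */
double t_hw(const int *x, const double *th, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * th[i];
    return sum;
}

/* Memory consumed by the components assigned to software. */
double mem_sw(const int *x, const double *m, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (1 - x[i]) * m[i];
    return sum;
}
```

These three sums are the building blocks of the combined cost; normalization against the all-software and all-hardware reference values can then be applied on top of them.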
On the other hand, the work in [19] presents a multi-objective approach, where each design metric is modeled by an independent objective function. In addition, parallelism is considered through the introduction of a binary variable p_i, which indicates whether a module can be executed in parallel (i.e., it has no dependency on other modules). The objective functions to be minimized are:

A(x) = Σ_{i=1}^{n} x_i h_i,    (5)

H_M(x) = Σ_{i=1}^{n} x_i d_i,    (6)

M(x) = Σ_{i=1}^{n} (1 − x_i) m_i,    (7)

T(x) = Σ_{i=1}^{n} (1 − p_i) [x_i t_{h,i} + (1 − x_i) t_{s,i}],    (8)

where A represents the hardware area, H_M the hardware multipliers, M the memory blocks, and T the execution time. The vector x again denotes the decision variables, and p_i specifies whether module i can be executed in parallel (i.e., has no dependencies), in which case its execution time overlaps with that of other modules. h_i denotes the hardware resources consumed by module i (LUTs or FFs), m_i represents the memory blocks used by module i when implemented in software, t_{h,i} and t_{s,i} are the execution times of module i in hardware and software, respectively, and d_i corresponds to the number of hardware multipliers (DSP units) used by module i.
The above objective functions are constrained by the following expressions:

A(x) ≤ S,    (9)

M(x) ≤ M_max,    (10)

H_M(x) ≤ H,    (11)

T(x) ≤ T_max,    (12)

where S, M_max, and H are, respectively, the area, memory size, and number of hardware multipliers available for the design. T_max corresponds to the maximum allowable execution time.
It is worth noting that the work in [19] considers a soft-core processor. Therefore, the area function includes a term representing the resources consumed by the processor, the bus, and its peripherals, while an additional term accounts for the DSP units used to implement the processor on the FPGA. Considering all these aspects, this multi-objective formulation is the most suitable approach when the goal is to perform a Pareto-based optimization. Thus, the trade-offs among metrics can be analyzed more effectively, and the constraints can be applied directly to the Pareto front, facilitating the identification of feasible configurations that satisfy the system requirements.
3. Proposed Methodology for Objective Functions of Hardware Area and Memory
The objective functions commonly reported in the literature are simplified and do not fully reflect real implementation behavior. To address this limitation, a methodology is proposed to construct practical objective functions for hardware area and memory, based on extracting accurate resource usage from system modules and incorporating correction and synchronization factors. These functions can then be directly integrated into HW/SW partitioning optimization processes.
3.1. Memory Usage
Memory usage in software implementations consists of two components: the intrinsic memory required by each functional module and the fixed memory inherent to the processor architecture. The proposed methodology provides a systematic procedure to estimate the intrinsic memory consumption of each module and include it explicitly as a metric in the HW/SW partitioning process.
3.1.1. Generalization of Memory Consumption Extraction
To estimate the memory consumption of each module, the following steps are performed:
Determine the minimum system memory
A minimal C project containing only the processor initialization logic is built to determine the lower bound of required system memory, referred to as the minimum memory.
Implement each module individually
Each module is implemented and built independently within the chosen development platform (e.g., Vitis™). This provides the memory usage report and the corresponding memory mapping information.
Compute intrinsic memory consumption
The intrinsic memory of each module is obtained by subtracting the minimum memory from the memory reported for the module, isolating the memory attributable to its functionality.
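The three steps above reduce, for each module, to a subtraction of the minimum system memory from the module's stand-alone build report. A minimal C sketch, with illustrative byte values and function names of our own choosing, is:

```c
#include <stddef.h>

/* For each module, subtract the minimum system memory (obtained from the
 * empty-project build) from the module's stand-alone build report to
 * isolate the memory attributable to its functionality. */
void intrinsic_memory(const long *module_report, long min_system,
                      long *intrinsic, size_t n) {
    for (size_t i = 0; i < n; i++)
        intrinsic[i] = module_report[i] - min_system;
}
```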
3.1.2. General Objective Function of Memory Usage
Using the extracted data, the proposed general memory objective function is defined as:

M_T(x) = Σ_{i=1}^{n} (1 − x_i) m_i + M_min + δ(x) · M_sync − M_corr,    (13)

where Σ_{i=1}^{n} (1 − x_i) m_i represents the sum of the intrinsic memory of all modules implemented in software, and n is the total number of modules. The term M_min is the minimum system memory, M_corr is a correction factor for shared libraries, and M_sync represents the additional memory required by synchronization libraries when hardware and software coexist.
The correction factor subtracts the memory consumption associated with libraries that are shared across different modules in order to avoid counting them multiple times. Since the functions were implemented separately to obtain their individual metrics, directly summing the reported memory usage would lead to an overestimation whenever common libraries are included in more than one module. In practice, however, each shared library needs to be loaded only once, regardless of how many functions use it. This factor can be estimated as follows:
Analyze the source code files (.c) of each module to identify libraries that are repeatedly included.
Locate these libraries in the memory mapping files (.mem or equivalent) to determine their memory usage.
Subtract the duplicated consumption from the total estimation, thereby avoiding an overestimation of the actual memory usage in the final configuration.
The proposed operator δ(x) activates the synchronization term only when both hardware and software modules are present:

δ(x) = 1 if 0 < Σ_{i=1}^{n} x_i < n; δ(x) = 0 otherwise.    (14)
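As an illustration, the memory objective and its activation operator can be sketched in C as follows. The convention x[i] = 1 for hardware, and the parameter names m_min, m_sync, and m_corr, are assumptions chosen to mirror the prose above.

```c
#include <stddef.h>

/* Activation operator: 1 only when hardware and software modules coexist. */
int delta(const int *x, size_t n) {
    size_t hw = 0;
    for (size_t i = 0; i < n; i++)
        hw += x[i];
    return (hw > 0 && hw < n) ? 1 : 0;
}

/* General memory objective: intrinsic memory of software-mapped modules,
 * plus minimum system memory, plus synchronization overhead when HW and SW
 * coexist, minus the shared-library correction. */
double memory_objective(const int *x, const double *m, size_t n,
                        double m_min, double m_sync, double m_corr) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (1 - x[i]) * m[i];   /* software-mapped modules only */
    return sum + m_min + delta(x, n) * m_sync - m_corr;
}
```

Note that for all-hardware or all-software partitions the operator returns 0, so no synchronization memory is charged, which is exactly the behavior described in the text.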
3.2. Hardware Area Usage
The hardware area determines the feasibility of an implementation on a given device. In FPGAs, this cost is expressed in logic resources such as LUTs, FFs, and hardware multipliers (DSP blocks). The proposed objective functions extend baseline formulations by adding communication overhead and auxiliary hardware when HW and SW coexist.
For LUTs and FFs, the generalized objective function proposed is:

A(x) = Σ_{i=1}^{n} x_i h_i + δ(x) · A_comm + A_aux,    (15)

where Σ_{i=1}^{n} x_i h_i is the baseline expression, A_comm accounts for communication-related hardware, and A_aux represents auxiliary hardware resources.
For hardware multipliers, the proposed formulation is:

H_M(x) = Σ_{i=1}^{n} x_i d_i + δ(x) · D_comm + D_aux,    (16)

where Σ_{i=1}^{n} x_i d_i is the baseline usage, D_comm captures communication overhead, and D_aux includes multipliers required outside the main modules.
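A corresponding C sketch of the generalized area function, again assuming x[i] = 1 denotes hardware and using illustrative names for the communication and auxiliary terms, is:

```c
#include <stddef.h>

/* Generalized area objective: per-module resources of hardware-mapped
 * modules, plus communication hardware when HW and SW coexist, plus
 * auxiliary hardware. h[i] is the LUT (or FF) count of module i. */
double area_objective(const int *x, const double *h, size_t n,
                      double a_comm, double a_aux) {
    double sum = 0.0;
    size_t hw = 0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * h[i];
        hw += x[i];
    }
    int d = (hw > 0 && hw < n) ? 1 : 0;   /* HW and SW coexist */
    return sum + d * a_comm + a_aux;
}
```

The multiplier (DSP) objective has the same shape, with per-module DSP counts and DSP-specific communication and auxiliary terms.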
3.3. Constraints
Each objective function is subject to the following resource constraints:

A(x) ≤ S,    (17)

H_M(x) ≤ H,    (18)

M_T(x) ≤ M_max,    (19)

where S, H, and M_max denote the available LUT/FF area, hardware multipliers, and memory capacity, respectively.
3.4. Limitations of the Methodology
Regarding the proposed methodology for obtaining and using memory utilization in HW/SW partitioning, it is important to note that the memory-oriented flow is more complex than its hardware-area counterpart and therefore presents additional limitations.
First, based on the proposed Equation (13), the main limitation arises from the correction term M_corr, since it requires identifying and analyzing, on a case-by-case basis, all modules that share common memory libraries. Consequently, when the number of modules is large, this process may become time-consuming during the design stage. A possible solution is the development of an automated script capable of parsing and analyzing memory-mapping files to detect shared libraries, which we consider as future work. In the present manuscript, the focus was placed on exploring and validating the methodology rather than fully automating this analysis.
Second, concerning the applicability of the methodology to other SoC vendors or device families, the main requirement is the availability of a memory-mapping file. Even if the file format differs across toolchains (e.g., Xilinx versus Intel/Altera), the methodology remains valid as long as the necessary address and memory-allocation information can be extracted.
Third, different hardware memory configurations may alter the interpretation of the metric. For example, FPGAs may use distributed RAM, multiple independent BRAM/URAM banks, or local scratchpad memories. Systems with caches, DMA buffers, or FIFO-based communication introduce further variability, since the effective number of memory accesses may differ from the nominal access count. While the methodology can be extended to these cases, such extensions were beyond the scope of this manuscript and represent an opportunity for future work.
Finally, the limitations of the hardware objective functions, Equations (15) and (16), are less restrictive. In general, FPGA toolchains from major vendors (such as Xilinx or Intel/Altera) provide detailed reports on resource utilization after synthesis and implementation. These reports can be used directly within the proposed HW/SW partitioning framework without requiring additional processing.
4. Case Study: PD Fuzzy Controller for DC Motor
The structure of the fuzzy PD control system is shown in Figure 1. Considering the granularity classification proposed in [19], the present work includes modules with level-1 granularity (arithmetic/logical operators) and level-3 granularity (functional modules). This modular approach preserves the physical significance of each partition, enabling a more intuitive design process that is easier to debug and analyze in case of failures. The design and hardware implementation of the fuzzy PD control system, including the controller architecture, have been detailed in [13]. Therefore, only a brief description of the modules that comprise the control system is provided below:
M1 (level-1 granularity) consists of a subtractor, the generation of a reference signal, and the computation of the tracking error.
M2 (level-3 granularity) includes a robust sliding mode differentiator and a pair of multipliers to apply the proportional and derivative gains.
M3 (level-3 granularity) contains a Mamdani-type fuzzy PD controller and a multiplier to apply the output gain.
M4 (level-3 granularity) implements a decoder for signals from a quadrature encoder.
M5 (level-3 granularity) is responsible for normalizing the speed signal; in the hardware implementation, it also performs word-length reduction from 32 bits to 16 bits.
M6 (level-3 granularity) contains a pulse-width modulation (PWM) generator.
M7 (level-3 granularity) consists of a digital low-pass filter.
Figure 1. System with the proposed granularity.
4.1. Initial Considerations for Adaptation of Particular Objective Functions
The proposed HW/SW partitioning approach considers two primary design objectives: hardware resource utilization and memory usage. Each objective is formulated to enable evaluation within a multi-objective optimization framework. The number of LUTs was selected as the main metric for hardware, while the number of memory blocks is used as a software metric. Additionally, FFs and DSPs will be monitored to ensure they remain within acceptable bounds. In general, the objective functions rely on Equations (13), (15), and (16), but in particular, the objective functions are proposed by considering the following:
A hard core is used in the system.
Communications between the processor and the FPGA are considered.
The frequency dividers used for the operation of the hardware modules are also considered.
Remark 1.
No previous reports were found regarding the resource consumption of a fuzzy PD controller implemented on an FPGA or processor. Therefore, the hardware and software constraints were defined for academic purposes, taking as reference the resources available in a Xilinx Zynq SoC. In particular, the limits were set as follows: memory < 95 kB and area < 2000 LUTs. These constraints are not intended to represent any specific commercial implementation but rather to provide a realistic reference scenario that allows evaluating the behavior of the proposed hardware/software partitioning method.
4.2. Memory Metric and Objective Function
This subsection applies the methodology presented in Section 3 to our case study, i.e., obtaining the minimum memory, taking into account that the software implementation is carried out through a description in the C language.
4.2.1. Memory Consumption Extraction
To analyze the memory sections of a software implementation, the memory map file (with the .map extension) generated after building the project is used. This file lists the memory segments together with their addresses and lengths, as well as a summary of their content and sub-segments.
The first step consists of obtaining the minimum requirements of the system in terms of memory occupation for the proper operation of the processor, so a project free of variables or logic operations was built in C with the minimum program as shown in Listing 1.

Listing 1. Minimal C project.

int main(void) {
    return 0;
}
After reviewing the memory mapping generated during the project build process, the segments were classified into two groups. The first group comprises the constant segments shown in Table 2, which include the Heap, the Stack (segments with user-defined lengths), and reserved sections that are typically not modified since they are related to the internal operation of the processor [20].
The second group, shown in Table 3, contains the segments that vary according to the implemented algorithm; these segments store the variables and machine code used by the processor. Since the program contains the most basic structure possible, the values shown for these segments are minimal, which implies that any increase can be attributed to the implemented algorithm. In addition, considering the memory needed for the constant-length segments, a software implementation requires at least 42.504 kB unless the predefined system configurations are modified. Another point worth mentioning is that the variable segments include rodata, which is not one of the segments predefined by the theory but is specific to the architecture and contains read-only data.
Once the minimum system memory has been determined, each of the modules designed in the Vitis™ 2020.2 software platform is individually built to obtain its corresponding memory usage report. Based on this information, the actual memory consumption of each module is calculated, and the results are presented in Table 4.
4.2.2. Objective Function Construction
The memory objective function derived from Equation (13) takes the following form for the control algorithm:

M_T(x) = Σ_{i=1}^{7} (1 − x_i) m_i + M_min + δ(x) · M_sync − M_corr(x),    (20)

where M_min represents the minimum memory requirements necessary for system operation, as shown in Table 2 and Table 3. It also includes the memory associated with the printf function, which is always incorporated when the processor is used to display the measured speed values, and the custom function ElapsedTime, since one of the key requirements of the test case is to maintain a consistent time constant across all implementations (fully software or HW/SW). This function ensures that the required time constant is preserved in every configuration. The constant M_sync accounts for the memory required to support type conversion functions, since hardware modules operate with fixed-point representation while the software uses floating-point; proper conversion is therefore essential. In addition, M_sync includes the C usleep function, which allows the execution to pause briefly, ensuring that communication control flags remain active for the necessary duration and enabling proper HW/SW synchronization. These functions are used exclusively in HW/SW configurations and are not required in fully hardware or fully software implementations. The correction factor proposed for this case is defined by:

M_corr(x) = (r_io(x) − 1) m_io + (r_pf(x) − 1) m_printf + (r_div(x) − 1) m_div + (r_uart(x) − 1) m_uart,    (21)

where each r(x) denotes the number of software-mapped modules whose builds include the corresponding library, so that only the duplicated copies are subtracted.
In Equation (21), the constants m_io, m_printf, m_div, and m_uart were obtained from the memory consumption of their homonymous libraries, which were identified as common—and therefore repeated—across the implemented functions. Below, these constants are listed together with a brief description of their corresponding libraries:
m_io: contains functions for managing the processor’s input/output ports as well as its interrupts.
m_printf: provides a lightweight implementation of the printf function. Although it lacks floating-point support, it is suitable for printing integers or characters.
m_div: allows 32-bit unsigned integer division (in GCC’s internal naming for the routine __udivsi3, ‘u’ stands for unsigned, ‘si’ denotes a 32-bit single-word integer, and ‘3’ refers to the number of operands).
m_uart: provides functions for initializing the UART, sending/receiving data, checking status, and handling interrupts.
Finally, Table 5 presents the values of the constants related to Equations (20) and (21).
4.3. Area Objective Function
Based on Equation (15), and considering the characteristics of the control algorithm together with the use of a hard-core processor in this project, the term associated with a soft-core processor can be excluded. This term reflects FPGA resources associated with soft-core implementations, which are not relevant here. The resulting expression is:

A(x) = Σ_{i=1}^{7} x_i h_i + δ(x) · A_comm + A_clk,    (22)

Here, A_comm represents the PL resources required to implement the Xilinx® (San Jose, CA, USA) Intellectual Property (IP) core AXI4-Lite Interface Wrapper, which enables PL–PS communication through the AXI protocol. The term A_clk represents the additional hardware required for generating the system clock signals and is proposed as follows:

A_clk = A_62M + A_54k,    (23)
where A_62M and A_54k denote the resources required by the frequency dividers that generate the 62.5 MHz and 54 kHz signals, respectively. It should be noted that Equation (22) works for LUTs or FFs. After implementing each module individually, the hardware metrics summarized in Table 6 were obtained.
Regarding hardware multipliers (H_M), the formulation follows the same structure as Equation (22). However, since neither the frequency dividers (A_62M, A_54k) nor the AXI interface wrapper consume DSP resources—as shown in Table 6—the expression proposed simplifies to:

H_M(x) = Σ_{i=1}^{7} x_i d_i.    (24)
4.4. Performance Estimation of Modified Objective Functions
To validate the proposed objective functions, area and memory usage are estimated for fully hardware and fully software implementations. These estimations are then compared with the actual results obtained after implementation in Vivado® 2020.2 and building in Vitis™, respectively. Table 7 presents the results related to hardware area consumption. Based on these reference values, the relative estimation error was calculated, yielding 2.11% for LUTs, 1.26% for FFs, and 0% for DSPs. These results indicate good accuracy, particularly the 2.11% error for LUT estimation, which compares favorably with the 2.24% LUT area estimation error reported in [19].
Regarding memory usage, Table 8 shows the estimation results. A relative error of 1.16% was obtained, which is considered satisfactory given the complexity involved in estimating memory consumption. Moreover, to the best of the authors’ knowledge, there is a lack of prior work providing comparative data for this metric. While memory usage is briefly addressed in a few works such as [19,21], the handling of this metric is generally not detailed. These works typically present only the objective function (e.g., Equation (7)) without discussing the accuracy or performance of the corresponding estimations.
After obtaining the hardware and software module metrics and validating the performance of the proposed objective functions, the next step is the solution search phase, which is carried out using the NSGA-II multi-objective optimization algorithm.
4.5. Obtaining the Pareto Front Using NSGA-II
To search for solutions that satisfy the imposed constraints, the Pareto front obtained through the NSGA-II algorithm is used; this algorithm has shown excellent results in previous works such as [19,22] for addressing HW/SW partitioning problems. Additionally, since NSGA-II is a genetic algorithm, it benefits from a chromosome-based representation, which enables simpler encoding and clearer visualization of solutions.
In the context of genetic algorithms, chromosomes, or individuals within a generation, are composed of n basic units called genes. For our case, n is set to 7, corresponding to the number of system modules, as illustrated in Figure 2. Each gene represents a module to be implemented and is encoded using a binary variable x_i. This variable determines the implementation type for the corresponding module: when x_i = 1, the module is implemented in hardware; when x_i = 0, the implementation is in software. Therefore, each chromosome represents a specific partitioning solution.
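Since each chromosome is a binary vector, some properties of a candidate partition can be read directly from it. For example, the number of 0/1 transitions, used later in this paper as a rough indicator of the number of HW/SW communication interfaces, can be counted with a short C routine (a sketch; the helper name is ours):

```c
#include <stddef.h>

/* Count the HW/SW boundary transitions along a binary chromosome,
 * i.e., the number of adjacent gene pairs with different values. */
int count_transitions(const int *chrom, size_t n) {
    int t = 0;
    for (size_t i = 1; i < n; i++)
        if (chrom[i] != chrom[i - 1])
            t++;
    return t;
}
```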
Now, regarding the implementation of NSGA-II, Algorithm 1 shows a compact version of the standard NSGA-II workflow (adapted from [23]). The full algorithm was implemented following the canonical steps, but with two key modifications required for binary chromosome representations. First, the original real-coded chromosome initialization was replaced with a binary initialization procedure (Algorithm 2). Second, the genetic operators were adapted to binary encoding instead of the real-coded SBX operator used in the classical NSGA-II. The modified crossover and mutation operators are described in Algorithm 3.
Algorithm 1 NSGA-II Main Loop (Adapted from [23])
Require: Population size N, number of generations G
Ensure: Final non-dominated set
1: P_0 ← InitializePopulation(N)    ▹ Uses Algorithm 2
2: EvaluateObjectives(P_0); Q_0 ← ∅
3: for t ← 0 to G − 1 do
4:     R_t ← P_t ∪ Q_t
5:     F ← FastNonDominatedSort(R_t)
6:     P_{t+1} ← ∅
7:     i ← 1
8:     while |P_{t+1}| + |F_i| ≤ N do
9:         ComputeCrowdingDistance(F_i)
10:        P_{t+1} ← P_{t+1} ∪ F_i
11:        i ← i + 1
12:    end while
13:    ComputeCrowdingDistance(F_i)
14:    P_{t+1} ← P_{t+1} ∪ SelectBest(F_i, N − |P_{t+1}|)
15:    Q_{t+1} ← GeneticOperators(P_{t+1})    ▹ Uses Algorithm 3
16:    EvaluateObjectives(Q_{t+1})
17: end for
return Final non-dominated solutions in P_G
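At the core of FastNonDominatedSort is a pairwise dominance test. For the two objectives minimized in this work (hardware area and memory usage), it can be sketched in C as follows; the function name and argument layout are illustrative.

```c
#include <stdbool.h>

/* Pareto dominance for two minimized objectives: solution a dominates b
 * if it is no worse in both objectives and strictly better in at least one. */
bool dominates(double area_a, double mem_a, double area_b, double mem_b) {
    bool no_worse = (area_a <= area_b) && (mem_a <= mem_b);
    bool strictly_better = (area_a < area_b) || (mem_a < mem_b);
    return no_worse && strictly_better;
}
```

The non-dominated front is then the set of individuals that no other individual dominates.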
Algorithm 2 Binary Initialization of Chromosomes (Modified)
Require: Population size N, chromosome length L
Ensure: Population P
1: P ← ∅
2: for j ← 1 to N do
3:     Create chromosome c = (c_1, …, c_L)
4:     for i ← 1 to L do
5:         c_i ← RandomBit()    // Uniform
6:     end for
7:     P ← P ∪ {c}
8: end for
return P
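A minimal C sketch of this binary initialization, storing the population as a flat row-major array and using the standard rand() generator as a stand-in for RandomBit(), is:

```c
#include <stdlib.h>
#include <stddef.h>

/* Fill a population of pop_size chromosomes of length len with uniform
 * random bits. pop must point to pop_size * len ints (row-major). */
void init_population(int *pop, size_t pop_size, size_t len, unsigned seed) {
    srand(seed);
    for (size_t j = 0; j < pop_size; j++)
        for (size_t i = 0; i < len; i++)
            pop[j * len + i] = rand() % 2;   /* RandomBit() */
}
```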
Algorithms 2 and 3 correspond to the components modified in this work to support binary encoding. All other steps (ranking, crowding distance, fast non-dominated sorting, and selection) follow the original NSGA-II procedure described in [24]. Finally, the NSGA-II algorithm was implemented in Matlab® with the following configuration parameters: 50 generations, a population size of 30, a crossover rate of 0.9, and a mutation rate of 0.1. The resulting Pareto front obtained after execution is depicted in Figure 3, where the design constraints introduced at the beginning of this section are also illustrated.
Each blue dot in the figure represents an individual. The region below the black line and to the left of the gray line corresponds to the set of individuals that satisfy both constraints. These individuals are identified as feasible solutions. The next step is to select the most suitable solution for implementation, which is guided by analyzing the characteristics of each candidate, as shown in Table 9.
Algorithm 3 Binary Genetic Operators (Modified)
Require: Parent population P, crossover rate p_c, mutation rate p_m
Ensure: Offspring population Q
1: Q ← ∅
2: while |Q| < |P| do
3:     Select parents a, b via binary tournament
4:     if Random() < p_c then
5:         k ← RandomInteger(1, L − 1)
6:         c1 ← (a_1, …, a_k, b_{k+1}, …, b_L)
7:         c2 ← (b_1, …, b_k, a_{k+1}, …, a_L)
8:     else
9:         c1 ← a
10:        c2 ← b
11:    end if
12:    for i ← 1 to L do
13:        if Random() < p_m then
14:            c1_i ← 1 − c1_i    ▹ Bit-flip mutation
15:        end if
16:    end for
17:    for i ← 1 to L do
18:        if Random() < p_m then
19:            c2_i ← 1 − c2_i
20:        end if
21:    end for
22:    Q ← Q ∪ {c1, c2}
23: end while
return Q
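The single-point crossover and bit-flip mutation of Algorithm 3 can be sketched in C as follows; rand01() is a hypothetical uniform generator, and the cut point is passed explicitly rather than drawn inside the routine.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical uniform generator in [0, 1). */
static double rand01(void) { return rand() / (RAND_MAX + 1.0); }

/* Single-point crossover: children copy one parent up to `cut` and the
 * other parent afterwards. */
void crossover(const int *a, const int *b, int *c1, int *c2,
               size_t len, size_t cut) {
    for (size_t i = 0; i < len; i++) {
        c1[i] = (i < cut) ? a[i] : b[i];
        c2[i] = (i < cut) ? b[i] : a[i];
    }
}

/* Bit-flip mutation with probability pm per gene. */
void mutate(int *chrom, size_t len, double pm) {
    for (size_t i = 0; i < len; i++)
        if (rand01() < pm)
            chrom[i] = 1 - chrom[i];
}
```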
Based on the analysis of the optimization results, solution S1 is selected as the best candidate. It requires the least hardware area (in terms of LUTs, FFs, and DSPs). Compared to S4, the configuration with the lowest memory usage, S1 achieves an 11.82% reduction in LUTs and a 10.23% reduction in FFs, at the cost of only a 0.28% increase in memory usage. Regarding DSP utilization (HM), both S1 and S2 use 16% fewer DSPs than S3 and S4.
Finally, when comparing S1 and S2, both solutions are similar in terms of metrics; however, S1 requires fewer communications between the Processing System (PS) and the Programmable Logic (PL). This is inferred from the number of transitions (from 0 to 1 or vice versa) in the chromosome configuration, which corresponds to communication interfaces. S1 has four transitions, while S2 has five, making S1 slightly simpler to implement in the final system. Based on these observations, solution S1 is selected for implementation.