1. Introduction
Convolution operations are a fundamental computational primitive in embedded signal processing, digital filtering, and machine learning inference. They appear in applications ranging from wearable sensing platforms and biomedical instrumentation to embedded vision and latency measurement systems. Although convolution can be implemented efficiently on general-purpose processors, the increasing demand for deterministic execution timing and energy-efficient operation in resource-constrained field-programmable gate array (FPGA) platforms motivates the development of specialized architectural support for convolution workloads.
This work presents a convolution-aware soft processor architecture for resource-constrained wearable systems, designed specifically to improve the execution efficiency of mixed-precision convolution workloads. The proposed processor integrates a two-lane 32-bit SIMD integer datapath with a custom IEEE-754 [1] single-precision floating-point convolution pipeline that distributes multiplication and accumulation across execution stages. Unlike conventional fused multiply–accumulate implementations, the proposed pipeline overlaps accumulation with operand streaming from memory, which improves arithmetic unit utilization while maintaining compatibility with single-cycle block RAM access.
To support the concurrent execution of control logic and convolution kernels, the processor employs separate integer and floating-point register files, enabling simultaneous address generation, operand fetch, and accumulation operations without structural hazards. Convolution kernels are implemented as assembly-level macro routines callable from C, allowing deterministic scheduling of vector and floating-point execution resources without extending the base instruction set architecture.
Simple soft-core processors implemented on FPGAs provide an attractive solution for embedded applications requiring flexibility and tight integration with custom hardware logic. Widely used soft processors such as MicroBlaze offer configurable execution pipelines and floating-point support; however, they remain general-purpose architectures that do not explicitly optimize convolution execution. As a result, convolution workloads implemented on such processors typically suffer from limited arithmetic unit utilization and inefficient memory-access scheduling.
At the opposite end of the performance spectrum, dedicated accelerator overlays such as the Deep Processing Unit (DPU) provide high-throughput convolution capability for neural network inference. While these architectures achieve substantial acceleration, they require significant logic, memory, and DSP resources, restricting their deployment to larger FPGA platforms. In low-resource devices such as the Artix-7 XC7A35T used in wearable and portable embedded systems, accelerator-class solutions may exceed available resources or introduce unnecessary architectural complexity.
1.1. Motivation for Wearable Signal Processing Workloads
Convolutions are a fundamental operation in wearable biomedical and inertial sensing pipelines, where they are commonly used for filtering and feature extraction from physiological and motion signals. Typical examples include electrocardiogram (ECG) denoising and QRS detection [2,3], electromyography (EMG) envelope extraction [4], inertial measurement unit (IMU) activity classification [5], and bioimpedance signal conditioning [6]. Unlike accelerator-class solutions that rely on separate DSP pipelines or external coprocessors, integrating convolution support directly within the processor datapath reduces memory traffic and control overhead while improving arithmetic utilization. As a result, the proposed architecture targets embedded wearable platforms where efficient streaming convolution must be performed locally with limited FPGA resources.
In contrast, conventional soft-core solutions such as MicroBlaze rely on scalar execution combined with external DSP primitives, which introduces additional register-transfer overhead and increases memory access pressure during sliding-window convolution operations.
1.2. Outline
The rest of this paper is organized as follows: Section 2 surveys related work. Section 3 introduces the processor architecture, and Section 4 details the implementation methodology. Section 5 presents the experimental results, followed by a general discussion in Section 6. Section 7 provides concluding remarks.
3. Processor Architecture
The proposed processor is a convolution-aware soft-core architecture designed to accelerate convolution workloads while maintaining compatibility with a general-purpose embedded execution model. The processor follows a five-stage pipelined organization derived from the classical RISC pipeline structure, with extensions that introduce dedicated execution paths for floating-point accumulation and SIMD integer convolution operations.
Figure 1 illustrates the top-level datapath organization of the processor.
The processor separates scalar integer, scalar floating-point, and vector execution domains through independent register files and specialized functional units. This organization enables overlapping arithmetic and memory operations during convolution execution while preserving compatibility with standard control and arithmetic instruction sequences.
3.1. Pipeline Organization
The processor implements a five-stage pipeline consisting of instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and write-back (WB) stages. Instruction fetch operations are performed from on-chip block memory, while decode logic generates control signals for scalar, vector, and floating-point execution paths.
Unlike conventional scalar soft processors, the execution stage includes multiple parallel functional units supporting scalar arithmetic operations, SIMD integer multiplication, and floating-point convolution primitives. Hazard detection and forwarding logic are incorporated to maintain pipeline throughput during dependent instruction sequences, particularly for multiply–accumulate operations spanning multiple pipeline stages.
This staged organization enables partial overlap between multiplication, accumulation, and operand fetch operations during convolution execution, reducing pipeline stalls compared to conventional scalar floating-point pipelines.
3.2. Register File Organization
To support heterogeneous execution domains, the processor maintains three independent register files: an integer register file for scalar arithmetic and control operations, a floating-point register file compliant with IEEE-754 single-precision operands, and a vector register file supporting two-lane SIMD integer operations.
The separation of register files allows simultaneous operand access across execution units without introducing structural hazards. This organization is particularly beneficial during convolution workloads, where scalar control logic, floating-point accumulation, and SIMD integer multiplication may proceed concurrently.
3.3. Floating-Point Convolution Datapath
A key architectural feature of the processor is the split-stage floating-point convolution pipeline. Unlike conventional scalar floating-point units that perform multiplication and accumulation within a single execution stage, the proposed processor separates these operations across consecutive pipeline stages. As shown in Figure 2, floating-point multiplication is performed in the EX stage, while accumulation is completed in the MEM stage using a dedicated floating-point adder. This staged multiply–accumulate organization enables overlap between arithmetic operations and operand fetch, improving execution efficiency during convolution workloads.
Because accumulation occurs in a later pipeline stage, the processor can issue subsequent instructions without waiting for the accumulation to complete. This reduces pipeline idle cycles and increases arithmetic unit utilization during convolution execution sequences. In addition to the convolution datapath, a scalar floating-point execution unit supports the conversion, comparison, and memory-transfer instructions required for compatibility with the RV32F instruction subset. These operations include floating-point load/store instructions as well as format conversion and comparison primitives used in control-flow decisions.
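The overlap between the EX-stage multiplier and the MEM-stage adder can be illustrated with a small behavioural model. This is a minimal sketch under stated assumptions, not the implemented hardware: it assumes that one operand pair enters the multiplier per cycle and that the adder consumes the product latched in the previous cycle. The function name `split_stage_mac`, the pipeline-register variable, and the cycle accounting are hypothetical, and Python double-precision floats stand in for single-precision values.

```python
def split_stage_mac(samples, coeffs):
    """Behavioural model of a two-stage multiply-then-accumulate pipeline.

    Assumption: one product enters the multiply (EX) stage per cycle and the
    accumulate (MEM) stage adds the product latched on the previous cycle.
    """
    acc = 0.0
    ex_mem_product = None          # pipeline register between EX and MEM
    cycles = 0
    pairs = list(zip(samples, coeffs))
    idx = 0
    while idx < len(pairs) or ex_mem_product is not None:
        # MEM stage: accumulate the product computed in the previous cycle.
        if ex_mem_product is not None:
            acc += ex_mem_product
        # EX stage: multiply the next operand pair, if any remain.
        if idx < len(pairs):
            x, h = pairs[idx]
            ex_mem_product = x * h
            idx += 1
        else:
            ex_mem_product = None
        cycles += 1
    return acc, cycles


acc, cycles = split_stage_mac([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

In this model, K multiply–accumulate operations complete in K + 1 cycles, because each accumulation proceeds in the same cycle as the following multiplication; the extra cycle drains the last product.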
One limitation of the split-stage floating-point accumulation pipeline is reduced numerical precision compared to IEEE-754 fused multiply–add (FMA). The multiplication result is rounded before accumulation because the operations are separated across pipeline stages, introducing additional intermediate rounding error. In contrast, a fused MAC performs the multiply and add with a single final rounding step. During testing, this loss in precision was considered an acceptable trade-off for improved hardware simplicity. One possible mitigation would be to use a wider accumulator (e.g., 64-bit accumulation for 32-bit operands) to preserve more intermediate precision and reduce the accumulated rounding error.
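The effect of the intermediate rounding step can be reproduced in software. The following sketch is illustrative only: the helper names (`f32`, `mac_split`, `mac_fused`) and the operand values are assumptions chosen to expose the difference, and the fused variant is emulated in double precision, which is exact for the operands used here but is not a general-purpose FMA model.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mac_split(a, b, c):
    """Split-stage MAC: round the product to single precision, then add."""
    product = f32(a * b)           # first rounding (after multiplication)
    return f32(product + c)        # second rounding (after accumulation)


def mac_fused(a, b, c):
    """Emulated fused MAC: a single rounding after multiply and add.

    The binary64 product of two binary32 values is exact; the addition is
    exact for the operands below, so only the final f32() rounds.
    """
    return f32(a * b + c)


a = f32(1.0 + 2.0 ** -13)          # exactly representable in binary32
c = -f32(1.0 + 2.0 ** -12)         # cancels the rounded product
split = mac_split(a, a, c)         # low-order product bit is lost
fused = mac_fused(a, a, c)         # low-order product bit survives
```

Here the exact product is 1 + 2^-12 + 2^-26. The split-stage path rounds away the 2^-26 term before the addition and returns zero, whereas the single-rounding path returns 2^-26.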
3.4. SIMD Integer Convolution Datapath
The processor also includes a two-lane SIMD 32-bit integer execution path designed to accelerate integer convolution workloads. This datapath performs parallel 32-bit multiplications and additions within a single cycle using dedicated DSP resources mapped to vector multiply–accumulate operations, as shown in Figure 3. Vector instructions operate on operands stored in the vector register file and are invoked through custom macro-instruction sequences integrated into the software toolchain. These instructions enable efficient execution of fixed-point convolution kernels commonly used in embedded signal-processing applications. By supporting both floating-point and SIMD integer convolution pipelines, the architecture provides flexibility across multiple precision domains while maintaining low hardware complexity relative to accelerator-class solutions.
Custom SIMD Instruction Set Extension
To expose the SIMD datapath to software, the processor implements a lightweight custom instruction set extension integrated into the base RISC-V execution model. These instructions map directly to dedicated vector execution hardware and enable efficient multiply–accumulate operations required by convolution workloads while preserving compatibility with standard toolchain workflows through assembler macro wrappers.
The vector multiply–accumulate operation is implemented through the vmac macro as shown in Listing 1, which performs lane-wise multiplication followed by accumulation into the destination vector register. In addition, the vadd macro provides vector addition between two source vector operands, while vmove transfers data from an integer register into a vector register to support pointer initialization and operand setup.
Listing 1. Custom vector MAC encoding.

    .macro vmac rd, rs1, rs2
        .insn r 0x5B, 0x00, 0x01, \rd, \rs1, \rs2
    .endm
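To make the listing concrete, the sketch below packs the fields of the `.insn r` directive into a standard RISC-V R-type instruction word and models the lane-wise behaviour described for `vmac`. It is a hedged illustration: the function names are hypothetical, the register numbers are arbitrary, and the assumption that each lane keeps only the low 32 bits of the result (two's-complement wraparound) is not stated in the text.

```python
MASK32 = 0xFFFFFFFF


def encode_vmac(rd, rs1, rs2):
    """Pack the R-type fields used in Listing 1 into a 32-bit word.

    Field values taken from the listing: opcode 0x5B, funct3 0x00, funct7 0x01.
    """
    opcode, funct3, funct7 = 0x5B, 0x00, 0x01
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def vmac(vd, vs1, vs2):
    """Two-lane multiply-accumulate: vd[i] <- vd[i] + vs1[i] * vs2[i].

    Assumption: each lane wraps to 32 bits.
    """
    return [to_signed32(d + a * b) for d, a, b in zip(vd, vs1, vs2)]


word = encode_vmac(rd=1, rs1=2, rs2=3)   # arbitrary register numbers
acc = vmac([0, 0], [3, -4], [5, 6])      # first tap
acc = vmac(acc, [1, 2], [10, 10])        # second tap, coefficient shared
```

With these inputs the two lanes accumulate 3·5 + 1·10 and (−4)·6 + 2·10 respectively.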
Memory access support for vector execution is provided through specialized load and store macros. The vload instruction loads two contiguous values from memory into a vector register, enabling efficient access to adjacent input samples during convolution. The vsload instruction loads a single scalar value from memory and replicates or inserts it into a vector register, which is particularly useful for kernel coefficients shared across SIMD lanes. The vstore instruction writes vector results back to memory.
Additional vector-immediate macros support address calculation and loop-index updates. The vaddi macro performs vector addition with an immediate operand, while vslli applies lane-wise logical left shifts to generate byte offsets for memory addressing. The vcaddi macro is similar to vaddi, but instead of writing the same result to all lanes, it adds a per-lane increment to the result (+0 for lane 1, +1 for lane 2). Finally, vauipc is used for vector-aware upper-immediate address construction.
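The way these macros could cooperate in a sliding-window kernel is sketched below as a behavioural model that produces two adjacent output samples per pass. The exact operand semantics are assumptions: `vcaddi` is modelled as adding the immediate plus the lane index, `vsload` as replicating the scalar into both lanes, memory as a list of 32-bit words, and `conv_pair` is a hypothetical helper rather than a routine from the support library. The window is applied without flipping the kernel, which is equivalent to convolution for symmetric or unit-valued coefficients.

```python
def vcaddi(vs, imm):
    """Lane-indexed add: lane i receives vs[i] + imm + i (assumed semantics)."""
    return [v + imm + lane for lane, v in enumerate(vs)]


def vslli(vs, shamt):
    """Lane-wise logical left shift, e.g. word index -> byte offset."""
    return [(v << shamt) & 0xFFFFFFFF for v in vs]


def vload(memory, byte_addr):
    """Load two contiguous 32-bit words starting at byte_addr."""
    word = byte_addr // 4
    return [memory[word], memory[word + 1]]


def vsload(memory, byte_addr):
    """Load one 32-bit word and replicate it into both lanes (assumed)."""
    return [memory[byte_addr // 4]] * 2


def vmac(vd, vs1, vs2):
    """Lane-wise multiply-accumulate (wraparound omitted for brevity)."""
    return [d + a * b for d, a, b in zip(vd, vs1, vs2)]


def conv_pair(x_mem, h_mem, n, taps):
    """Compute outputs n and n + 1 of a sliding-window sum of products."""
    acc = [0, 0]
    for k in range(taps):
        index = vcaddi([n, n], k)          # [n + k, n + k + 1]
        offsets = vslli(index, 2)          # byte offsets of 32-bit words
        x_vec = vload(x_mem, offsets[0])   # x[n + k], x[n + k + 1]
        h_vec = vsload(h_mem, 4 * k)       # h[k] in both lanes
        acc = vmac(acc, x_vec, h_vec)
    return acc


pair = conv_pair([1, 2, 3, 4, 5, 6], [1, 1, 1], n=0, taps=3)
```

Each loop iteration issues one contiguous two-word load for the input samples and one replicated load for the shared coefficient, so both lanes are fed from a single pair of memory accesses.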
3.5. Hazard Detection and Forwarding Network
To sustain throughput during convolution execution, the processor implements separate hazard detection and forwarding networks for the floating-point datapath and the integer datapath (including scalar and vector operations). Each path maintains independent validity and dependency flags to prevent incorrect operand forwarding and eliminate spurious hazard conditions across mixed execution streams. The floating-point datapath is optimized to minimize hazards by exploiting branch–flush intervals, enabling results to be written and forwarded just prior to the next MAC instruction, and effectively hiding pipeline latency during chained convolution operations.
3.6. DSP-Based Execution Units
Dedicated DSP resources are used to implement both scalar and vector multiply–accumulate operations. Figure 4 shows the internal architecture of the DSP slice and how it is used to implement MAC operations. Scalar DSP units support floating-point operations within the convolution datapath, while vector DSP units provide parallel integer multiplication and addition for SIMD execution. Mapping convolution primitives directly onto FPGA DSP slices reduces logic utilization and improves arithmetic throughput compared to LUT-based implementations. This approach enables the processor to achieve significant convolution acceleration while remaining within the resource constraints of mid-range Artix-7 devices such as the XC7A35T.
4. Implementation Details
4.1. FPGA Implementation
The proposed convolution-aware processor was implemented on a Digilent Basys3 development board featuring an Artix-7 XC7A35T FPGA. This device provides a compact logic fabric suitable for evaluating lightweight processor architectures intended for wearable and resource-constrained embedded systems while still offering dedicated DSP resources for arithmetic acceleration.
The processor operates at a 50 MHz system clock derived from the onboard 100 MHz oscillator through a clock-division stage. This operating frequency was selected to ensure stable timing closure while enabling direct comparison with a MicroBlaze soft processor synthesized under identical clock conditions and floating-point precision support.
Instruction and data memories were implemented using on-chip block RAM resources configured for single-cycle access latency. This organization aligns with the processor pipeline structure and enables operand fetch operations to overlap with floating-point accumulation during convolution execution.
The processor was synthesized and implemented using Vivado 2025.2, and the resources utilized are summarized in Table 2. Post-implementation timing analysis confirmed stable operation at 50 MHz with positive slack across all critical paths, indicating that the split-stage multiply–accumulate pipeline does not introduce timing bottlenecks on the XC7A35T device.
The moderate LUT utilization reflects the inclusion of dedicated SIMD datapaths, independent floating-point accumulation hardware, and hazard-control logic required to support concurrent scalar, vector, and floating-point execution domains. These structures enable overlap between multiplication, accumulation, and operand streaming operations, which directly contributes to reduced convolution execution latency.
Despite supporting both floating-point and SIMD integer multiply–accumulate execution, the processor requires only seven DSP48E1 slices. This modest DSP utilization results from distributing multiplication and accumulation across pipeline stages rather than replicating independent accelerator units, allowing efficient arithmetic acceleration within the resource limits of the XC7A35T device.
Post-implementation power analysis of the proposed processor was performed using the Vivado 2025.2 power estimator under identical operating conditions on the XC7A35T device, and the power consumption results are summarized in Table 3. The estimated dynamic power consumption reflects the activity of the SIMD datapaths, floating-point execution pipeline, and hazard-control structures required to sustain concurrent arithmetic operations across multiple execution domains. Additional switching activity arises from operand routing between pipeline stages and the register files supporting scalar, vector, and floating-point execution. Despite the inclusion of these parallel execution resources, the overall power consumption remains low for the XC7A35T device, indicating that the proposed architecture achieves increased computational throughput without requiring excessive arithmetic or memory resources.
Post-implementation resource and power analyses were also performed for the MicroBlaze baseline, with the resource utilization results shown in Table 4 and the power consumption summarized in Table 5. Compared with the MicroBlaze configuration evaluated under identical operating conditions, the proposed processor requires approximately twice the LUT resources while maintaining comparable DSP utilization. Both the proposed processor and the MicroBlaze baseline rely on on-chip block RAM (BRAM) resources for instruction and data storage, resulting in comparable memory organization between the two implementations. Despite the increase in logic utilization, the overall resource footprint remains well within the capacity of the XC7A35T device, leaving sufficient headroom for future architectural extensions such as additional vector lanes, specialized acceleration units, or expanded memory interfaces.
A comparison of the two implementations shows that the proposed processor exhibits a moderate increase in total on-chip power consumption, due primarily to the higher dynamic switching activity associated with the SIMD datapaths and floating-point execution pipeline. While the static power component remains unchanged between the two implementations, the additional dynamic power reflects the increased arithmetic concurrency and operand routing required to support convolution-aware execution. Despite this increase, the higher computational throughput achieved by the proposed architecture results in improved energy efficiency per convolution operation.
4.2. Software Toolchain Integration
The process of programming the processor is shown in Figure 5. Application programs targeting the proposed processor are compiled using a bare-metal RISC-V toolchain [24] and linked against a lightweight support library that exposes memory-mapped control registers for the scalar, vector, and floating-point execution domains, as well as a timer and the UART Rx/Tx with handshake protocols. Convolution kernels are implemented as assembly-level macro routines callable from C, enabling deterministic scheduling of SIMD and floating-point operations without modifying the base instruction set architecture. The compilation flow produces executables that are converted into block RAM initialization files (.coe), allowing programs to be embedded directly into on-chip instruction memory at synthesis time. Additional Tcl scripts are provided for building the Vivado 2025.2 project and reprogramming the firmware. This approach eliminates the need for external program loaders and supports repeatable deployment of convolution workloads on the FPGA platform.
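The conversion from a compiled binary image to a block RAM initialization file can be sketched as follows. This is a hypothetical stand-in for that step, not the script used in this work: the function name `binary_to_coe`, the 32-bit little-endian word layout, and the zero padding are assumptions. The two keywords written to the file follow the standard Xilinx COE syntax.

```python
import struct


def binary_to_coe(image: bytes) -> str:
    """Format a raw little-endian binary image as .coe text, one word per line.

    Assumptions: 32-bit memory width, hexadecimal radix, zero padding to a
    whole number of words.
    """
    padded = image + b"\x00" * (-len(image) % 4)
    words = struct.unpack("<%dI" % (len(padded) // 4), padded)
    body = ",\n".join("%08X" % w for w in words)
    return ("memory_initialization_radix=16;\n"
            "memory_initialization_vector=\n" + body + ";\n")


# Two RISC-V instructions as they appear in a little-endian binary image:
# addi x1, x0, 0 (0x00000093) followed by nop (0x00000013).
coe_text = binary_to_coe(bytes.fromhex("9300000013000000"))
```

The resulting text can be written to a file and referenced from the block memory configuration so that the program is present in instruction memory after configuration.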
5. Experimental Results
The performance of the proposed convolution-aware processor was evaluated using one-dimensional floating-point convolution workloads and compared against a MicroBlaze soft processor configured with hardware single-precision floating-point support. Both processors were implemented on the same Artix-7 XC7A35T FPGA device using a Digilent Basys3 development board and operated at an identical clock frequency of 50 MHz to ensure a fair architectural comparison.
The benchmark consisted of convolving an input signal of length 1024 with kernels of varying size composed of unit-valued coefficients. This benchmark is commonly used in wearable systems to detect signal patterns. All experiments were executed using IEEE-754 single-precision arithmetic, and the execution time was measured in processor clock cycles.
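For reference, the benchmark computation can be expressed as a short software model. This is a minimal sketch, not the assembly kernel evaluated on the processor: the function name `conv1d_valid` and the synthetic input signal are illustrative assumptions, and Python double-precision floats are used instead of single-precision arithmetic.

```python
def conv1d_valid(signal, kernel):
    """One-dimensional convolution restricted to fully overlapping windows.

    For an input of length N and a kernel of length K, N - K + 1 samples
    are produced.
    """
    n, k = len(signal), len(kernel)
    flipped = kernel[::-1]                 # convolution flips the kernel
    return [sum(signal[i + j] * flipped[j] for j in range(k))
            for i in range(n - k + 1)]


# Hypothetical input of length 1024; unit-valued kernels as in the benchmark.
signal = [float(i % 8) for i in range(1024)]
outputs = {k: conv1d_valid(signal, [1.0] * k) for k in (3, 5, 7, 9, 11)}
```

With unit-valued coefficients each output is a moving sum over the kernel window, so the kernel length directly sets the number of multiply–accumulate operations per output sample.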
To evaluate scalability with respect to convolution depth, experiments were performed using kernel sizes of 3, 5, 7, 9, and 11. The results are summarized in Figure 6 and Figure 7. The proposed processor consistently outperformed the MicroBlaze implementation across all evaluated configurations, by at least
.
The performance advantage increases slightly with kernel size, as shown in Figure 8, indicating improved arithmetic utilization as the accumulation depth grows. This behavior reflects the effectiveness of the split-stage floating-point pipeline, which overlaps multiplication and accumulation across execution stages while simultaneously streaming operands from memory. In contrast, the MicroBlaze processor executes convolution using a scalar floating-point pipeline without convolution-aware scheduling support, resulting in reduced pipeline utilization during extended accumulation sequences.
The execution speedup is computed as

$$S = \frac{T_{\mathrm{MB}}}{T_{\mathrm{prop}}},$$

where $T_{\mathrm{MB}}$ corresponds to the execution time measured on the MicroBlaze processor and $T_{\mathrm{prop}}$ corresponds to the execution time of the proposed convolution-aware processor under identical clock frequency and precision conditions.
To further normalize performance across kernel sizes, the average number of cycles per output sample was computed; the results are shown in Figure 9. For an input signal of length $N$ and a kernel of length $K$, the number of valid output samples is $N - K + 1$. Therefore, the average cycles per output sample is defined as

$$\mathrm{CPO} = \frac{C_{\mathrm{total}}}{N - K + 1},$$

where $C_{\mathrm{total}}$ is the total measured cycle count for the convolution. Using $N = 1024$, the results show that the proposed processor consistently requires substantially fewer cycles per output sample than the MicroBlaze implementation, with the performance gap increasing as the kernel length grows. This trend reflects improved arithmetic utilization of the split-stage floating-point accumulation pipeline and the SIMD datapath, which reduce pipeline stalls during longer convolution sequences and enable more efficient overlap between multiplication, accumulation, and operand fetch operations.
The energy per output sample, shown in Figure 10, was computed as follows:

$$E_{\mathrm{out}} = \frac{P \cdot T}{N - K + 1},$$

where $P$ is the measured implementation power, $T$ is the convolution runtime, $N$ is the input length, and $K$ is the kernel size. These results demonstrate that the proposed processor achieves substantial improvements in floating-point convolution efficiency while remaining compatible with resource-constrained FPGA platforms.
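The three evaluation metrics (speedup as the ratio of baseline to proposed execution time, cycles per output sample, and energy per output sample, each normalized by the N − K + 1 valid outputs) can be computed from raw cycle counts as sketched below. The function names are illustrative, and the cycle counts and power value in the example are hypothetical placeholders, not measured results; the only value taken from the text is the 50 MHz clock.

```python
CLOCK_HZ = 50_000_000  # system clock frequency stated in the text


def speedup(cycles_baseline, cycles_proposed):
    """Ratio of baseline to proposed execution time at equal clock frequency."""
    return cycles_baseline / cycles_proposed


def cycles_per_output(total_cycles, n, k):
    """Average cycles per valid output sample (N - K + 1 outputs)."""
    return total_cycles / (n - k + 1)


def energy_per_output(power_w, total_cycles, n, k, clock_hz=CLOCK_HZ):
    """Energy per valid output sample in joules: P * T / (N - K + 1)."""
    runtime_s = total_cycles / clock_hz
    return power_w * runtime_s / (n - k + 1)


# Hypothetical placeholder values, for illustration only (not measured data).
example_speedup = speedup(1_022_000, 102_200)
example_cpo = cycles_per_output(102_200, n=1024, k=3)
example_energy = energy_per_output(0.1, 102_200, n=1024, k=3)
```

Because both processors run at the same clock frequency, the ratio of cycle counts equals the ratio of execution times.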
6. Discussion
6.1. Architectural Performance Analysis
The experimental results demonstrate that the proposed processor achieves a consistent performance improvement over the MicroBlaze baseline across all evaluated convolution kernel sizes while operating at the same clock frequency and numerical precision. This improvement is primarily enabled by the split-stage floating-point accumulation pipeline, which allows multiplication and accumulation operations to overlap across pipeline stages, reducing execution stalls during convolution sequences.
While the split-stage floating-point accumulation pipeline improves throughput by overlapping multiplication and accumulation stages, it does not provide IEEE-754 fused multiply–add semantics. Intermediate multiplication results are rounded prior to accumulation, introducing additional rounding error compared to fused MAC implementations that perform a single final rounding step. For the evaluated convolution workloads, this reduction in numerical precision was considered an acceptable trade-off for improved pipeline simplicity and implementation efficiency. Future implementations could mitigate the accumulated rounding error through wider accumulation registers or extended-precision accumulation paths.
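The benefit of a wider accumulation path can be illustrated with a small numerical experiment. The sketch below is an assumption-laden illustration rather than a model of the implemented adder: the helper names are hypothetical, a binary64 accumulator stands in for the "wider accumulator", and the operand values are chosen so that every addend falls below half a unit in the last place of the single-precision running sum.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate_single(products, start):
    """Accumulate with a rounding to single precision after every addition."""
    acc = f32(start)
    for p in products:
        acc = f32(acc + f32(p))
    return acc


def accumulate_wide(products, start):
    """Accumulate in binary64 and round to single precision once at the end."""
    acc = float(start)
    for p in products:
        acc += f32(p)
    return f32(acc)


products = [0.5 * 1.0] * 8   # eight taps: sample 0.5, unit coefficient
start = 2.0 ** 24            # spacing between binary32 values is 2 here
narrow = accumulate_single(products, start)
wide = accumulate_wide(products, start)
```

In the single-precision accumulator each addend is rounded away and the sum never advances, while the wider accumulator retains all eight contributions and yields 2^24 + 4 after the final rounding.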
In addition to floating-point acceleration, the inclusion of a two-lane SIMD integer datapath further improves arithmetic throughput for convolution workloads by enabling parallel multiply–accumulate operations using FPGA DSP resources. Unlike conventional scalar soft processors, which execute convolution as a sequence of dependent arithmetic instructions, the proposed architecture increases arithmetic unit utilization through explicit support for convolution-oriented execution patterns.
Both the MicroBlaze baseline and the proposed processor were implemented on the same FPGA device and, therefore, had access to the same DSP48E1 resources. However, the MicroBlaze baseline was not configured with an equivalent convolution-specific multiply–accumulate datapath or split-stage floating-point accumulation pipeline. In the proposed processor, the performance improvement comes from the architectural organization of the datapath and pipeline, rather than from simply having access to DSP48 blocks. Therefore, the comparison reflects the benefit of the proposed architectural features under the evaluated configuration, not the maximum possible performance of a fully customized or hardware-extended MicroBlaze design.
The proposed architecture does not rely on conventional software loop unrolling. Instead, it improves convolution throughput through a custom datapath and split-stage multiply–accumulate pipeline that overlaps multiplication, accumulation, and operand preparation across pipeline stages.
6.2. Scalability and Resource Trade-Offs
The proposed architecture improves convolution throughput, but scalability introduces additional resource and memory-system constraints.
The observed increase in speedup with larger kernel sizes indicates that the processor benefits from improved pipeline efficiency as the convolution depth increases. This behavior confirms that distributing multiply–accumulate operations across multiple pipeline stages reduces the relative overhead associated with loop control and memory access operations, resulting in improved scalability for longer convolution filters.
The architecture is extensible, but additional parallelism is not obtained through simple duplication of integer, floating-point, or load/store pipelines. Increasing the SIMD width or adding additional MAC lanes could improve performance by enabling more convolution operations to be executed in parallel, particularly for larger kernels where more multiply–accumulate operations are available per output sample. However, such extensions require corresponding increases in register-file bandwidth, forwarding paths, hazard detection logic, load/store throughput, and memory-system bandwidth. Since FPGA BRAMs provide a limited number of read and write ports, additional execution lanes may become underutilized if the memory subsystem cannot supply operands at the required rate.
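The relationship between lane count and operand supply can be approximated with a simple bound. This is a rough, illustrative model under explicit assumptions: each lane consumes one new input word per cycle, the kernel coefficient is shared across lanes and not counted, and the memory delivers a fixed number of words per cycle (two for a dual-port block RAM with one word per port). The function name and parameters are hypothetical.

```python
def lane_utilization(lanes, words_per_cycle, words_per_lane_per_cycle=1):
    """Upper bound on MAC-lane utilization when limited by operand supply."""
    required = lanes * words_per_lane_per_cycle
    return min(1.0, words_per_cycle / required)


# Assumed dual-port memory supplying two words per cycle.
bounds = {lanes: lane_utilization(lanes, words_per_cycle=2)
          for lanes in (2, 4, 8)}
```

Under these assumptions two lanes can be kept fully busy, whereas four or eight lanes would be limited to one half or one quarter utilization unless banking, operand buffering, or window reuse raises the effective memory bandwidth.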
Future scalability must be co-designed with mechanisms such as BRAM banking, operand buffering, and sliding-window reuse. While these extensions could increase throughput and reduce the execution time, they would also increase DSP utilization, routing complexity, switching activity, and dynamic energy consumption. As a result, the current two-lane SIMD and split-stage floating-point implementation represents a balanced design point between performance improvement, resource usage, and energy efficiency for resource-constrained embedded workloads.
Although the proposed architecture increases LUT utilization relative to lightweight scalar soft processors, the additional hardware cost may be justified for wearable applications requiring continuous real-time signal processing or low-latency inference. However, applications dominated by long idle intervals may benefit less from the increased parallel hardware resources due to higher static resource utilization.
The implemented design operates at 50 MHz to ensure reliable timing closure on the Artix-7 FPGA platform. Floating-point multiplication, normalization, and accumulation introduce relatively long combinational paths that become increasingly difficult to close at higher frequencies without additional pipelining or architectural optimization. Although higher operating frequencies are likely achievable on more advanced FPGA platforms or through deeper pipelining, such modifications would introduce additional design complexity and verification overhead.
Compared with accelerator-class architectures such as the Xilinx DPU, the proposed processor achieves lower peak throughput but requires substantially fewer hardware resources and maintains compatibility with a lightweight embedded processor model. This makes the architecture suitable for mid-range FPGA platforms such as the Artix-7 XC7A35T, where accelerator overlays may exceed available DSP and memory resources.
These characteristics position the proposed processor as an intermediate solution between general-purpose soft processors and dedicated convolution accelerators, providing improved convolution performance while preserving flexibility for embedded signal-processing applications. While the proposed SIMD floating-point architecture is suitable for many convolution and DSP-oriented edge AI workloads, future edge inference applications may increasingly favor lower-precision arithmetic formats and tensor-oriented execution units to improve energy efficiency and throughput.
Additionally, because the current implementation relies entirely on on-chip BRAM resources, very large activation or weight tensors may eventually require external memory support, introducing additional memory hierarchy complexity and bandwidth overhead.
6.3. Limitations and Future Work
Despite the demonstrated performance improvements, the current implementation retains several architectural and evaluation limitations.
The current implementation targets fixed-width two-lane SIMD execution and relies on software-managed address generation for multidimensional convolution workloads; it does not support hardware address generation, dynamic precision switching, or out-of-order execution. Although higher-dimensional convolutions can be executed through standard indexing transformations, future work may explore hardware-supported window generation and operand reuse mechanisms to further improve efficiency for image-processing applications. Additionally, future evaluation will include comparisons against representative wearable-class embedded processors executing ECG, EMG, and IMU filtering pipelines in order to quantify energy efficiency under realistic sensing workloads. Such analysis would further clarify the suitability of the proposed architecture for continuous low-latency edge signal processing in resource-constrained wearable platforms.
Increasing the number of SIMD lanes would necessitate a more complex memory hierarchy. Two lanes were chosen because block RAM on the Artix-7 is dual-port by default.
The high LUT usage is a limitation that is not always justified for wearable systems that require continuous real-time signal processing or neural network inference. In certain cases, higher logic utilization can raise static power consumption, although this is not the case for this design when compared to MicroBlaze.
The current power and performance evaluation is primarily based on synthetic convolution benchmarks, which are intended to isolate datapath behavior and arithmetic throughput rather than represent full application workloads.
Ultra-low-power techniques such as clock gating and duty cycling are promising directions for future work to further reduce dynamic power consumption during periods of low computational activity.
The present work does not evaluate sensitivity to temperature, voltage, or process variations under wearable operating conditions. While the relatively modest operating frequency may provide additional timing margin, a full characterization across environmental and process corners remains an open area for future work.
7. Conclusions
This work presents a convolution-aware processor architecture designed to improve the execution efficiency of one-dimensional convolution workloads on resource-constrained FPGA platforms. The proposed design extends a conventional five-stage pipeline with a split-stage floating-point accumulation datapath and a two-lane SIMD 32-bit integer execution unit built on DSP units that perform multiplication and addition in the same cycle. Together, these extensions enable overlap between multiplication, accumulation, and operand fetch operations during convolution execution.
Experimental evaluation on a Digilent Basys3 FPGA demonstrated consistent performance improvements over a MicroBlaze soft processor operating under identical clock frequency and precision conditions. The proposed processor achieved speedups exceeding six-fold across multiple kernel sizes while maintaining a lightweight hardware footprint suitable for mid-range Artix-7 devices.
Unlike accelerator-class solutions such as the Xilinx DPU, which prioritize maximum throughput at the cost of increased resource utilization, the proposed architecture provides a balanced trade-off between performance and implementation complexity while preserving compatibility with embedded processor workflows.
Future work will investigate extensions to wider SIMD datapaths, as well as integration with higher-level compiler toolchains to enable automated mapping of convolution workloads to the proposed execution model.