Article

A Lightweight Convolution-Aware RISC-V Soft Processor for Intelligent Wearable Systems

by Fernando L. Pizarro Diaz *, Booker A. Robinson and Juan F. Patarroyo Montenegro *
Department of Computer Science and Engineering, University of Puerto Rico, Mayaguez, PR 00680, USA
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(11), 2399; https://doi.org/10.3390/electronics15112399
Submission received: 1 May 2026 / Revised: 25 May 2026 / Accepted: 28 May 2026 / Published: 1 June 2026
(This article belongs to the Special Issue Ubiquitous Computing and Mobile Computing)

Abstract

Resource-constrained wearable systems often need to execute signal-processing and AI workloads, and designs for such applications must balance several trade-offs. This paper presents a lightweight convolution-aware soft processor for embedded signal processing on resource-constrained wearable devices. The architecture represents a middle ground for signal-processing applications between dedicated accelerators and lightweight soft processors. It integrates a two-lane SIMD integer datapath with a split-stage IEEE-754 floating-point accumulation pipeline. The split-stage design enables overlap between multiplication, accumulation, and operand fetch, improving arithmetic utilization while maintaining low resource costs. The processor was implemented on the Artix-7-based Basys3 platform and evaluated using one-dimensional convolution workloads. The experimental results demonstrate a 6× speedup over MicroBlaze-class soft processors at the same static power (0.073 W) and with only 44% higher dynamic power consumption. The architecture achieves this with significantly fewer FPGA resources than accelerator-based solutions such as DPU overlays. The proposed architecture provides a practical alternative for wearable and resource-constrained FPGA systems requiring deterministic convolution performance, demonstrating a balanced design point for embedded wearable platforms where software-defined flexibility and convolution acceleration are both required.

1. Introduction

Convolution operations are a fundamental computational primitive in embedded signal processing, digital filtering, and machine learning inference. They appear in applications ranging from wearable sensing platforms and biomedical instrumentation to embedded vision and latency measurement systems. Although convolution can be implemented efficiently on general-purpose processors, the increasing demand for deterministic execution timing and energy-efficient operation in resource-constrained field-programmable gate array (FPGA) platforms motivates the development of specialized architectural support for convolution workloads.
This work presents a convolution-aware soft processor architecture for resource-constrained wearable systems. This processor architecture was designed specifically to improve the execution efficiency of mixed-precision convolution workloads. The proposed processor integrates a two-lane 32-bit SIMD integer datapath with a custom IEEE-754 [1] single-precision floating-point convolution pipeline. This pipeline distributes multiplication and accumulation across execution stages. Unlike conventional fused multiply–accumulate implementations, the proposed pipeline overlaps accumulation with operand streaming from memory. This overlapping improves arithmetic unit utilization while maintaining compatibility with single-cycle block RAM access.
To support the concurrent execution of control logic and convolution kernels, the processor employs separate integer and floating-point register files, enabling simultaneous address generation, operand fetch, and accumulation operations without structural hazards. Convolution kernels are implemented as assembly-level macro routines callable from C, allowing deterministic scheduling of vector and floating-point execution resources without extending the base instruction set architecture.
Simple soft-core processors implemented on FPGAs provide an attractive solution for embedded applications requiring flexibility and tight integration with custom hardware logic. Widely used soft processors such as MicroBlaze offer configurable execution pipelines and floating-point support; however, they remain general-purpose architectures that do not explicitly optimize convolution execution. As a result, convolution workloads implemented on such processors typically suffer from limited arithmetic unit utilization and inefficient memory-access scheduling.
At the opposite end of the performance spectrum, dedicated accelerator overlays such as the Deep Learning Processor Unit (DPU) provide high-throughput convolution capability for neural network inference. While these architectures achieve substantial acceleration, they require significant logic, memory, and DSP resources, restricting their deployment to larger FPGA platforms. In low-resource devices such as the Artix-7 XC7A35T used in wearable and portable embedded systems, accelerator-class solutions may exceed available resources or introduce unnecessary architectural complexity.

1.1. Motivation for Wearable Signal Processing Workloads

Convolutions are a fundamental operation in wearable biomedical and inertial sensing pipelines, where they are commonly used for filtering and feature extraction from physiological and motion signals. Typical examples include electrocardiogram (ECG) denoising and QRS detection [2,3], electromyography (EMG) envelope extraction [4], inertial measurement unit (IMU) activity classification [5], and bioimpedance signal conditioning [6]. Unlike accelerator-class solutions that rely on separate DSP pipelines or external coprocessors, integrating convolution support directly within the processor datapath reduces memory traffic and control overhead while improving arithmetic utilization. As a result, the proposed architecture targets embedded wearable platforms where efficient streaming convolution must be performed locally with limited FPGA resources.
In contrast, conventional soft-core solutions such as MicroBlaze rely on scalar execution combined with external DSP primitives, which introduces additional register-transfer overhead and increases memory access pressure during sliding-window convolution operations.

1.2. Outline

The rest of this paper is organized as follows: Section 2 surveys related work. Section 3 introduces the processor architecture, and Section 4 details the implementation methodology. Section 5 presents the experimental results, followed by a general discussion in Section 6. Section 7 provides concluding remarks.

2. Related Work

Field-programmable gate arrays (FPGAs) support a wide spectrum of computation architectures, ranging from general-purpose soft processors to highly specialized convolution accelerators. The proposed lightweight convolution-aware processor is designed to occupy an intermediate position, introducing specialized execution mechanisms within a flexible embedded processor architecture.

2.1. General-Purpose FPGA Soft Processors and RISC-V Foundations

Soft processors such as the Xilinx MicroBlaze architecture provide flexible embedded computation platforms that are widely used in FPGA-based signal-processing applications. However, convolution operations executed on scalar pipelines require sequential multiply–accumulate scheduling and repeated register accesses, limiting arithmetic unit utilization during convolution workloads. Although floating-point support improves numerical precision, conventional soft processors do not restructure their pipeline organization to exploit overlap between the multiplication and accumulation stages [7].
In contrast, RISC-V soft processors have emerged as a flexible foundation for implementing specialized convolution accelerators on FPGAs. The open-source RISC-V instruction set architecture provides modularity and simplicity, enabling the design of lightweight processor cores suitable for resource-constrained platforms [8]. Recent work demonstrates that five-stage pipelined RISC-V implementations can achieve competitive performance with general-purpose soft processors while maintaining a low resource footprint—designs on Artix-7 platforms consume as little as 5.5K LUTs and deliver 3.7× speedup for ML inference [9]. These resource-efficient implementations achieve maximum operating frequencies competitive with general-purpose soft cores while providing sufficient performance for embedded signal-processing tasks [10].
Several open-source RISC-V soft processors further demonstrate the maturity of configurable processor architectures for FPGAs and embedded systems. VexRiscv is a highly configurable 32-bit RISC-V processor implemented using SpinalHDL, where architectural features such as pipeline organization, bus interfaces, caches, debug support, and arithmetic extensions can be selected through a plugin-based design methodology [11]. NEORV32 provides a compact MCU-class RISC-V processor and microcontroller-like SoC written in platform-independent VHDL, emphasizing portability, configurability, and suitability for resource-constrained FPGA deployments [12]. Ibex is a production-quality open-source RV32 processor core written in SystemVerilog and designed for embedded control applications, with support for configurable ISA extensions and extensive verification [13]. PULPino, from the Parallel Ultra-Low-Power platform ecosystem, provides a single-core RISC-V microcontroller system targeting low-power embedded and IoT applications [14]. These processors provide flexible general-purpose or microcontroller-class computation platforms; however, they are not primarily organized around convolution-specific pipeline overlap or dedicated sliding-window execution. In contrast, the processor proposed in this work specializes the embedded processor datapath for convolution-heavy signal-processing workloads while preserving a lightweight programmable execution model.
While these open-source RISC-V designs establish flexible processor baselines, convolution-heavy workloads often require additional architectural support beyond conventional scalar execution. This motivates the use of custom instruction extensions and tightly coupled acceleration mechanisms for improving multiply–accumulate throughput.

2.2. Custom Instruction Extensions and Convolution Acceleration

RISC-V’s extensibility permits targeted acceleration of convolution-intensive workloads without requiring substantial redesign of the base processor architecture. Integration of custom instruction extensions and tightly coupled coprocessors enables deterministic execution of convolution kernels while preserving processor flexibility. Multi-mode convolution coprocessors based on RISC-V can accelerate different convolution modes through custom instruction subsets, achieving speedups of 8.74× for specific workload classes compared to standard instruction set implementations [15]. Similarly, dedicated convolution IP coprocessors integrated via standard bus interfaces (such as AXI4-Lite) enable six-fold performance improvements with minimal hardware overhead [16]. Custom instruction extensions for specialized convolution algorithms, such as Winograd-based approaches, demonstrate significant cycle reductions when integrated into RISC-V processors [17]. These approaches support diverse precision domains from fixed-point to floating-point operations [18].

2.3. Accelerator-Class Architectures and Resource Trade-Offs

At the opposite end of the architectural spectrum, accelerator-class convolution engines such as the Xilinx Deep Learning Processor Unit (DPU) achieve high throughput through massively parallel multiply–accumulate arrays and specialized memory hierarchies optimized for neural network inference workloads. These architectures sustain large numbers of concurrent convolution operations but typically require substantial DSP, block memory (BRAM), and routing resources that exceed the capacity of mid-range FPGA devices such as the Artix-7 XC7A35T. Accelerator overlays are generally designed as coprocessor fabrics rather than standalone programmable processors, limiting their suitability for resource-constrained embedded systems that must execute both control logic and signal-processing workloads within a unified architecture [19]. The limited resource capacity of FPGAs in edge servers imposes significant challenges when attempting to deploy multiple DPU instances or when targeting smaller FPGA platforms [20].

2.4. Intermediate Architectural Approaches

Previous research has demonstrated that intermediate approaches—those neither fully general-purpose nor fully dedicated accelerator-class—can achieve effective performance–resource trade-offs through custom instruction extensions and pipelined multiply–accumulate structures [21]. These designs integrate multiply–accumulate extensions and SIMD datapaths into embedded processor architectures to improve performance for signal-processing applications. However, most existing implementations retain single-stage floating-point execution structures and, therefore, do not exploit pipeline-level overlap between multiplication and accumulation operations during convolution execution.
Recent work in FPGA-based pipelining for CNN acceleration demonstrates the effectiveness of distributed arithmetic operations across multiple pipeline stages [22]. By distributing computation across stages while maintaining compatibility with standard embedded processor workflows, such architectures achieve significant convolution acceleration on resource-constrained FPGA platforms while preserving the programmability required for embedded systems.
The proposed processor addresses this gap by introducing convolution-aware pipeline mechanisms with split-stage floating-point multiply–accumulate operations and a two-lane SIMD integer datapath. This architecture enables arithmetic unit overlap through staged operations and separate execution domains for heterogeneous workload support, occupying an intermediate position between general-purpose soft processors and accelerator-class convolution engines while maintaining both performance and resource efficiency suitable for embedded FPGA platforms. These comparisons are summarized in Table 1.

3. Processor Architecture

The proposed processor is a convolution-aware soft-core architecture designed to accelerate convolution workloads while maintaining compatibility with a general-purpose embedded execution model. The processor follows a five-stage pipelined organization derived from the classical RISC pipeline structure, with extensions that introduce dedicated execution paths for floating-point accumulation and SIMD integer convolution operations. Figure 1 illustrates the top-level datapath organization of the processor.
The processor separates scalar integer, scalar floating-point, and vector execution domains through independent register files and specialized functional units. This organization enables overlapping arithmetic and memory operations during convolution execution while preserving compatibility with standard control and arithmetic instruction sequences.

3.1. Pipeline Organization

The processor implements a five-stage pipeline consisting of instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and write-back (WB) stages. Instruction fetch operations are performed from on-chip block memory, while decode logic generates control signals for scalar, vector, and floating-point execution paths.
Unlike conventional scalar soft processors, the execution stage includes multiple parallel functional units supporting scalar arithmetic operations, SIMD integer multiplication, and floating-point convolution primitives. Hazard detection and forwarding logic are incorporated to maintain pipeline throughput during dependent instruction sequences, particularly for multiply–accumulate operations spanning multiple pipeline stages.
This staged organization enables partial overlap between multiplication, accumulation, and operand fetch operations during convolution execution, reducing pipeline stalls compared to conventional scalar floating-point pipelines.

3.2. Register File Organization

To support heterogeneous execution domains, the processor maintains three independent register files: an integer register file for scalar arithmetic and control operations, a floating-point register file compliant with IEEE-754 single-precision operands, and a vector register file supporting two-lane SIMD integer operations.
The separation of register files allows simultaneous operand access across execution units without introducing structural hazards. This organization is particularly beneficial during convolution workloads, where scalar control logic, floating-point accumulation, and SIMD integer multiplication may proceed concurrently.

3.3. Floating-Point Convolution Datapath

A key architectural feature of the processor is the split-stage floating-point convolution pipeline. Unlike conventional scalar floating-point units that perform multiplication and accumulation within a single execution stage, the proposed processor separates these operations across consecutive pipeline stages. As shown in Figure 2, floating-point multiplication is performed in the EX stage, while accumulation is completed in the MEM stage using a dedicated floating-point adder. This staged multiply–accumulate organization enables overlap between arithmetic operations and operand fetch, improving execution efficiency during convolution workloads.
Because accumulation occurs in a later pipeline stage, the processor can issue subsequent instructions without waiting for the accumulation to complete. This reduces pipeline idle cycles and increases arithmetic unit utilization during convolution execution sequences. In addition to the convolution datapath, a scalar floating-point execution unit supports conversion, comparison, and memory-transfer instructions required for compatibility with the RV32F instruction subset. These operations include floating-point load/store instructions as well as format conversion and comparison primitives used in control-flow decisions.
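The overlap described above can be illustrated with a small software model. The sketch below is not the authors' RTL; the names `split_mac_t`, `split_mac_cycle`, and `split_mac_dot` are hypothetical, and the model assumes one multiply is issued per cycle with the accumulate trailing by exactly one cycle. Under those assumptions, a K-tap accumulation occupies K issue cycles plus one drain cycle in the model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of a two-stage MAC: EX multiplies, MEM accumulates. */
typedef struct {
    float product;   /* EX/MEM pipeline register holding the product */
    int   valid;     /* 1 when `product` is waiting to be accumulated */
    float acc;       /* running sum, updated in the MEM stage */
} split_mac_t;

/* Advance one cycle: MEM consumes the previous product while EX produces a new one. */
static void split_mac_cycle(split_mac_t *p, int issue, float x, float h) {
    if (p->valid) {
        p->acc += p->product;      /* MEM stage: accumulate last cycle's product */
    }
    p->valid = issue;
    if (issue) {
        p->product = x * h;        /* EX stage: multiply the newly fetched operands */
    }
}

/* Accumulate k taps; the model takes k issue cycles plus one drain cycle. */
static float split_mac_dot(const float *x, const float *h, size_t k, size_t *cycles) {
    assert(x != NULL && h != NULL && cycles != NULL);
    split_mac_t p = {0.0f, 0, 0.0f};
    *cycles = 0;
    for (size_t i = 0; i < k; ++i) {
        split_mac_cycle(&p, 1, x[i], h[i]);
        ++*cycles;
    }
    split_mac_cycle(&p, 0, 0.0f, 0.0f);   /* drain the final product */
    ++*cycles;
    return p.acc;
}
```

In this model the multiplier is busy on every issue cycle, which is the utilization benefit the split-stage organization is meant to provide; real cycle counts on the processor also depend on operand fetch and hazard handling, which the sketch ignores.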
One limitation of the split-stage floating-point accumulation pipeline is reduced numerical precision compared to IEEE-754 fused multiply–add (FMA). The multiplication result is rounded before accumulation because the operations are separated across pipeline stages, introducing additional intermediate rounding error. In contrast, a fused MAC performs the multiply and add with a single final rounding step. During testing, this loss in precision was considered an acceptable trade-off for improved hardware simplicity. One possible mitigation would be to use a wider accumulator (e.g., 64-bit accumulation for 32-bit operands) to preserve more intermediate precision and reduce the accumulated rounding error.
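The rounding effect and the wider-accumulator mitigation can be seen in a short C comparison. This is a minimal sketch with hypothetical function names (`dot_split`, `dot_wide`): the first rounds every product to single precision before adding, as the split-stage pipeline does, and the second keeps products and the running sum in double precision as a stand-in for a 64-bit accumulation path. It is a software analogy, not a description of the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Split-stage model: each product is rounded to single precision before the add. */
static float dot_split(const float *x, const float *h, size_t k) {
    assert(x != NULL && h != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < k; ++i) {
        volatile float prod = x[i] * h[i];  /* volatile forces the intermediate rounding */
        acc += prod;
    }
    return acc;
}

/* Wider-accumulator model: products and the running sum are kept in double. */
static double dot_wide(const float *x, const float *h, size_t k) {
    assert(x != NULL && h != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < k; ++i) {
        acc += (double)x[i] * (double)h[i];  /* a float-by-float product is exact in double */
    }
    return acc;
}
```

For example, with x = {1 + 2^-12, 1 + 2^-11} and h = {1 + 2^-12, -1}, the first product is exactly 1 + 2^-11 + 2^-24. Rounding it to single precision drops the 2^-24 term, so `dot_split` returns 0, while `dot_wide` retains 2^-24.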

3.4. SIMD Integer Convolution Datapath

The processor also includes a two-lane SIMD 32-bit integer execution path designed to accelerate integer convolution workloads. This datapath performs parallel 32-bit multiplications and additions within a single cycle using dedicated DSP resources mapped to vector multiply–accumulate operations, as shown in Figure 3. Vector instructions operate on operands stored in the vector register file and are invoked through custom macro-instruction sequences integrated into the software toolchain. These instructions enable efficient execution of fixed-point convolution kernels commonly used in embedded signal-processing applications. By supporting both floating-point and SIMD integer convolution pipelines, the architecture provides flexibility across multiple precision domains while maintaining low hardware complexity relative to accelerator-class solutions.

Custom SIMD Instruction Set Extension

To expose the SIMD datapath to software, the processor implements a lightweight custom instruction set extension integrated into the base RISC-V execution model. These instructions map directly to dedicated vector execution hardware and enable efficient multiply–accumulate operations required by convolution workloads while preserving compatibility with standard toolchain workflows through assembler macro wrappers.
The vector multiply–accumulate operation is implemented through the vmac macro as shown in Listing 1, which performs lane-wise multiplication followed by accumulation into the destination vector register. In addition, the vadd macro provides vector addition between two source vector operands, while vmove transfers data from an integer register into a vector register to support pointer initialization and operand setup.
Listing 1. Custom vector MAC encoding.
# R-type custom instruction: opcode 0x5B, funct3 0x00, funct7 0x01
.macro vmac rd, rs1, rs2
        .insn r 0x5B, 0x00, 0x01, \rd, \rs1, \rs2
.endm
Memory access support for vector execution is provided through specialized load and store macros. The vload instruction loads two contiguous values from memory into a vector register, enabling efficient access to adjacent input samples during convolution. The vsload instruction loads a single scalar value from memory and replicates or inserts it into a vector register, which is particularly useful for kernel coefficients shared across SIMD lanes. The vstore instruction writes vector results back to memory.
Additional vector-immediate macros support address calculation and loop-index updates. The vaddi macro performs vector addition with an immediate operand, while vslli applies lane-wise logical left shifts to generate byte offsets for memory addressing. The vcaddi macro is similar to vaddi, but instead of writing the same result across all lanes, it adds a per-lane offset to the result (+0 for lane 1, +1 for lane 2). Finally, vauipc is used for vector-aware upper-immediate address construction.
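To make the macro semantics concrete, the following C reference model mimics a two-lane vector register and a few of the macros described above. It is a sketch under stated assumptions rather than the processor's definition: the `model_*` names and `vreg_t` are hypothetical, vsload is assumed to replicate the scalar into both lanes, vcaddi is assumed to take a scalar base, and lanes are indexed 0 and 1 where the text says lane 1 and lane 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a two-lane 32-bit vector register. */
typedef struct { int32_t lane[2]; } vreg_t;

/* vload: two contiguous values from memory, one per lane. */
static vreg_t model_vload(const int32_t *addr) {
    vreg_t v = {{addr[0], addr[1]}};
    return v;
}

/* vsload: one scalar placed in both lanes (replication is an assumption). */
static vreg_t model_vsload(const int32_t *addr) {
    vreg_t v = {{addr[0], addr[0]}};
    return v;
}

/* vmac: lane-wise multiply, then accumulate into the destination register.
   The model assumes the values do not overflow int32_t. */
static void model_vmac(vreg_t *rd, vreg_t rs1, vreg_t rs2) {
    for (int i = 0; i < 2; ++i) {
        rd->lane[i] += rs1.lane[i] * rs2.lane[i];
    }
}

/* vcaddi: base plus immediate, with a per-lane offset of +0 and +1. */
static vreg_t model_vcaddi(int32_t base, int32_t imm) {
    vreg_t v = {{base + imm, base + imm + 1}};
    return v;
}

/* Two adjacent outputs of a k-tap sliding window, one output per lane.
   Reads x[0..k]; the kernel is applied without flipping. */
static vreg_t model_conv_pair(const int32_t *x, const int32_t *h, size_t k) {
    assert(x != NULL && h != NULL && k >= 1);
    vreg_t acc = {{0, 0}};
    for (size_t j = 0; j < k; ++j) {
        model_vmac(&acc, model_vload(&x[j]), model_vsload(&h[j]));
    }
    return acc;
}
```

Each loop iteration pairs one vload of adjacent samples with one vsload of a shared coefficient, which is why the two load variants are useful together in a sliding-window kernel.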

3.5. Hazard Detection and Forwarding Network

To sustain throughput during convolution execution, the processor implements separate hazard detection and forwarding networks for the floating-point datapath and the integer datapath (including scalar and vector operations). Each path maintains independent validity and dependency flags to prevent incorrect operand forwarding and eliminate spurious hazard conditions across mixed execution streams. The floating-point datapath is optimized to minimize hazards by exploiting branch–flush intervals, enabling results to be written and forwarded just prior to the next MAC instruction, and effectively hiding pipeline latency during chained convolution operations.

3.6. DSP-Based Execution Units

Dedicated DSP resources are used to implement both scalar and vector multiply–accumulate operations. The internal architecture diagram of the DSP can be seen in Figure 4 and shows how it is used to implement MAC operations. Scalar DSP units support floating-point operations within the convolution datapath, while vector DSP units provide parallel integer multiplication and addition for SIMD execution. Mapping convolution primitives directly onto FPGA DSP slices reduces logic utilization and improves arithmetic throughput compared to LUT-based implementations. This approach enables the processor to achieve significant convolution acceleration while remaining within the resource constraints of mid-range Artix-7 devices such as the XC7A35T.

4. Implementation Details

4.1. FPGA Implementation

The proposed convolution-aware processor was implemented on a Digilent Basys3 development board featuring an Artix-7 XC7A35T FPGA. This device provides a compact logic fabric suitable for evaluating lightweight processor architectures intended for wearable and resource-constrained embedded systems while still offering dedicated DSP resources for arithmetic acceleration.
The processor operates at a 50 MHz system clock derived from the onboard 100 MHz oscillator through a clock-division stage. This operating frequency was selected to ensure stable timing closure while enabling direct comparison with a MicroBlaze soft processor synthesized under identical clock conditions and floating-point precision support.
Instruction and data memories were implemented using on-chip block RAM resources configured for single-cycle access latency. This organization aligns with the processor pipeline structure and enables operand fetch operations to overlap with floating-point accumulation during convolution execution.
The processor was synthesized and implemented using Vivado 2025.2, and the resources utilized are summarized in Table 2. Post-implementation timing analysis confirmed stable operation at 50 MHz with positive slack across all critical paths, indicating that the split-stage multiply–accumulate pipeline does not introduce timing bottlenecks on the XC7A35T device.
The moderate LUT utilization reflects the inclusion of dedicated SIMD datapaths, independent floating-point accumulation hardware, and hazard-control logic required to support concurrent scalar, vector, and floating-point execution domains. These structures enable overlap between multiplication, accumulation, and operand streaming operations, which directly contributes to reduced convolution execution latency.
Despite supporting both floating-point and SIMD integer multiply–accumulate execution, the processor requires only seven DSP48E1 slices. This modest DSP utilization results from distributing multiplication and accumulation across pipeline stages rather than replicating independent accelerator units, allowing efficient arithmetic acceleration within the resource limits of the XC7A35T device.
Post-implementation power analysis of the proposed processor was performed using the Vivado 2025.2 power estimator under identical operating conditions on the XC7A35T device, and the power consumption results are summarized in Table 3. The measured dynamic power consumption reflects the activity of the SIMD datapaths, floating-point execution pipeline, and hazard-control structures required to sustain concurrent arithmetic operations across multiple execution domains. Additional switching activity arises from operand routing between pipeline stages and the register files supporting scalar, vector, and floating-point execution. Despite the inclusion of these parallel execution resources, the overall power consumption remains low for the XC7A35T device, confirming that the proposed architecture achieves increased computational throughput without requiring excessive arithmetic or memory resources.
Post-implementation power analysis was also performed for the MicroBlaze, with the resource utilization results shown in Table 4 and power consumption summarized in Table 5. Compared with the MicroBlaze configuration evaluated under identical operating conditions, the proposed processor requires approximately twice the LUT resources while maintaining comparable DSP utilization. Both the proposed processor and the MicroBlaze baseline rely on on-chip block RAM (BRAM) resources for instruction and data storage, resulting in comparable memory organization between the two implementations. Despite this increase in logic utilization, the overall resource footprint remains well within the capacity of the XC7A35T device, leaving sufficient headroom for future architectural extensions such as additional vector lanes, specialized acceleration units, or expanded memory interfaces.
A comparison of the two implementations shows that the proposed processor exhibits a moderate increase in total on-chip power consumption, due primarily to the higher dynamic switching activity associated with the SIMD datapaths and floating-point execution pipeline. While the static power component remains unchanged between the two implementations, the additional dynamic power reflects the increased arithmetic concurrency and operand routing required to support convolution-aware execution. Despite this increase, the higher computational throughput achieved by the proposed architecture results in improved energy efficiency per convolution operation.

4.2. Software Toolchain Integration

The process of programming the processor is shown in Figure 5. Application programs targeting the proposed processor are compiled using a bare-metal RISC-V toolchain [24] and linked against a lightweight support library that exposes memory-mapped control registers for the scalar, vector, and floating-point execution domains, as well as a timer and the UART Rx/Tx with handshake protocols. Convolution kernels are implemented as assembly-level macro routines callable from C, enabling deterministic scheduling of SIMD and floating-point operations without modifying the base instruction set architecture. The compilation flow produces executables that are converted into block RAM initialization files (.coe), allowing programs to be embedded directly into on-chip instruction memory at synthesis time. Additional Tcl scripts are provided for building the Vivado 2025.2 project and reprogramming the firmware. This approach eliminates the need for external program loaders and supports repeatable deployment of convolution workloads on the FPGA platform.
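The conversion step to a .coe file is not detailed in the text, so the following is only a sketch of what such a step could look like. It formats 32-bit instruction words using the standard COE layout (a radix line followed by a comma-separated vector terminated by a semicolon). The function name `coe_format` is hypothetical, as is the one-word-per-entry memory width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format 32-bit words as COE text.
   Returns the number of bytes written, or 0 if buf is too small. */
static size_t coe_format(char *buf, size_t cap, const uint32_t *words, size_t n) {
    assert(buf != NULL && words != NULL && n >= 1);
    int len = snprintf(buf, cap,
                       "memory_initialization_radix=16;\n"
                       "memory_initialization_vector=\n");
    if (len < 0 || (size_t)len >= cap) {
        return 0;
    }
    size_t used = (size_t)len;
    for (size_t i = 0; i < n; ++i) {
        /* Entries are comma-separated; the last one ends with a semicolon. */
        len = snprintf(buf + used, cap - used, "%08lx%s\n",
                       (unsigned long)words[i], (i + 1 == n) ? ";" : ",");
        if (len < 0 || (size_t)len >= cap - used) {
            return 0;
        }
        used += (size_t)len;
    }
    return used;
}
```

A caller would typically extract the words from the linked executable's text section and write the resulting buffer to a file that the block RAM IP references at synthesis time.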

5. Experimental Results

The performance of the proposed convolution-aware processor was evaluated using one-dimensional floating-point convolution workloads and compared against a MicroBlaze soft processor configured with hardware single-precision floating-point support. Both processors were implemented on the same Artix-7 XC7A35T FPGA device using a Digilent Basys3 development board and operated at an identical clock frequency of 50 MHz to ensure a fair architectural comparison.
The benchmark consisted of convolving an input signal of length 1024 with kernels of varying size composed of unit-valued coefficients. This benchmark is commonly used in wearable systems to detect signal patterns. All experiments were executed using IEEE-754 single-precision arithmetic, and the execution time was measured in processor clock cycles.
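For reference, the benchmark workload corresponds to a plain scalar kernel of the following shape. This is a minimal sketch, not the benchmark source used in the experiments: the name `conv1d_valid` is hypothetical, and it assumes only fully overlapping ("valid") outputs are produced, which matches the output count used for the per-sample metrics in this section.

```c
#include <assert.h>
#include <stddef.h>

/* Valid-mode 1D convolution: writes n - k + 1 outputs and returns that count. */
static size_t conv1d_valid(const float *x, size_t n,
                           const float *h, size_t k, float *y) {
    assert(x != NULL && h != NULL && y != NULL);
    assert(k >= 1 && n >= k);
    size_t outputs = n - k + 1;
    for (size_t i = 0; i < outputs; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < k; ++j) {
            acc += x[i + j] * h[k - 1 - j];   /* kernel is flipped, as in true convolution */
        }
        y[i] = acc;
    }
    return outputs;
}
```

With unit-valued coefficients, each output is simply the sum of K consecutive input samples, so the kernel acts as an unnormalized moving-sum filter and the inner loop performs K multiply-accumulate operations per output.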
To evaluate scalability with respect to convolution depth, experiments were performed using kernel sizes of 3, 5, 7, 9, and 11. The results are summarized in Figure 6 and Figure 7. The proposed processor consistently outperformed the MicroBlaze implementation across all evaluated configurations, by at least 6×.
The performance advantage increases slightly with kernel size, as shown in Figure 8, indicating improved arithmetic utilization as the accumulation depth grows. This behavior reflects the effectiveness of the split-stage floating-point pipeline, which overlaps multiplication and accumulation across execution stages while simultaneously streaming operands from memory. In contrast, the MicroBlaze processor executes convolution using a scalar floating-point pipeline without convolution-aware scheduling support, resulting in reduced pipeline utilization during extended accumulation sequences.
The execution speedup is computed as
Speedup = T baseline T proposed ,
where T baseline corresponds to the execution time measured on the MicroBlaze processor and T proposed corresponds to the execution time of the proposed convolution-aware processor under identical clock frequency and precision conditions.
To further normalize performance across kernel sizes, the average number of cycles per output sample was computed and is reported in Figure 9. For an input signal of length $N$ and a kernel of length $K$, the number of valid output samples is $N - K + 1$. Therefore, the average cycles per output sample is defined as
$$\text{Cycles per Output Sample} = \frac{C_{\text{total}}}{N - K + 1},$$
where $C_{\text{total}}$ is the total measured cycle count for the convolution. Using $N = 1024$, the results show that the proposed processor consistently requires substantially fewer cycles per output sample than the MicroBlaze implementation, with the performance gap increasing as the kernel length grows. This trend reflects improved arithmetic utilization of the split-stage floating-point accumulation pipeline and the SIMD datapath, which reduce pipeline stalls during longer convolution sequences and enable more efficient overlap between multiplication, accumulation, and operand fetch operations.
The energy per output sample, shown in Figure 10, was computed as
$$E_{\text{sample}} = \frac{P_{\text{total}} \cdot T_{\text{exec}}}{N - K + 1},$$
where $P_{\text{total}}$ is the measured implementation power, $T_{\text{exec}}$ is the convolution runtime, $N$ is the input length, and $K$ is the kernel size. These results demonstrate that the proposed processor achieves substantial improvements in floating-point convolution efficiency while remaining compatible with resource-constrained FPGA platforms.
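The three metrics defined in this section can be evaluated together with a few lines of Python. The sketch below is illustrative only: the cycle counts are hypothetical placeholders rather than measured values, and the helper names are assumptions. The total power figures (0.286 W and 0.221 W) are the post-implementation totals reported in Tables 3 and 5, and the 50 MHz clock is the frequency used for both processors.

```python
F_CLK_HZ = 50e6  # clock frequency used for both processors


def valid_outputs(n, k):
    """Number of valid output samples, N - K + 1."""
    return n - k + 1


def speedup(cycles_baseline, cycles_proposed):
    """At equal clock frequency, the time ratio equals the cycle ratio."""
    return cycles_baseline / cycles_proposed


def cycles_per_output(cycles_total, n, k):
    return cycles_total / valid_outputs(n, k)


def energy_per_output_j(p_total_w, cycles_total, n, k, f_clk_hz=F_CLK_HZ):
    t_exec_s = cycles_total / f_clk_hz  # runtime derived from the cycle count
    return p_total_w * t_exec_s / valid_outputs(n, k)


N, K = 1024, 5
# HYPOTHETICAL cycle counts, chosen only to exercise the formulas.
cycles_mb, cycles_prop = 612_000, 102_000

print("speedup:", speedup(cycles_mb, cycles_prop))
print("cycles/output (proposed):", cycles_per_output(cycles_prop, N, K))
print("energy/output (proposed, J):", energy_per_output_j(0.286, cycles_prop, N, K))
print("energy/output (MicroBlaze, J):", energy_per_output_j(0.221, cycles_mb, N, K))
```

Because the proposed design draws more total power than the baseline, its energy advantage per output sample comes entirely from the shorter runtime.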

6. Discussion

6.1. Architectural Performance Analysis

The experimental results demonstrate that the proposed processor achieves a consistent performance improvement over the MicroBlaze baseline across all evaluated convolution kernel sizes while operating at the same clock frequency and numerical precision. This improvement is primarily enabled by the split-stage floating-point accumulation pipeline, which allows multiplication and accumulation operations to overlap across pipeline stages, reducing execution stalls during convolution sequences.
While the split-stage floating-point accumulation pipeline improves throughput by overlapping multiplication and accumulation stages, it does not provide IEEE-754 fused multiply–add semantics. Intermediate multiplication results are rounded prior to accumulation, introducing additional rounding error compared to fused MAC implementations that perform a single final rounding step. For the evaluated convolution workloads, this reduction in numerical precision was considered an acceptable trade-off for improved pipeline simplicity and implementation efficiency. Future implementations could mitigate the accumulated rounding error through wider accumulation registers or extended-precision accumulation paths.
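The effect of rounding the product before accumulation can be reproduced in software. The following Python sketch emulates binary32 rounding with the standard `struct` module; the function names are assumptions, and the binary64 intermediate stands in for an exact result only because the operands were chosen so that the product and sum are exactly representable in binary64. It is not a general fused multiply–add emulation.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mac_split(a, b, c):
    """Split-stage style: round the product to binary32, then add and round."""
    return f32(f32(a * b) + c)


def mac_single_rounding(a, b, c):
    """Reference with a single final rounding (exact here in binary64)."""
    return f32(a * b + c)


# Operands chosen so that a*b lies exactly halfway between two binary32 values.
a = b = f32(1.0 + 2.0**-12)
c = f32(-(1.0 + 2.0**-11))

print("split-stage    :", mac_split(a, b, c))
print("single rounding:", mac_single_rounding(a, b, c))
```

In this constructed case, the low-order bit of the product is lost when it is rounded before the addition, so the two paths return different results. For typical convolution inputs the discrepancy is far smaller relative to the accumulated sum.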
In addition to floating-point acceleration, the inclusion of a two-lane SIMD integer datapath further improves arithmetic throughput for convolution workloads by enabling parallel multiply–accumulate operations using FPGA DSP resources. Unlike conventional scalar soft processors, which execute convolution as a sequence of dependent arithmetic instructions, the proposed architecture increases arithmetic unit utilization through explicit support for convolution-oriented execution patterns.
Both the MicroBlaze baseline and the proposed processor were implemented on the same FPGA device and, therefore, had access to the same DSP48E1 resources. However, the MicroBlaze baseline was not configured with an equivalent convolution-specific multiply–accumulate datapath or split-stage floating-point accumulation pipeline. In the proposed processor, the performance improvement comes from the architectural organization of the datapath and pipeline, rather than from simply having access to DSP48 blocks. Therefore, the comparison reflects the benefit of the proposed architectural features under the evaluated configuration, not the maximum possible performance of a fully customized or hardware-extended MicroBlaze design.
The proposed architecture does not rely on conventional software loop unrolling. Instead, it improves convolution throughput through a custom datapath and split-stage multiply–accumulate pipeline that overlaps multiplication, accumulation, and operand preparation across pipeline stages.
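The overlap described above can be illustrated with a toy cycle model. The sketch below is a simplification under stated assumptions, not a model of the actual hardware: it assumes one cycle each for the multiply and accumulate stages, ignores operand fetch, memory stalls, and the multi-cycle latency of real floating-point units, and uses hypothetical function names.

```python
def cycles_without_overlap(k):
    """Each tap performs a multiply cycle followed by an accumulate cycle."""
    return 2 * k


def simulate_split_stage(samples, coeffs):
    """Toy two-stage model: one stage multiplies tap j while the next
    stage accumulates the product of tap j - 1."""
    k = len(coeffs)
    acc = 0.0
    product_reg = None  # pipeline register between the two stages
    cycles = 0
    j = 0
    while j < k or product_reg is not None:
        # Accumulate stage: consume the product latched in the previous cycle.
        if product_reg is not None:
            acc += product_reg
        # Multiply stage: issue the next tap if any remain.
        if j < k:
            product_reg = samples[j] * coeffs[j]
            j += 1
        else:
            product_reg = None
        cycles += 1
    return acc, cycles


for K in (3, 5, 7, 9, 11):
    window = [1.0] * K
    _, overlapped = simulate_split_stage(window, [1.0] * K)
    print(K, cycles_without_overlap(K), overlapped)
```

In this model, K taps complete in K + 1 cycles instead of 2K, so the single cycle of pipeline fill is amortized over more work as the kernel grows.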

6.2. Scalability and Resource Trade-Offs

The proposed architecture improves convolution throughput, but scalability introduces additional resource and memory-system constraints.
The observed increase in speedup with larger kernel sizes indicates that the processor benefits from improved pipeline efficiency as the convolution depth increases. This behavior confirms that distributing multiply–accumulate operations across multiple pipeline stages reduces the relative overhead associated with loop control and memory access operations, resulting in improved scalability for longer convolution filters.
The architecture is extensible, but additional parallelism is not obtained through simple duplication of integer, floating-point, or load/store pipelines. Increasing the SIMD width or adding additional MAC lanes could improve performance by enabling more convolution operations to be executed in parallel, particularly for larger kernels where more multiply–accumulate operations are available per output sample. However, such extensions require corresponding increases in register-file bandwidth, forwarding paths, hazard detection logic, load/store throughput, and memory-system bandwidth. Since FPGA BRAMs provide a limited number of read and write ports, additional execution lanes may become underutilized if the memory subsystem cannot supply operands at the required rate.
Future scalability must be co-designed with mechanisms such as BRAM banking, operand buffering, and sliding-window reuse. While these extensions could increase throughput and reduce the execution time, they would also increase DSP utilization, routing complexity, switching activity, and dynamic energy consumption. As a result, the current two-lane SIMD and split-stage floating-point implementation represents a balanced design point between performance improvement, resource usage, and energy efficiency for resource-constrained embedded workloads.
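To make the benefit of sliding-window reuse concrete, the following Python sketch counts operand reads for a direct implementation and for one that keeps the last K samples in a small buffer. It is a software analogy under assumed names (`conv_naive`, `conv_sliding_window`); the `deque` merely stands in for a K-entry operand buffer and does not describe the implemented hardware.

```python
from collections import deque


def conv_naive(signal, kernel):
    """Re-reads every sample of the window for each output."""
    k = len(kernel)
    reads, out = 0, []
    for i in range(len(signal) - k + 1):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[k - 1 - j]
            reads += 1  # one signal read per multiply-accumulate
        out.append(acc)
    return out, reads


def conv_sliding_window(signal, kernel):
    """Reads each sample once and reuses it from a K-entry window buffer."""
    k = len(kernel)
    flipped = kernel[::-1]
    window = deque(maxlen=k)  # oldest sample is dropped automatically
    reads, out = 0, []
    for x in signal:
        window.append(x)
        reads += 1  # one signal read per new sample
        if len(window) == k:
            out.append(sum(w * c for w, c in zip(window, flipped)))
    return out, reads


signal = [float(i % 8) for i in range(1024)]  # hypothetical input
kernel = [1.0] * 11
_, naive_reads = conv_naive(signal, kernel)
_, window_reads = conv_sliding_window(signal, kernel)
print(naive_reads, window_reads)
```

The direct form needs K(N − K + 1) signal reads, whereas the buffered form needs only N, which is why operand reuse eases pressure on the limited BRAM ports as lanes are added.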
Although the proposed architecture increases LUT utilization relative to lightweight scalar soft processors, the additional hardware cost may be justified for wearable applications requiring continuous real-time signal processing or low-latency inference. However, applications dominated by long idle intervals may benefit less from the increased parallel hardware resources due to higher static resource utilization.
The implemented design operates at 50 MHz to ensure reliable timing closure on the Artix-7 FPGA platform. Floating-point multiplication, normalization, and accumulation introduce relatively long combinational paths that become increasingly difficult to close at higher frequencies without additional pipelining or architectural optimization. Although higher operating frequencies are likely achievable on more advanced FPGA platforms or through deeper pipelining, such modifications would introduce additional design complexity and verification overhead.
Compared with accelerator-class architectures such as the Xilinx DPU, the proposed processor achieves lower peak throughput but requires substantially fewer hardware resources and maintains compatibility with a lightweight embedded processor model. This makes the architecture suitable for mid-range FPGA platforms such as the Artix-7 XC7A35T, where accelerator overlays may exceed available DSP and memory resources.
These characteristics position the proposed processor as an intermediate solution between general-purpose soft processors and dedicated convolution accelerators, providing improved convolution performance while preserving flexibility for embedded signal-processing applications. While the proposed SIMD floating-point architecture is suitable for many convolution and DSP-oriented edge AI workloads, future edge inference applications may increasingly favor lower-precision arithmetic formats and tensor-oriented execution units to improve energy efficiency and throughput.
Additionally, because the current implementation relies entirely on on-chip BRAM resources, very large activation or weight tensors may eventually require external memory support, introducing additional memory hierarchy complexity and bandwidth overhead.

6.3. Limitations and Future Work

Despite the demonstrated performance improvements, the current implementation retains several architectural and evaluation limitations.
The current implementation targets fixed-width two-lane SIMD execution and relies on software-managed address generation for multidimensional convolution workloads; it does not support hardware address generation, dynamic precision switching, or out-of-order execution. Although higher-dimensional convolutions can be executed through standard indexing transformations, future work may explore hardware-supported window generation and operand reuse mechanisms to further improve efficiency for image-processing applications. Additionally, future evaluation will include comparisons against representative wearable-class embedded processors executing ECG, EMG, and IMU filtering pipelines in order to quantify energy efficiency under realistic sensing workloads. Such analysis would further clarify the suitability of the proposed architecture for continuous low-latency edge signal processing in resource-constrained wearable platforms.
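As an illustration of the software-managed indexing mentioned above, a two-dimensional convolution can be expressed over flat row-major buffers with explicit address arithmetic. This is a minimal Python sketch with assumed names and an assumed row-major layout; it is not taken from the project's code.

```python
def conv2d_valid_flat(image, width, height, kernel, kw, kh):
    """2-D valid convolution over flat row-major buffers.

    All addresses are computed in software as row * row_width + column.
    """
    out_w, out_h = width - kw + 1, height - kh + 1
    out = [0.0] * (out_w * out_h)
    for r in range(out_h):
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    img_addr = (r + i) * width + (c + j)
                    # Convolution flips the kernel along both axes.
                    ker_addr = (kh - 1 - i) * kw + (kw - 1 - j)
                    acc += image[img_addr] * kernel[ker_addr]
            out[r * out_w + c] = acc
    return out


# 3x3 image and 2x2 unit kernel stored as flat lists (illustrative values).
image = [1.0, 2.0, 3.0,
         4.0, 5.0, 6.0,
         7.0, 8.0, 9.0]
print(conv2d_valid_flat(image, 3, 3, [1.0] * 4, 2, 2))
```

The inner loops perform two multiplications and several additions per operand just to form addresses, which is the overhead that hardware window generation would remove.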
A wider SIMD datapath would necessitate a more complex memory hierarchy. Two lanes were chosen because block RAM on the Artix-7 is dual-port by default.
The high LUT usage of the proposed design is a limitation that is not always justified, even for wearable systems that require continuous real-time signal processing or neural network inference. In some designs, additional logic also raises static power; this is not the case here, as the proposed processor and the MicroBlaze baseline both report a static power of 0.073 W (Tables 3 and 5).
The current power and performance evaluation is primarily based on synthetic convolution benchmarks, which are intended to isolate datapath behavior and arithmetic throughput rather than represent full application workloads.
Ultra-low-power techniques such as clock gating and duty cycling are promising directions for future work to further reduce dynamic power consumption during periods of low computational activity.
The present work does not evaluate sensitivity to temperature, voltage, or process variations under wearable operating conditions. While the relatively modest operating frequency may provide additional timing margin, a full characterization across environmental and process corners remains an open area for future work.

7. Conclusions

This work presents a convolution-aware processor architecture designed to improve the execution efficiency of one-dimensional convolution workloads on resource-constrained FPGA platforms. The proposed design extends a conventional five-stage pipeline with a split-stage floating-point accumulation datapath and a two-lane SIMD 32-bit integer execution unit. The integer unit is combined with DSP units that perform multiplication and addition in the same cycle. Together, these features enable overlap between multiplication, accumulation, and operand fetch operations during convolution execution.
Experimental evaluation on a Digilent Basys3 FPGA demonstrated consistent performance improvements over a MicroBlaze soft processor operating under identical clock frequency and precision conditions. The proposed processor achieved speedups exceeding six-fold across multiple kernel sizes while maintaining a lightweight hardware footprint suitable for mid-range Artix-7 devices.
Unlike accelerator-class solutions such as the Xilinx DPU, which prioritize maximum throughput at the cost of increased resource utilization, the proposed architecture provides a balanced trade-off between performance and implementation complexity while preserving compatibility with embedded processor workflows.
Future work will investigate extensions to wider SIMD datapaths, as well as integration with higher-level compiler toolchains to enable automated mapping of convolution workloads to the proposed execution model.

Author Contributions

Conceptualization, F.L.P.D. and J.F.P.M.; methodology, F.L.P.D., B.A.R. and J.F.P.M.; software, F.L.P.D.; validation, F.L.P.D. and J.F.P.M.; formal analysis, F.L.P.D.; investigation, F.L.P.D.; resources, J.F.P.M.; data curation, F.L.P.D.; writing—original draft preparation, F.L.P.D.; writing—review and editing, F.L.P.D., B.A.R. and J.F.P.M.; visualization, F.L.P.D.; supervision, J.F.P.M.; project administration, J.F.P.M.; funding acquisition, J.F.P.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received financial support from the NSF CAREER Award under Grant No. OAC-2439345.

Data Availability Statement

All scripts for building this project and reproducing the tests are available in the following GitHub repository: https://github.com/Embedded-Autonomy-UPRM/RISCV-Adaptation (accessed on 15 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations and Acronyms

The following abbreviations and acronyms are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| FPGA | Field-Programmable Gate Array |
| LUT | Look-Up Table |
| RISC | Reduced Instruction Set Computer |
| DPU | Deep Learning Processor Unit |
| SIMD | Single Instruction, Multiple Data |
| DSP | Digital Signal Processor |
| BRAM | Block Random Access Memory |
| MAC | Multiply and Accumulate |
| CNN | Convolutional Neural Network |
| XC7A35T | Xilinx Artix-7 FPGA Device Identifier |
| ECG | Electrocardiogram |
| EMG | Electromyography |
| IEEE-754 | IEEE Standard for Floating-Point Arithmetic |
| IMU | Inertial Measurement Unit |
| QRS | QRS Complex (electrocardiogram waveform component) |

References

  1. IEEE Std 754-2019; IEEE Standard for Floating-Point Arithmetic. IEEE: New York, NY, USA, 2019. Available online: https://standards.ieee.org/ieee/754/6210/ (accessed on 22 May 2026).
  2. Pan, J.; Tompkins, W.J. A Real-Time QRS Detection Algorithm. IEEE Trans. Biomed. Eng. 1985, BME-32, 230–236.
  3. Kamga, P.; Mostafa, R.; Zafar, S. The Use of Wearable ECG Devices in the Clinical Setting: A Review. Curr. Emerg. Hosp. Med. Rep. 2022, 10, 67–72.
  4. Phinyomark, A.; Phukpattaranont, P.; Limsakul, C. Feature Reduction and Selection for EMG Signal Classification. Expert Syst. Appl. 2012, 39, 7420–7431.
  5. Bulling, A.; Blanke, U.; Schiele, B. A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput. Surv. 2014, 46, 33.
  6. Scharfetter, H.; Monif, M.; László, Z.; Lambauer, T.; Hutten, H.; Hinghofer-Szalkay, H. Effect of postural changes on the reliability of volume estimations from bioimpedance spectroscopy data. Kidney Int. 1997, 51, 1078–1087.
  7. Rosso, D.A.; Zerbini, C.A.; Riva, G.G. Exploring Hardware/Software Trade-Offs for CORDIC Acceleration in a Customized Processor. In International Symposium on Computational Aesthetics in Graphics, Visualization, and Imaging; IEEE: New York, NY, USA, 2026.
  8. Li, J.; Shao, C.; Li, H.; Tang, Z. YOLOv5n Edge Accelerator Design and Energy Efficiency Optimization Through RISC-V and FPGA Collaboration. In 2025 5th International Conference on Artificial Intelligence, Big Data and Algorithms (CAIBDA); IEEE: New York, NY, USA, 2025.
  9. Galetakis, M.; Kalapothas, S.; Flamis, G.; Kitsos, P.; Plessas, F. Design and implementation of a compact RISC-V based Machine Learning accelerator on Low End FPGA. In Proceedings & Highlights—Emerging Tech Conference Edge Intelligence 2023; Hellenic Emerging Technologies Industry Association: Athens, Greece, 2026.
  10. Zheng, T.; Cai, G.; Huang, Z. A Soft RISC-V Processor IP with High-performance and Low-resource consumption for FPGA. In International Symposium on Circuits and Systems; IEEE: New York, NY, USA, 2022.
  11. SpinalHDL. VexRiscv: A FPGA Friendly 32-Bit RISC-V CPU Implementation. Available online: https://github.com/SpinalHDL/VexRiscv (accessed on 22 May 2026).
  12. Nolting, S. NEORV32 RISC-V Processor. Available online: https://github.com/stnolting/neorv32 (accessed on 22 May 2026).
  13. lowRISC. Ibex RISC-V Core. Available online: https://github.com/lowRISC/ibex (accessed on 22 May 2026).
  14. PULP Platform. PULPino: An Open-Source RISC-V Microcontroller System. Available online: https://github.com/pulp-platform/pulpino (accessed on 22 May 2026).
  15. Gong, W.; Zhou, F.; Ge, F. A Multi-mode Convolution Coprocessor Based on RISC-V Instruction Set Architecture. In International Conference on ASIC; IEEE: New York, NY, USA, 2023.
  16. Cantero, D.; Ugena, A.; Sanz, L.; Arteaga, A.; Astarloa, A. Performance Analysis of Convolution Function for IA Edge Computing Acceleration Using a 32-bit RISC-V CPU Implementation. In Conference on Design of Circuits and Integrated Systems; IEEE: New York, NY, USA, 2025.
  17. Wang, S.; Zhu, J.; Wang, Q.; He, C.; Ye, T.T. Customized Instruction on RISC-V for Winograd-Based Convolution Acceleration. In IEEE International Conference on Application-Specific Systems, Architectures, and Processors; IEEE: New York, NY, USA, 2021.
  18. Kudachi, U.; Siddamal, S.V.; Alawandi, S.; Beedanal, A. RISC-V Based SoC Design Featuring Fixed and Floating Point Units with Integrated FFT Accelerator for DSP Workloads. In 2025 IEEE 32nd International Conference on High Performance Computing, Data and Analytics Workshop (HiPCW); IEEE: New York, NY, USA, 2025.
  19. Gao, C.; Saha, S.; Lu, Y.; Saha, R.; McDonald-Maier, K.D.; Zhai, X. Deep Learning on FPGAs with Multiple Service Levels for Edge Computing. In International Conference on Automation and Computing; IEEE: New York, NY, USA, 2022.
  20. Etokebe, A.; Zhai, L.; Cui, S. A Comprehensive Review of Design and Implementation of Hardware Accelerators. IEEE Access 2025, 14, 49563–49581.
  21. Parameshwara, A. SynapticCore-X: A Modular Neural Processing Architecture for Low-Cost FPGA Acceleration. arXiv 2025, arXiv:2511.12616.
  22. Gheni, A.; Ali, A.H.A. Convolutional Neural Networks using FPGA-based Pipelining. Iraqi J. Comput. Sci. Math. 2023, 4, 19.
  23. AMD Xilinx. 7 Series DSP48E1 Slice User Guide (UG479), Version 1.13; AMD: Santa Clara, CA, USA, 2023.
  24. RISC-V International. RISC-V GNU Toolchain Contributors. RISC-V GNU Toolchain. 2025. Available online: https://github.com/riscv-collab/riscv-gnu-toolchain.git (accessed on 22 April 2026).
Figure 1. Top-level pipeline organization of the proposed convolution-aware processor. The architecture extends a conventional five-stage pipeline with dedicated floating-point accumulation and SIMD integer convolution execution units. The colored regions distinguish the main pipeline stages and functional groups: instruction fetch, decode/register access, execution units, memory access, write-back, and forwarding/control logic.
Figure 2. Operands A, B, and C are read from the floating-point register file during the ID stage. The product A × B is computed in the EX stage using the floating-point multiplier, and the result is accumulated with C in the MEM stage to produce A × B + C, enabling efficient pipelined support for convolution-style operations. The color-coded blocks distinguish pipeline registers, register-file access, arithmetic execution units, accumulation logic, and write-back/output paths.
Figure 3. Two-lane vector execution pipeline showing the vector register file in the ID stage and parallel lane-level ALU and DSP units in the EX stage. The diagram highlights operand distribution across both lanes and the subsequent flow through memory access and write-back stages within the pipelined datapath.
Figure 4. Annotated DSP48E1 datapath [23] used for convolution-oriented multiply–accumulate execution. The highlighted path shows the multiplier and accumulator routes used by the integer/SIMD MAC datapath. Floating-point operations use vendor floating-point IP cores that internally map portions of the arithmetic onto DSP48E1 resources, together with additional LUT-based logic for IEEE-754 processing.
Figure 5. Software integration diagram, from source code to memory files for the FPGA.
Figure 6. Integer convolution execution time comparison across kernel sizes at 50 MHz.
Figure 7. Floating-point convolution execution time comparison across kernel sizes at 50 MHz.
Figure 8. Execution speedup of the proposed processor relative to MicroBlaze across convolution kernel sizes at 50 MHz.
Figure 9. Average cycles per output sample for the proposed processor and the MicroBlaze implementation across different convolution kernel sizes.
Figure 10. Energy per output sample comparison between the proposed processor and MicroBlaze at 50 MHz.
Table 1. Comparison of FPGA platform, SIMD support, operating frequency, and resource utilization among the proposed processor, MicroBlaze, and a related FPGA soft processor implementation.

| Design | FPGA platform | SIMD | Frequency | LUTs | DSPs |
|---|---|---|---|---|---|
| Proposed | Basys3 | Yes | 50 MHz | 10,911 | 7 |
| MicroBlaze | Basys3 | No | 50 MHz | 5466 | 6 |
| [10] | Zedboard | No | 193 MHz | 2430 | 4 |
Table 2. Post-implementation resource utilization of the proposed processor on XC7A35T.

| Resource | Used | Available | Utilization (%) |
|---|---|---|---|
| LUT | 10,911 | 20,800 | 52.46 |
| LUTRAM | 129 | 9600 | 1.34 |
| Flip-Flops | 5723 | 41,600 | 13.76 |
| BRAM | 13 | 50 | 26.00 |
| DSP Slices | 7 | 90 | 7.78 |
| BUFG | 2 | 32 | 6.25 |
| MMCM | 1 | 5 | 20.00 |
Table 3. Post-implementation power consumption of the proposed processor on XC7A35T.

| Component | Power (W) | Contribution (%) |
|---|---|---|
| Clocks | 0.012 | 5 |
| Signals | 0.050 | 24 |
| Logic | 0.036 | 17 |
| BRAM | 0.006 | 3 |
| DSP | 0.001 | 1 |
| MMCM | 0.106 | 50 |
| I/O | 0.002 | <1 |
| Dynamic Power | 0.213 | 74 |
| Static Power | 0.073 | 26 |
| Total Power | 0.286 | 100 |
Table 4. Post-implementation resource utilization of the MicroBlaze processor on XC7A35T.

| Resource | Used | Available | Utilization (%) |
|---|---|---|---|
| LUT | 5466 | 20,800 | 26.28 |
| LUTRAM | 167 | 9600 | 1.74 |
| Flip-Flops | 3162 | 41,600 | 7.60 |
| BRAM | 16 | 50 | 32.00 |
| DSP Slices | 6 | 90 | 6.67 |
| BUFG | 2 | 32 | 6.25 |
| MMCM | 1 | 5 | 20.00 |
Table 5. Post-implementation power consumption of the MicroBlaze processor on XC7A35T.

| Component | Power (W) | Contribution (%) |
|---|---|---|
| Clocks | 0.009 | 6 |
| Signals | 0.015 | 10 |
| Logic | 0.014 | 9 |
| BRAM | 0.003 | 2 |
| DSP | 0.002 | 1 |
| MMCM | 0.106 | 71 |
| I/O | <0.001 | <1 |
| Dynamic Power | 0.148 | 67 |
| Static Power | 0.073 | 33 |
| Total Power | 0.221 | 100 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
