Article

Adaptive Energy–Accuracy Trade-Offs in Configurable MAC Architectures for AI Acceleration

by Turki Alnuayri 1,2,* and Ibrahim Haddadi 1,2
1 Department of Computer Engineering, College of Computer Science and Engineering, Taibah University, Madinah 42353, Saudi Arabia
2 Energy, Industry, and Advanced Technologies Research Center, Taibah University, Madinah 41477, Saudi Arabia
* Author to whom correspondence should be addressed.
Electronics 2026, 15(5), 1129; https://doi.org/10.3390/electronics15051129
Submission received: 1 February 2026 / Revised: 22 February 2026 / Accepted: 4 March 2026 / Published: 9 March 2026

Abstract

Energy efficiency has become a primary bottleneck in hardware platforms supporting machine learning workloads, particularly as modern inference and training tasks demand sustained high-throughput computation. This challenge is further amplified in energy-harvesting and intermittently powered systems, where the available energy budget varies over time. This work introduces a run-time configurable multiply–accumulate (MAC) architecture that dynamically adjusts arithmetic precision to match instantaneous energy availability. The proposed design relies on an internally adaptive multiplier based on bit-level logic compression, enabling controlled modulation of power consumption while preserving numerical robustness. Crucially, the MAC maintains a fixed external operand interface, allowing for seamless precision adaptation without operand reformulation or datapath disruption. The architecture is implemented in SystemVerilog and evaluated using both ASIC synthesis in a 90 nm CMOS technology and FPGA deployment. Experimental results demonstrate approximately a fourfold improvement in power–delay product (PDP) relative to full-precision operation, with only limited degradation in inference accuracy.

1. Introduction

The practice of approximate computing has evolved as an efficient mechanism for mitigating the escalating energy demands of modern digital systems by intentionally relaxing numerical accuracy in favor of reductions in power consumption, silicon area, and execution latency [1]. Early studies in imprecise arithmetic demonstrated that many applications inspired by biological processing or statistical inference can tolerate bounded computational errors without significant loss in output quality [2]. These observations catalyzed extensive research into approximation techniques spanning circuit, architecture, and system levels.
Subsequent work explored a broad spectrum of approximation strategies, including logic simplification [3], precision scaling [4], and combined hardware–software approximation techniques for neural networks operating under constrained energy budgets [5,6]. Such approaches have proven particularly effective for Internet-of-Things (IoT) platforms and embedded convolutional neural networks (CNNs). However, the majority of these designs rely on static approximation, where the precision level is fixed at design time, limiting adaptability during operation.
This limitation is especially critical in arithmetic-intensive components such as multipliers. Numerous approximate multiplier architectures have been proposed, including configurable multipliers [7], dynamic-range designs [8], analog approximation [9], logarithmic arithmetic [10], and approximate floating-point units [11]. While these designs offer substantial energy savings, they typically lack mechanisms for continuous run-time adaptation in response to changing operating conditions or accuracy requirements.
Multipliers play a dominant role in modern computing systems for two fundamental reasons [12]. First, their complex logic and long carry propagation paths often define the critical path, making them major contributors to power and area. Second, machine learning workloads rely heavily on multiply–accumulate (MAC) operations, which constitute the bulk of computation during both inference and training. Consequently, improving multiplier efficiency directly impacts system-level energy efficiency.
Approximate multipliers can be broadly categorized into several classes. Voltage overscaling techniques [13] reduce energy by operating below nominal voltage levels but introduce timing-related errors that may affect high-significance bits. Truncation-based designs [14,15] eliminate low-significance partial products, trading accuracy for reduced complexity. Modular approximation approaches replace exact sub-blocks with simpler logic but often incur additional control overhead. More recent significance-aware designs preserve high-order bits while approximating lower-order logic [16].
The need for adaptability becomes particularly pronounced in edge and pervasive computing systems powered by energy harvesting. In such platforms, available energy fluctuates with environmental conditions, making static hardware configurations inefficient or even infeasible [3,17]. To ensure sustained operation, hardware must dynamically balance computational accuracy and energy consumption.
Despite extensive research on approximate arithmetic, relatively limited attention has been paid to run-time configurable MAC architectures that adapt approximation levels based on instantaneous energy availability while maintaining compatibility with neural network datapaths. This work addresses this gap by proposing a configurable MAC architecture specifically designed for machine learning acceleration in energy-variable environments.

2. Related Work

Run-time adaptability in arithmetic units has become increasingly important as computing systems transition toward energy-constrained, edge, and intermittently powered environments. Conventional approximate arithmetic techniques typically rely on fixed approximation levels selected at design time, limiting their ability to respond to dynamic operating conditions [1,3]. Other approaches support multiple precision levels by instantiating parallel datapaths or modular arithmetic blocks [14,15], which introduces non-negligible area and control overheads and reduces scalability. To overcome these limitations, this work considers a run-time configurable multiplier architecture that supports multiple operating modes within a single hardware instance. Rather than duplicating datapaths for exact and approximate computation, configurability is achieved through controlled modulation of internal logic activity. This unified approach enables multiple approximation levels to coexist within a single multiplier core, resulting in only a marginal increase in silicon area when compared to a conventional full-precision multiplier. Similar efficiency benefits from logic reuse have been observed in significance-aware and logic-compression-based approximate arithmetic structures [18,19].
A defining feature of the architecture is that approximation is introduced exclusively through bit-level logic compression within the multiplier. This strategy exploits the unequal contribution of operand bits to the final numerical result by selectively simplifying or disabling logic associated with low-significance bits, while preserving computation along high-significance paths [16]. Compared to truncation-based designs [15], bit-level compression provides finer granularity in accuracy control. Moreover, unlike voltage-overscaled approaches [13], the resulting approximation behavior is deterministic and does not depend on timing violations.
Importantly, operand widths, numerical formats, and external interfaces remain unchanged across all operating modes. This transparency eliminates the need for operand rescaling, quantization-aware retraining, or modification of higher-level algorithms, which are commonly required in precision-scaling and approximate neural-network accelerators [4,5]. Consequently, the configurable multiplier can be seamlessly integrated into standard multiply–accumulate (MAC) datapaths used in machine learning accelerators [11,12].
The ability to dynamically adjust arithmetic fidelity at run time makes the architecture particularly suitable for energy-harvesting and pervasive computing systems, where available energy fluctuates over time [17]. In such environments, static precision configurations can lead to either wasted energy or unacceptable degradation in output quality. Run-time configurable arithmetic therefore provides a foundation for continuously balancing energy consumption and computational accuracy in response to instantaneous operating conditions.

2.1. Problem Formulation

The objective of the proposed energy-aware configurable MAC architecture is to dynamically balance computational accuracy and energy consumption under time-varying and potentially intermittent energy availability. This problem is particularly relevant for energy-harvesting, IoT, and edge-AI platforms, where the instantaneous energy budget fluctuates due to environmental conditions and workload dynamics.
Let $E(t)$ denote the instantaneous available energy at time $t$, as measured through on-chip voltage, current, or energy-buffer sensing mechanisms. Let $M = \{\mathrm{HP}, \mathrm{MP}, \mathrm{LP}\}$ represent the set of supported MAC operating modes, ordered in descending accuracy. Each operating mode $m \in M$ is characterized by a fixed energy cost $E_m$ per MAC operation and an associated computational accuracy $A_m$, where $A_{\mathrm{HP}} \ge A_{\mathrm{MP}} \ge A_{\mathrm{LP}}$ and $E_{\mathrm{HP}} \ge E_{\mathrm{MP}} \ge E_{\mathrm{LP}}$.
At each decision point, the run-time configuration problem can be formulated as the following constrained optimization:
$m^*(t) = \arg\max_{m \in M} A_m \quad \text{subject to} \quad E_m \le E(t).$
If no operating mode satisfies the energy constraint, i.e., $E(t) < E_{\mathrm{LP}}$, computation is deferred until sufficient energy becomes available. This formulation ensures energy-safe execution while prioritizing the highest achievable computational accuracy under the prevailing energy conditions. The proposed Energy-Aware MAC Configuration Algorithm (EAMCA), formally described in Algorithm 1, implements this optimization using a lightweight, deterministic decision process with constant time complexity. Since the number of supported MAC operating modes is small and fixed, configuration selection incurs negligible computational overhead compared to the cost of MAC operations in ANN and CNN workloads. As a result, run-time adaptation can be performed continuously without impacting overall system throughput or learning performance.
Algorithm 1 Energy-Aware MAC Configuration Algorithm (EAMCA)
  • Inputs: instantaneous available energy $E(t)$; MAC operating modes $M = \{\mathrm{HP}, \mathrm{MP}, \mathrm{LP}\}$ ordered by decreasing computational accuracy; energy cost per MAC operation $\{E_{\mathrm{HP}}, E_{\mathrm{MP}}, E_{\mathrm{LP}}\}$.
  • Output: selected operating mode $m^* \in M$.
  1. Acquire the instantaneous energy budget $E(t)$ from the on-chip energy monitoring unit.
  2. If $E(t) < E_{\mathrm{LP}}$, suspend MAC execution and transition to the energy-monitoring state.
  3. Otherwise, evaluate modes $m \in M$ in descending order of computational accuracy:
     (a) If $E_m \le E(t)$, assign $m^* \leftarrow m$;
     (b) Terminate the search.
  4. Update the logic-compression control register to activate configuration $m^*$.
  5. Execute MAC operations under configuration $m^*$.
  6. Repeat the procedure at the next scheduling interval.
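As an illustration, the decision procedure of Algorithm 1 reduces to a constant-time scan over the small, fixed mode set. The following Python sketch mirrors steps 1–3; the per-MAC energy costs (in picojoules) are hypothetical placeholders, not values from this work.

```python
# Behavioral sketch of the EAMCA decision step (Algorithm 1).
# Per-MAC energy costs are illustrative placeholders, not measured values.

# Operating modes ordered by decreasing accuracy, with per-MAC energy cost (pJ).
MODES = [("HP", 4.0), ("MP", 2.0), ("LP", 1.0)]

def eamca_select(energy_budget):
    """Return the most accurate mode affordable under the budget,
    or None to signal that MAC execution must be suspended."""
    if energy_budget < MODES[-1][1]:   # E(t) < E_LP: defer computation
        return None
    for mode, cost in MODES:           # descending accuracy order
        if cost <= energy_budget:      # first affordable mode wins
            return mode
    return None
```

For example, a budget of 3.0 pJ is insufficient for HP (4.0 pJ) but affordable for MP, so MP-Mode is selected; a budget below the LP cost defers computation entirely.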

2.2. Motivation and Contributions

Recent advances in approximate arithmetic have demonstrated that significant energy savings can be achieved through bit-level logic compression in arithmetic units. While prior studies have shown the effectiveness of such techniques at the standalone circuit level, they do not address how configurable arithmetic can be systematically incorporated into neural architectures, nor how approximation levels can be selected dynamically in response to fluctuating energy availability. As a result, configurable arithmetic has largely remained an isolated circuit-level optimization rather than a system-level design paradigm.
The primary motivation of this work is to elevate configurable arithmetic from an isolated hardware technique to an energy-adaptive computation framework for machine learning accelerators. In energy-constrained and energy-harvesting systems, available power varies over time due to environmental and workload conditions. Static approximation strategies or fixed-precision designs are fundamentally limited in such environments, as they cannot adapt computation fidelity at run time to maintain an appropriate balance between energy consumption and inference accuracy.
To address these challenges, this paper introduces a configurable multiply–accumulate (MAC) architecture built around an adaptive logic-compression multiplier. The proposed MAC enables run-time modulation of arithmetic complexity through internally controlled approximation levels, allowing accuracy–energy trade-offs to be adjusted dynamically and reversibly. Crucially, this adaptability is achieved without modifying operand widths, external interfaces, or neural network models, preserving compatibility with existing training and deployment pipelines.
The main contributions of this work are summarized as follows:
  • A novel configurable MAC (CM) architecture that supports run-time selection of approximation levels through internal bit-level logic compression, enabling dynamic energy–accuracy trade-offs without datapath duplication;
  • Integration of the proposed configurable MAC into both Artificial Neural Network (ANN) and Convolutional Neural Network (CNN) architectures, with comprehensive evaluation across multiple datasets to quantify accuracy, energy consumption, and performance trade-offs;
  • End-to-end validation of the proposed energy-adaptive neuron architecture through ASIC synthesis and FPGA implementation, demonstrating practical feasibility, scalability, and effectiveness beyond standalone arithmetic units;
  • An Energy-Aware MAC Configuration Algorithm (EAMCA) that dynamically selects the optimal MAC operating mode based on instantaneous power or energy availability at run time.
The remainder of this paper is organized as follows. Section 3 describes the overall proposed methodology. Section 4.1 presents bit-level logic compression in configurable multipliers, and Section 4.2 describes the proposed configurable MAC architecture and its integration within neural network datapaths. Section 4.3 demonstrates the area, delay, and power trade-offs in ASIC implementations, and FPGA implementations are discussed in Section 4.4. Section 5 introduces the Energy-Aware MAC Configuration Algorithm (EAMCA) and analyzes its operation under variable energy conditions. Section 6 provides a comparative analysis against recent configurable approximate MAC designs. Finally, Section 7 concludes the paper.

3. Methodology

The methodology adopted in this work follows a structured, end-to-end design flow. First, a configurable multiplier employing bit-level logic compression is selected as the core arithmetic primitive. Second, this multiplier is integrated with a Carry Look-Ahead Adder (CLA) to form a configurable MAC unit. Third, the MAC is embedded within the neuron datapaths of both Convolutional Neural Network (CNN) and Artificial Neural Network (ANN) architectures. Fourth, the resulting designs are evaluated through ASIC synthesis using a 90 nm CMOS technology and through FPGA implementation. Finally, a run-time Energy-Aware MAC Configuration Algorithm (EAMCA) dynamically selects the MAC operating mode based on instantaneous energy availability. This systematic approach enables comprehensive evaluation of energy–accuracy trade-offs across circuit, architecture, and application levels.

3.1. Energy Model and Definition

To ensure consistent and reproducible evaluation across circuit-level, architectural, and application-level experiments, energy is defined using a unified hardware-grounded model. At the hardware level, the energy consumed by a single MAC operation in operating mode $m \in \{\mathrm{HP}, \mathrm{MP}, \mathrm{LP}\}$ is expressed as
$E_m = P_m \times T_{\mathrm{clk}},$
where $P_m$ denotes the average dynamic power consumption of the MAC operating in mode $m$, and $T_{\mathrm{clk}}$ represents the global system clock period.
The dynamic power $P_m$ is obtained from post-synthesis switching activity analysis using Synopsys Design Compiler (Version R-2022.09, Synopsys Inc., Mountain View, CA, USA) under identical supply voltage (1.0 V), temperature, load, and clock constraints for all operating modes. Switching activity is estimated using a uniform random input distribution with an activity factor of 0.5 to ensure consistent, mode-independent comparisons. Since operand bitwidths and external datapath interfaces remain identical across modes, variations in power consumption arise exclusively from internal logic suppression within the multiplier.
The registered clock period $T_{\mathrm{clk}}$ is fixed for all modes and is determined by the worst-case critical path delay of HP-Mode (2.66 ns). Although LP-Mode exhibits a shorter post-synthesis combinational delay (1.54 ns), and MP-Mode shows a slightly longer delay (2.88 ns) due to internal logic restructuring and modified carry-propagation paths, the system clock is not varied across configurations. All modes operate under a common synchronous clock frequency defined by the HP-Mode timing constraint. Consequently, reported energy improvements arise solely from reduced dynamic switching activity rather than frequency scaling.
The non-monotonic delay behavior observed in MP-Mode is attributed to additional internal control logic introduced during partial-product suppression, which alters the critical path relative to HP-Mode. In contrast, LP-Mode achieves a shorter delay through more aggressive pruning of low-significance logic clusters, effectively shortening the carry-propagation depth.
At the application level, the total energy required for a neural network inference is computed as the cumulative energy of all executed MAC operations:
$E_{\mathrm{inf}} = \sum_{i=1}^{N_{\mathrm{MAC}}} E_{m_i},$
where $N_{\mathrm{MAC}}$ denotes the total number of MAC operations required during forward propagation, and $m_i$ represents the operating mode selected for the $i$-th MAC instance. This formulation naturally supports hybrid configurations in which different layers operate under different approximation levels.
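The cumulative model admits a direct sketch. In the following Python fragment, the per-mode energies are hypothetical placeholders, and the schedule is expressed as one (mode, MAC-count) pair per layer to reflect hybrid layer-wise configurations.

```python
# Sketch of the application-level energy model: total inference energy is the
# sum of per-MAC energies, where each layer may run in a different mode.
# Per-mode energies (pJ) below are illustrative, not measured values.

E_MODE = {"HP": 4.0, "MP": 2.0, "LP": 1.0}

def inference_energy(mode_schedule):
    """mode_schedule: list of (mode, num_macs) pairs, e.g. one entry per layer."""
    return sum(E_MODE[m] * n for m, n in mode_schedule)
```

For instance, a hybrid schedule running 100 MACs in HP-Mode and 900 in LP-Mode yields 4.0·100 + 1.0·900 = 1300 pJ under these placeholder costs.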
In energy-aware execution scenarios, the instantaneous available energy $E(t)$ represents the energy budget safely expendable within a scheduling interval. The Energy-Aware MAC Configuration Algorithm (EAMCA) compares this budget against the per-MAC energy costs $\{E_{\mathrm{HP}}, E_{\mathrm{MP}}, E_{\mathrm{LP}}\}$ and selects the most accurate mode satisfying the constraint $E_m \le E(t)$.
Static leakage power is excluded from the primary energy comparison because all operating modes share identical technology, supply voltage, and standby conditions. Post-synthesis analysis indicates that leakage contributes less than 15% of total power in HP-Mode under nominal 90 nm CMOS conditions. Since internal logic suppression primarily reduces switching capacitance, the dominant variation across modes arises from dynamic power reduction. A discussion of expected scaling behavior in advanced technology nodes, where leakage may represent a larger power fraction, is provided in Section 4.3.5.

3.2. Numeric Representation and Training Configuration

All hardware implementations employ signed fixed-point arithmetic using a uniform 16-bit Q1.15 format in two's-complement representation. This format allocates one sign bit and fifteen fractional bits, with no explicit integer magnitude bit. The representable numeric range is
$[-1,\; 1 - 2^{-15}],$
with a resolution of $2^{-15}$. Both weights and activations are stored in this format, ensuring high fractional precision while maintaining compact ASIC and FPGA implementation.
Multiplication is performed using two 16-bit operands, producing a 32-bit intermediate product. Accumulation is carried out in full 32-bit precision to prevent overflow during multi-term MAC operations. The final MAC output is reduced to 16-bit Q1.15 format using deterministic round-to-nearest-even rounding with saturation logic to avoid overflow. No stochastic rounding is applied, ensuring fully deterministic hardware behavior.
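This fixed-point datapath can be modeled bit-exactly in software. The Python sketch below reproduces the 16 × 16 → 32-bit multiply, full-precision accumulation, and round-to-nearest-even reduction with saturation; the helper names are ours, not identifiers from the hardware implementation.

```python
# Bit-exact behavioral model of the Q1.15 MAC path described above.
# Operands are 16-bit two's-complement integers in [-2^15, 2^15 - 1];
# Python's unbounded ints stand in for the 32-bit hardware accumulator.

def rne_shift(x, k):
    """Right-shift x by k bits with round-to-nearest-even on the dropped bits."""
    q, r = divmod(x, 1 << k)          # Python floor division: 0 <= r < 2^k
    half = 1 << (k - 1)
    if r > half or (r == half and q & 1):
        q += 1                        # round up past halfway; ties go to even
    return q

def mac_q15(acc, w, a):
    """Accumulate a Q1.15 x Q1.15 product (Q2.30) without intermediate rounding."""
    return acc + w * a

def to_q15(acc):
    """Reduce the accumulator back to Q1.15 with RNE rounding and saturation."""
    r = rne_shift(acc, 15)            # realign Q2.30 -> Q1.15
    return max(-(1 << 15), min((1 << 15) - 1, r))
```

For example, squaring 0.5 (code 16384) and reducing yields code 8192, i.e., 0.25; squaring −1.0 saturates to the largest representable code, since +1.0 is not representable in Q1.15.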
Training is performed offline using the PyTorch (Version 2.2.0, PyTorch Foundation, https://pytorch.org/, accessed on 2 December 2025) framework. Stochastic Gradient Descent (SGD) is employed with a learning rate of 0.01 and a batch size of 128. Weights are initialized using Xavier initialization. Cross-entropy loss is used for MNIST classification, while Mean Squared Error (MSE) is used for XOR and IRIS datasets.
For MNIST, the standard 60,000/10,000 training/testing split is used. For UCI Breast Cancer and Binary IRIS datasets, 60% of samples are used for training and 40% for testing. All datasets are normalized prior to training. MNIST pixel values are scaled to the interval [ 0 , 1 ] .
To ensure statistical rigor, each experiment is repeated five times with independent random seeds. Reported accuracy values correspond to the mean across runs, and 95% confidence intervals are computed as $\mu \pm 1.96\,\sigma/\sqrt{5}$.
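The reported statistic can be reproduced as follows; this sketch uses the population standard deviation, since the text does not state whether the sample or population estimate is intended.

```python
# Mean accuracy over repeated runs with a 95% normal-approximation
# confidence half-width, mu +/- 1.96 * sigma / sqrt(n).
import math

def mean_ci95(runs):
    n = len(runs)
    mu = sum(runs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in runs) / n)  # population std
    half = 1.96 * sigma / math.sqrt(n)
    return mu, half
```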
After floating-point training, weights are quantized to Q1.15 using symmetric uniform quantization:
$w_{\mathrm{int}} = \mathrm{clip}\!\left(\lfloor w_f \cdot 2^{15} \rceil,\; -2^{15},\; 2^{15} - 1\right), \qquad w_q = w_{\mathrm{int}} / 2^{15},$
where $\lfloor \cdot \rceil$ denotes round-to-nearest-even rounding. The same quantized weight set is deployed across HP-, MP-, and LP-Modes. No quantization-aware retraining is performed, ensuring that accuracy variations arise solely from arithmetic approximation.
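Under these conventions, the quantization step can be sketched directly; note that Python's built-in round() already implements round-half-to-even.

```python
# Symmetric Q1.15 weight quantization as described above:
# w_int = clip(rne(w_f * 2^15), -2^15, 2^15 - 1), w_q = w_int / 2^15.

def quantize_q15(w_f):
    scaled = w_f * (1 << 15)
    w_int = round(scaled)                            # round-half-to-even
    w_int = max(-(1 << 15), min((1 << 15) - 1, w_int))  # saturating clip
    return w_int, w_int / (1 << 15)                  # integer code, dequantized value
```

For instance, 0.5 maps exactly to code 16384, while 1.0 exceeds the representable range and saturates to code 32767 (approximately 0.99997).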

4. Proposed Architectures and Implementations

4.1. Bit-Level Logic Compression in Configurable Multipliers

Figure 1 illustrates the configurable multiplier architecture proposed in this work, which forms the arithmetic foundation of the energy-adaptive MAC design. The multiplier employs significance-aware internal logic suppression, where low-significance partial-product logic is selectively disabled while preserving all high-significance computation paths. This approach enables controlled approximation entirely within the multiplier datapath, without modifying operand formats or external interfaces.
In the HP-Mode (High-Precision), the multiplier operates equivalently to a conventional exact multiplier, generating and accumulating all partial products to produce a full-precision result. In contrast, the MP-Mode (Moderate-Precision) and LP-Mode (Low-Precision) progressively suppress or simplify logic associated with lower-significance partial products. Since these bits contribute marginally to the final numerical value, their controlled elimination significantly reduces switching activity and hardware complexity while preserving acceptable accuracy.
A key advantage of the proposed architecture is that all precision modes are realized within a single hardware instance. The same network of single-bit adders is reused across operating modes, eliminating the need for duplicated datapaths commonly found in multi-precision designs. As a result, run-time precision scalability is achieved with negligible area overhead and without structural reconfiguration.
When operating in reduced-precision modes, the diminished active logic leads to lower dynamic power consumption, reduced effective critical path length, and decreased capacitive switching. These effects collectively translate into substantial energy savings, particularly for error-tolerant machine learning workloads. Unlike designs that alter operand bitwidths at the interface, the proposed significance-aware internal logic suppression mechanism preserves a fixed external data format, simplifying integration into existing neural network architectures and enabling seamless run-time adaptation.
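As a purely behavioral illustration of this mode-dependent suppression (not the gate-level design, which also handles signed Q1.15 operands), partial-product bits below a significance column can be masked before accumulation. The per-mode thresholds below are hypothetical placeholders.

```python
# Behavioral model of significance-aware partial-product suppression for an
# unsigned multiplier: in MP/LP modes, partial-product bits falling below a
# significance column K are dropped. K values per mode are illustrative.

K = {"HP": 0, "MP": 8, "LP": 12}  # hypothetical suppression thresholds

def compressed_mul(a, b, mode, width=16):
    k = K[mode]
    result = 0
    for i in range(width):
        if (b >> i) & 1:
            pp = a << i               # i-th partial product
            pp &= ~((1 << k) - 1)     # suppress bits below column k
            result += pp
    return result
```

With K = 0, the masks are no-ops and HP-Mode reproduces the exact product; each suppressed partial product loses strictly less than $2^k$, so the total error is bounded by $\text{width} \cdot 2^k$ and the approximate result never exceeds the exact one.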

Run-Time Mode Switching and Transient Behavior

A key implementation concern in run-time configurable arithmetic units is the management of transient behavior during operating-mode transitions. In the proposed architecture, mode switching is performed in a strictly clock-synchronous manner and only at safe MAC-boundary decision points. The operating mode (HP, MP, or LP) is selected through a registered control signal generated by the Energy-Aware MAC Configuration Algorithm (EAMCA). This control signal is sampled on the rising edge of the system clock and applied to internal logic-compression enable signals in the subsequent cycle. Importantly, switching is not permitted during partial accumulation or within a pipeline stage. Instead, configuration updates occur only between MAC operations or at layer-level scheduling boundaries. Because operand widths and external datapath interfaces remain unchanged across all operating modes, no structural modification of the pipeline occurs during transitions. Timing hazards, metastability conditions, and combinational feedback paths are eliminated because the Carry Look-Ahead Adder (CLA) stage and accumulation registers function identically in each mode.
Regarding transient power behavior, HP-Mode represents the worst-case dynamic switching envelope. Transitions from HP to LP reduce active logic and therefore do not introduce additional switching spikes beyond characterized bounds. Transitions from LP to HP re-enable previously suppressed low-significance logic; however, the resulting activity remains within the nominal HP-mode switching profile already captured during post-synthesis analysis. Consequently, no additional guard cycles or pipeline stalls are required, and mode switching incurs zero cycle latency overhead.

4.2. Configurable Neuron Architecture

The proposed neuron architecture is centered around a run-time configurable Multiply–Accumulate (MAC) unit, where the multiplier constitutes the dominant source of energy consumption and performance variation. The Carry Look-Ahead Adder (CLA) was implemented for the accumulation stage, as it offers a favorable balance between speed and power efficiency compared to ripple-carry or carry-select alternatives [20].
A defining property of the neuron design is that all operands are supplied at full precision, irrespective of the selected operating mode. Unlike approaches that adjust operand bitwidths or apply external truncation, the proposed neuron preserves a fixed $N$-bit input interface for both weights and activations. Run-time adaptability is achieved exclusively within the multiplier through bit-level logic compression. Two elements must be considered:
  • Internal precision adaptation. The neuron supports three internal operating modes: HP-Mode, MP-Mode, and LP-Mode. These modes determine the extent to which low-significance partial products are generated and reduced inside the multiplier. Higher-significance logic is always preserved to protect the dominant numerical contribution of the multiplication, while progressively lower-significance logic is suppressed as the operating mode shifts from HP to LP. The CLA consistently receives a full $2N$-bit product and performs exact accumulation across all modes. As a result, approximation affects only the multiplication stage, whereas the accumulation behavior remains deterministic and mode-independent.
  • Interface stability and run-time reconfiguration. Because operand formatting and datapath width remain unchanged, transitions between operating modes do not require datapath reconfiguration, operand rescaling, or retraining of the neural network. This internalized adaptation enables seamless run-time switching between modes without disrupting network-level operation or control flow.
The neuron module is replicated across layers to construct both ANN and CNN architectures. As illustrated in Figure 2, each neuron multiplies incoming weight–activation pairs using the configurable MAC, accumulates partial sums through the CLA, and forwards the result to the activation function. An external control signal generated by the Energy-Aware MAC Configuration Algorithm (EAMCA) selects the operating mode dynamically. The EAMCA is discussed further in Section 5.

4.3. Trade-Offs for Power, Delay, and Area in ASIC Designs

To quantify the hardware-level impact of run-time configurability, the proposed MAC architecture was synthesized using a uniform 90 nm CMOS technology. All configurations were mapped to the same standard-cell library and evaluated under identical voltage, timing, and load constraints to ensure a fair comparison.
Unlike prior MAC designs that rely on replicated datapaths or speculative execution [21,22], the proposed architecture scales energy and performance by modulating internal logic activity. Three variants were synthesized independently: HP-Mode, MP-Mode, and LP-Mode. Each variant represents a fully realizable hardware instance operating under a fixed internal configuration.

4.3.1. Normalized ASIC Comparison

The absolute MAC datapath characteristics reported in Table 1 directly reflect the hardware structure shown in Figure 2. The labels HP-Mode, MP-Mode, and LP-Mode correspond to the same configurable MAC architecture synthesized under different internal logic-compression configurations. Accordingly, the reported silicon area, delay, and power values represent absolute post-synthesis results obtained from independent synthesis runs using the same 90 nm CMOS technology. More aggressive suppression of low-significance logic in LP-Mode leads to substantial reductions in area and energy consumption compared to MP-Mode and HP-Mode, despite identical external interfaces.

4.3.2. Implementation of ANNs

Using the proposed configurable MAC architecture, several ANNs are implemented to evaluate system-level energy–accuracy trade-offs across representative machine learning benchmarks. Table 2 summarizes the training configuration parameters, including network topology, number of epochs, and activation functions used for each evaluated dataset. For reproducibility and fair comparison across HP-, MP-, and LP-Modes, the network topology (including the number of hidden neurons), activation functions, training epochs, numeric precision, and dataset preprocessing pipeline remain strictly identical across all operating modes. Thus, any observed variation in inference accuracy or energy efficiency arises exclusively from internal arithmetic approximation rather than architectural or training differences.
The neuron module illustrated in Figure 2 serves as the fundamental computational unit of all evaluated networks, with the configurable MAC constituting the dominant arithmetic component. Each neuron performs weight–activation multiplication using the configurable multiplier, accumulates partial sums through a Carry Look-Ahead Adder (CLA), and applies a nonlinear activation function to produce the neuron output.
Model complexity versus hardware approximation. Although the evaluated networks are intentionally lightweight to facilitate architectural analysis, it is important to distinguish between model-level simplification and hardware-level approximation. Reducing the number of layers or neurons represents a static design-time choice that permanently fixes the accuracy–energy trade-off. In contrast, the proposed configurable MAC enables run-time adaptation of energy consumption by modifying internal arithmetic behavior without altering network topology or retraining.
Even for compact networks, MAC operations dominate energy consumption during both training and inference. Consequently, configurable approximate MAC units yield additional energy savings beyond those achievable through model simplification alone. Moreover, unlike reduced-complexity models, the proposed architecture allows accuracy to be increased opportunistically when energy is abundant and gracefully degraded when energy is constrained. To provide a reference baseline, reduced-complexity networks employing HP-Mode MAC units are also evaluated by decreasing the number of neurons per layer. This baseline captures energy reductions achieved through algorithmic simplification rather than hardware approximation. Since these approaches are orthogonal, configurable approximation can be combined with reduced models to further extend operation under strict energy budgets.

The ANN learning process consists of forward propagation followed by backward propagation. The hidden and output layers dominate both power consumption and silicon area due to their intensive use of MAC operations.
The hidden-layer activation is computed as
h = f(W_hx · x + b_h),
and the output-layer activation is given by
y = Φ(W_yh · h + b_y),
where W_hx and W_yh denote the input-to-hidden and hidden-to-output weight matrices, respectively; f(·) and Φ(·) represent the activation functions of the hidden and output layers; and b_h and b_y are the corresponding bias vectors.
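As a concrete illustration, the two forward-propagation equations can be sketched in a few lines of NumPy. The sigmoid used for both f(·) and Φ(·) matches the Noisy XOR and Binary IRIS configurations in Table 2; the 2–H–1 topology with H = 2 and the random weight values are illustrative placeholders only.

```python
import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + e^{-z}); used here for both hidden and output layers
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hx, b_h, W_yh, b_y):
    """Forward propagation: h = f(W_hx x + b_h), y = Phi(W_yh h + b_y)."""
    h = sigmoid(W_hx @ x + b_h)   # hidden-layer activation
    y = sigmoid(W_yh @ h + b_y)   # output-layer activation
    return h, y

# 2-H-1 Noisy XOR topology with H = 2 hidden neurons (illustrative weights)
rng = np.random.default_rng(0)
W_hx, b_h = rng.normal(size=(2, 2)), np.zeros(2)
W_yh, b_y = rng.normal(size=(1, 2)), np.zeros(1)
h, y = forward(np.array([1.0, 0.0]), W_hx, b_h, W_yh, b_y)
```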
The neuron modules are implemented in SystemVerilog and synthesized using Synopsys Design Compiler (Version R-2022.09, Synopsys Inc., Mountain View, CA, USA), with all designs mapped to the same Faraday 90 nm standard-cell library. The power, area, and timing metrics extracted from synthesis are then integrated into the Energy-Aware MAC Configuration Algorithm (EAMCA), which dynamically selects the MAC operating mode at run time based on the available energy.
Training is performed offline using backpropagation with stochastic gradient descent. For the Noisy XOR and Binary IRIS datasets, 60% of samples are allocated for training and the remaining 40% for testing. Each experiment is repeated five times using independent random weight initialization to ensure statistical robustness.
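The evaluation protocol above can be mirrored in a short sketch: a seeded 60/40 shuffle-split repeated with five independent seeds, echoing the five independently initialized runs. The helper name and placeholder dataset are illustrative; the actual experiments use the Noisy XOR and Binary IRIS data.

```python
import random

def split_60_40(samples, seed):
    """Shuffle and split samples into 60% training / 40% testing."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)   # deterministic per-seed shuffle
    cut = int(0.6 * len(samples))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

# five repetitions with independent seeds, mirroring the five independent
# random weight initializations used for statistical robustness
data = list(range(100))                # placeholder dataset
splits = [split_60_40(data, seed) for seed in range(5)]
```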
Figure 3 illustrates the learning convergence behavior across operating modes. HP-Mode converges most rapidly, followed by MP-Mode and LP-Mode configurations. Although higher approximation levels require additional training epochs due to increased internal arithmetic error, all configurations achieve stable convergence and satisfactory inference accuracy.
The observed learning behavior reveals a clear relationship between the degree of internal logic compression and the convergence speed of the training process. Configurations that activate larger logic clusters converge faster, whereas more aggressive compression increases the number of iterations required to reach optimal weights. Nevertheless, approximate configurations demonstrate robust learning performance while delivering substantial energy savings.
Extending the evaluation to the MNIST dataset, a three-layer ANN with 784 inputs, a single hidden layer, and 10 output neurons is implemented using pretrained weights derived from the methodology of [23]. For hardware evaluation, the network topology, activation functions, and numeric precision remain fixed across all MAC operating modes to ensure equivalent computational complexity.
When mapped to the proposed hardware, HP-Mode achieves up to 96% inference accuracy for this lightweight ANN configuration, closely matching software baselines. In contrast, LP-Mode maintains inference accuracy above 90% while significantly reducing energy consumption.
Figure 4 illustrates the inference accuracy across the evaluated MAC operating modes, confirming that internal logic compression enables a favorable energy–accuracy trade-off without altering network architecture or retraining procedures.
  • Baseline: Reduced-Complexity Networks (Model Simplification).
To assess whether energy savings can be achieved solely through architectural simplification, a reduced-complexity baseline is considered in which the number of neurons per layer is decreased while retaining HP-Mode MAC units. This baseline captures energy reduction through algorithmic simplification rather than hardware approximation.
Model simplification and configurable approximation are orthogonal strategies: reducing neurons lowers the total number of MAC operations, whereas internal logic compression reduces the energy per MAC operation while enabling run-time adaptation under fluctuating energy budgets. Consequently, both techniques can be combined to extend operation under strict energy constraints while preserving a target accuracy level.
To benchmark the proposed configurable MAC against state-of-the-art approximate arithmetic designs, we compare our approach with the approximate multiplier framework reported in [24] using the MNIST classification task. In the reference work, a single-hidden-layer multilayer perceptron (MLP) with 784 inputs and 10 outputs achieves approximately 97% classification accuracy when employing 300 hidden neurons and a sigmoid activation function.
To ensure a fair and controlled comparison, the network size and depth used in our hardware assessment were adjusted, as detailed in Section 4.3.2, to match the computational complexity of the reference design. For fairness, equivalent computational complexity is defined as identical MAC operation count per inference, identical parameter count, identical activation functions, identical numeric precision, and identical dataset preprocessing pipeline. All comparative experiments were conducted under these strictly matched conditions to ensure that observed differences in accuracy and energy efficiency arise solely from arithmetic approximation mechanisms rather than architectural or training discrepancies.
Moreover, approximation is selectively applied only to hidden-layer MAC units, while output-layer MACs operate in HP-Mode to avoid direct degradation of the final classification decision.
Figure 5 further illustrates that the proposed LP→HP hybrid configuration exhibits less than a 4% reduction in inference accuracy relative to [24], despite employing more aggressive internal logic compression. In contrast, the MP→HP configuration demonstrates a negligible loss in output quality, limited to approximately 1%, while delivering notable improvements in energy efficiency. These results confirm that selective internal approximation enables a more favorable energy–accuracy trade-off than uniform or interface-level approximation strategies.
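The paper's approximation operates inside the multiplier's partial-product logic, which cannot be reproduced directly in software. As a rough behavioral stand-in, the sketch below suppresses the low-significance bits of each fixed-point product in the hidden layer while keeping output-layer multiplications exact, echoing the LP→HP hybrid idea; the bit widths, helper names, and error magnitudes are assumptions, not the hardware's actual behavior.

```python
def approx_mul(a, b, drop_bits=4, frac_bits=8):
    """Behavioral stand-in for an LP-Mode multiplication: operands keep
    their full fixed-point format, but the lowest `drop_bits` of the
    product are zeroed (rounds negatives toward -inf)."""
    scale = 1 << frac_bits
    p = int(round(a * scale)) * int(round(b * scale))  # quantized product
    p &= ~((1 << drop_bits) - 1)                       # suppress low bits
    return p / (scale * scale)

def hybrid_dot(w, x, approximate):
    """Dot product with approximate (hidden-layer-like) or exact
    (output-layer, HP-Mode-like) multiplication."""
    mul = approx_mul if approximate else (lambda a, b: a * b)
    return sum(mul(wi, xi) for wi, xi in zip(w, x))

w, x = [0.51, -0.23, 0.77], [0.99, 0.47, -0.52]
exact = hybrid_dot(w, x, approximate=False)    # exact reference
approx = hybrid_dot(w, x, approximate=True)    # LP-like approximation
```

The key property mirrored here is that the operand interface stays fixed: only the product's internal low-significance contribution is suppressed, so the error stays bounded and small relative to the exact result.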

4.3.3. Discussion of ASIC Results

Table 3 reports normalized ASIC synthesis metrics for the proposed configurable MAC under different operating modes, aggregated over the MAC activity profiles induced by representative ANN workloads. Although results are grouped by dataset, the reported values reflect hardware-level reductions accumulated across all MAC operations executed during each workload, with all metrics normalized to the HP-Mode baseline.
It is important to clarify that the silicon area of the MAC datapath itself is constant and independent of the evaluated dataset. The dataset-wise grouping in Table 3 does not imply structural hardware variation; rather, it reflects workload-dependent activity patterns and hybrid configuration usage (e.g., MP→HP or LP→HP) across different network layers. Since each dataset induces a different distribution of MAC operations across hidden and output layers, the effective normalized reductions in power and PDP represent workload-weighted averages, not changes in the underlying synthesized hardware structure.

4.3.4. Implementation and Reproducibility Details

ASIC synthesis is performed using Synopsys Design Compiler (Version R-2022.09, Synopsys Inc., Mountain View, CA, USA), targeting a Faraday 90 nm standard-cell library at the typical (TT) process corner with a 1.0 V supply voltage. The clock period constraint is fixed to 2.66 ns, corresponding to the worst-case HP-Mode critical path delay. Switching activity is estimated using post-synthesis gate-level simulation with uniformly distributed random input vectors and an activity factor of 0.5.
FPGA synthesis is conducted using Xilinx Vivado Design Suite (Version 2023.1, AMD/Xilinx Inc., San Jose, CA, USA), targeting the Ultra96-V2 development board (Avnet Inc., Phoenix, AZ, USA). All designs are synthesized under identical timing constraints without manual placement directives to ensure fair comparison. Resource utilization is reported in terms of LUTs, flip-flops, and DSP blocks.
Due to institutional intellectual property policy, RTL source code cannot be publicly released. However, numeric formats, synthesis settings, tool versions, and key architectural parameters are fully documented to enable independent replication.

4.3.5. Technology Scaling and Leakage Considerations

The presented hardware evaluation is performed using a 90 nm CMOS technology node, where dynamic power dominates overall energy consumption under active workload conditions. Post-synthesis analysis indicates that static leakage contributes less than 15% of total power in HP-Mode under nominal voltage and temperature assumptions.
In more advanced technology nodes (e.g., 7 nm or 5 nm FinFET), leakage power can constitute a significantly larger fraction of total consumption. While the proposed architecture primarily reduces dynamic power through internal logic suppression, its structural characteristics remain beneficial under scaled technologies. Specifically, the selective disabling of low-significance logic reduces effective switched capacitance and switching activity, mechanisms that remain effective as technology scales.
Furthermore, the proposed internal logic compression mechanism is compatible with fine-grained power gating and multi-threshold voltage (multi-V_t) design strategies. When combined with cell-level power gating of suppressed logic clusters, additional leakage reduction can be achieved in advanced technology nodes. Although the relative energy improvement factor may diminish slightly when leakage becomes dominant, the architecture preserves its fundamental advantage of controllable arithmetic activity modulation. Future work will extend validation to advanced FinFET-based technologies to quantify scaling behavior more precisely.

4.4. Area Reduction and Inference Accuracy in FPGA Implementations

To assess the practicality of the proposed configurable MAC on reconfigurable hardware, the design is implemented on an FPGA and compared against a recent state-of-the-art approximate MAC architecture reported in [25]. All FPGA experiments are carried out using the Xilinx Vivado Design Suite (Version 2023.1, AMD/Xilinx Inc., San Jose, CA, USA) and mapped to the Ultra96-V2 development board (Avnet Inc., Phoenix, AZ, USA) [26], ensuring consistent tool flow, device constraints, and evaluation conditions. The FPGA evaluation focuses on inference workloads derived from the MNIST dataset using a LeNet-5 convolutional neural network (CNN) architecture [23]. The LeNet-5 architecture consists of two convolutional layers, each followed by a max-pooling stage, and two fully connected layers. Network training is performed offline using the PyTorch framework [27] with a batch size of 128 over 20 epochs. After convergence, the trained weights and biases are exported and integrated into a SystemVerilog-based hardware realization of the CNN for FPGA inference.
In CNN workloads, convolution layers dominate both arithmetic intensity and energy consumption. Consequently, these layers are selected as the primary candidates for approximation. In the proposed deployment, MAC units operating in HP-Mode are selectively replaced by LP-Mode configurable MACs within the convolution layers, while MACs in the remaining layers retain HP-Mode operation to safeguard output classification accuracy. This hybrid configuration exploits the error tolerance of intermediate feature extraction while preserving precision in the final decision stages.
Table 4 reports the FPGA synthesis results in terms of inference accuracy and logic area reduction. The proposed LP-Mode configurable MAC achieves a substantial reduction in FPGA resource utilization while maintaining an inference accuracy of approximately 91%. By contrast, the approximate MAC design reported in [25] incurs a significantly larger loss in inference accuracy despite delivering only limited area savings. For FPGA comparison, resource utilization metrics include LUTs, flip-flops, and DSP blocks. All designs are synthesized using Xilinx Vivado Design Suite (Version 2023.1, AMD/Xilinx Inc., San Jose, CA, USA), targeting the same Ultra96-V2 development board (Avnet Inc., Phoenix, AZ, USA) under identical synthesis constraints. Approximation is applied exclusively to convolution and fully connected MAC units, while control logic and memory blocks remain unmodified. To ensure fairness, LUT utilization is reported independently of DSP usage, and total logic utilization percentages are computed relative to full device capacity. These observations highlight the advantage of internal logic compression, which reduces hardware cost without aggressively perturbing numerical behavior at the architectural interface.
Overall, the FPGA results demonstrate that the proposed configurable MAC architecture is particularly suitable for implementation on reconfigurable platforms such as FPGAs. The ability to achieve meaningful logic area reduction with modest impact on inference accuracy confirms the robustness of the design and its applicability to practical edge-AI acceleration scenarios.

5. Energy-Aware MAC Configuration Algorithm (EAMCA)

Reliable computation under non-deterministic and time-varying power conditions requires a control mechanism capable of adapting arithmetic behavior to instantaneous energy availability. To address this requirement, we introduce the Energy-Aware MAC Configuration Algorithm (EAMCA), which governs the run-time selection of MAC operating modes within the proposed neuron architecture. The EAMCA corresponds to the configuration-selection block shown in Figure 2. The formal decision procedure implemented by the EAMCA is presented in Algorithm 1.
As specified in Algorithm 1, the EAMCA assumes that instantaneous energy availability can be monitored at run time, as supported in energy-harvesting systems through voltage, current, or energy-buffer sensing mechanisms [28]. At each scheduling interval, the available energy E(t) is compared against the minimum executable energy threshold E_LP. If E(t) < E_LP, computation is temporarily suspended to prevent unsafe operation under insufficient supply conditions.
When E(t) ≥ E_LP, Algorithm 1 selects the most accurate MAC configuration satisfying the instantaneous energy constraint. Rather than defaulting to the lowest-energy mode, the algorithm maximizes computational fidelity subject to E_m ≤ E(t). Modes are evaluated in descending accuracy order (HP → MP → LP), ensuring optimal precision within the feasible energy envelope.
Since the number of supported operating modes is fixed and small, the computational complexity of Algorithm 1 is O(|M|), which reduces to constant time in practice. Therefore, the control overhead is negligible relative to the cost of MAC operations in ANN and CNN workloads.
Execution proceeds whenever a feasible configuration exists. Suspension occurs only when no mode satisfies the energy constraint, thereby guaranteeing energy-safe operation under intermittent or energy-constrained supply conditions. By embedding configuration selection within the deterministic procedure defined in Algorithm 1, the system enables fine-grained arithmetic adaptation while preserving datapath stability and timing integrity.
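Algorithm 1 itself is not reproduced here, but the selection rule it implements can be sketched compactly in Python. The per-mode energies below reuse the PDP values from Table 1 (in fJ) purely as illustrative placeholders for E_m; a real deployment would calibrate these against the measured energy buffer.

```python
# Modes in descending accuracy order; energies are Table 1 PDP values (fJ),
# used here only as illustrative placeholders for the per-operation cost E_m.
MODES = [("HP", 186.7), ("MP", 125.6), ("LP", 46.0)]

def eamca_select(E_t):
    """Return the most accurate mode with E_m <= E(t); None means the
    budget is below E_LP, so computation is suspended."""
    for mode, E_m in MODES:        # HP -> MP -> LP: an O(|M|) scan
        if E_m <= E_t:
            return mode
    return None

mode = eamca_select(130.0)         # HP needs 186.7 fJ here, so MP is chosen
```

With only three supported modes, the scan is effectively constant-time, consistent with the negligible control overhead reported in Section 5.1.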

5.1. Control Overhead Analysis

The hardware overhead of the EAMCA controller was synthesized alongside the MAC datapath. The controller comprises a voltage comparator, a lightweight finite-state machine (FSM), and a configuration register. Post-synthesis analysis indicates that the controller introduces less than 2.1% additional silicon area and approximately 1.3% dynamic power overhead relative to the HP-Mode MAC datapath. The controller does not lie on the critical computation path, and no additional clock cycles are required for configuration updates, as mode selection occurs in parallel with scheduling logic. These results confirm that run-time configurability is achieved with negligible architectural and timing overhead.

5.2. Decision Granularity and Switching Behavior

Mode selection is performed at the granularity of an inference window (i.e., per-layer execution) rather than per individual MAC cycle. Once selected, the operating mode remains fixed for the duration of that layer’s MAC operations to prevent mid-accumulation instability and timing hazards.
To emulate realistic intermittently powered conditions, synthetic energy traces representing capacitor discharge trajectories are applied during simulation. The available energy budget E(t) is updated once per scheduling interval, and the EAMCA selects the highest-precision executable mode satisfying E_m ≤ E(t). This coarse-grained adaptation prevents rapid oscillation between modes and mitigates transient power fluctuations.
Mode transitions are synchronized to clock boundaries and latched through configuration registers external to the critical datapath. Switching is explicitly disallowed during active accumulation cycles, thereby eliminating metastability, pipeline stalls, and glitch propagation in high-speed operation. Consequently, run-time reconfiguration introduces zero-cycle latency and preserves deterministic timing behavior.
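To make the coarse-grained adaptation concrete, the following sketch drives the same selection rule with a synthetic capacitor-discharge trace, one decision per scheduling interval (i.e., per layer). The exponential-decay model, initial energy, and time constant are illustrative assumptions rather than measured traces; mode energies again reuse the Table 1 PDP figures as placeholders.

```python
import math

MODES = [("HP", 186.7), ("MP", 125.6), ("LP", 46.0)]  # descending accuracy

def select(E_t):
    # most accurate mode whose per-operation energy fits, else None (suspend)
    return next((m for m, e in MODES if e <= E_t), None)

def run_trace(E0=250.0, tau=4.0, intervals=10):
    """Synthetic discharge E(t) = E0 * exp(-t / tau); one mode decision per
    scheduling interval, held fixed for that layer's MAC operations."""
    return [select(E0 * math.exp(-t / tau)) for t in range(intervals)]

schedule = run_trace()   # degrades from HP through MP/LP to suspension
```

Because the decision is latched once per interval, the mode never changes mid-accumulation, matching the switching restrictions described above.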

6. Comparative Analysis of Configurable Approximate MAC Designs

Table 5 summarizes and contrasts the proposed architecture with recent configurable approximate MAC designs, focusing on their reconfiguration mechanisms, supported platforms, and application domains [24,25,29,30,31].
A key distinction between the proposed approach and many existing designs lies in how accuracy–energy trade-offs are realized. Several prior works rely on bitwidth scaling or operand truncation, which modifies arithmetic precision at the datapath interface. While effective in reducing hardware cost, such techniques often require quantization-aware training, operand rescaling, or layer-wise revalidation, thereby increasing system-level complexity.
In contrast, the proposed architecture preserves a fixed input/output operand interface and performs adaptation entirely within the multiplier through bit-level logic compression. By confining approximation to the internal arithmetic structure, the design maintains compatibility with existing neuron datapaths and avoids the need for operand reformatting or architectural redesign. This internalized adaptation enables seamless run-time reconfiguration without disrupting network operation.
Prior efforts have demonstrated important advances in configurable arithmetic, but they are typically constrained in scope. For example, DyRecMul [29] introduces a reconfigurable LUT-based multiplier optimized for INT8 inference on FPGA platforms, whereas RUCA [30] emphasizes quality-controlled circuit selection in ASIC implementations. Other approaches, such as AMG [31], employ optimization techniques to identify approximate arithmetic structures but remain tied to specific design flows or platforms.
The proposed work differs by combining internal arithmetic adaptability with system-level energy awareness. Beyond the configurable MAC itself, the architecture integrates an Energy-Aware MAC Configuration Algorithm (EAMCA) that selects the most suitable operating mode at run time based on instantaneous energy availability. This combination of configurable arithmetic and energy-driven control is validated across both ASIC synthesis and FPGA deployment, demonstrating portability and practical feasibility.
Overall, while existing approaches typically address isolated objectives such as energy reduction, precision tuning, or platform specialization, the proposed architecture offers a unified solution that jointly addresses adaptability, deployment flexibility, and system-level energy management. This broader design scope makes the proposed approach particularly well suited for energy-constrained and heterogeneous edge-AI acceleration scenarios.

7. Conclusions

This paper presented a run-time configurable multiply–accumulate (MAC) architecture that enables dynamic energy–accuracy adaptation through significance-aware internal logic compression. Unlike conventional approximate designs that modify operand precision at the datapath interface, the proposed approach preserves a fixed external operand format and performs approximation exclusively within the multiplier structure. This architectural choice enables seamless run-time reconfiguration without operand rescaling, retraining, or structural datapath modification.
The configurable MAC was described in SystemVerilog and evaluated through ASIC synthesis using a 90 nm CMOS technology and FPGA deployment on the Ultra96-V2 development board (Avnet Inc., Phoenix, AZ, USA). Post-synthesis analysis shows that the LP-Mode configuration achieves up to a 4.06× reduction in power–delay product (PDP) relative to HP-Mode, together with approximately a 42% reduction in critical path delay (2.66 ns to 1.54 ns). These improvements are obtained under a fixed system clock frequency determined by the HP-Mode critical path, ensuring fair and synchronous comparison across all operating modes.
To enable system-level adaptability, the MAC was integrated with the Energy-Aware MAC Configuration Algorithm (EAMCA), which selects the most accurate executable configuration under instantaneous energy constraints. Synthesis results confirm that the controller introduces negligible architectural overhead and does not lie on the critical computation path.
The configurable MAC was further embedded within ANN and CNN neuron architectures and evaluated across multiple benchmark datasets. Experimental results demonstrate substantial reductions in dynamic power consumption and silicon area with only modest degradation in inference accuracy. Even under aggressive internal logic compression, inference accuracy remains above 89%, confirming stable learning behavior and robustness under variable energy conditions. Hybrid configurations that selectively approximate hidden layers further improve the energy–accuracy trade-off while preserving output fidelity.
Overall, the results establish that internal logic compression, combined with lightweight energy-aware run-time control, provides a practical and scalable foundation for power-adaptive machine learning acceleration. This framework is particularly well-suited for energy-constrained and intermittently powered edge-AI platforms. Future work will extend validation to advanced CMOS technology nodes and real-world energy-harvesting deployment scenarios.

Author Contributions

Conceptualization, T.A. and I.H.; software, I.H.; methodology, T.A. and I.H.; validation, T.A. and I.H.; investigation, T.A. and I.H.; resources, T.A. and I.H.; writing—original draft preparation, T.A. and I.H.; writing—review and editing, T.A. and I.H. All authors have read and agreed to the published version of the manuscript.

Funding

This scientific paper is derived from a research grant funded by Taibah University, Madinah, Kingdom of Saudi Arabia, with grant number (1126-15-447).

Data Availability Statement

The data presented in this study are available in the article. Further information can be obtained from the corresponding author upon request.

Acknowledgments

The authors would like to express their sincere gratitude to Taibah University, Madinah, Saudi Arabia, and the Energy, Industry, and Advanced Technologies Research Centre, Taibah University, Madinah, Saudi Arabia, for supporting this research through grant number (1126-15-447).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sekanina, L. Introduction to Approximate Computing: Embedded Tutorial. In Proceedings of the IEEE 19th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), Kosice, Slovakia; IEEE: Piscataway, NJ, USA, 2016; pp. 1–6. [Google Scholar]
  2. Mahdiani, H.; Ahmadi, A.; Fakhraie, S.M.; Lucas, C. Bio-Inspired Imprecise Computational Blocks for Efficient Digital Signal Processing. IEEE Trans. Circuits Syst. I Regul. Pap. 2010, 57, 850–862. [Google Scholar] [CrossRef]
  3. Shafik, R.; Yakovlev, A.; Das, S. Real-Power Computing. IEEE Trans. Comput. 2018, 67, 1445–1461. [Google Scholar] [CrossRef]
  4. Wang, E.; Davis, J.J.; Zhao, R.; Ng, H.-C.; Niu, X.; Luk, W.; Cheung, P.Y.K.; Constantinides, G.A. Deep Neural Network Approximation for Custom Hardware: Where We’ve Been, Where We’re Going. ACM Comput. Surv. 2019, 52, 40. [Google Scholar] [CrossRef]
  5. Masadeh, M.; Hasan, O.; Tahar, S. Input-Conscious Approximate Multiply-Accumulate (MAC) Unit for Energy Efficiency. IEEE Access 2019, 7, 147129–147142. [Google Scholar] [CrossRef]
  6. Pinos, M.; Mrazek, V.; Vaverka, F.; Vasicek, Z.; Sekanina, L. Acceleration Techniques for Automated Design of Approximate Convolutional Neural Networks. IEEE J. Emerg. Sel. Top. Circuits Syst. 2023, 13, 212–224. [Google Scholar] [CrossRef]
  7. Leon, V.; Paparouni, T.; Petrongonas, E.; Soudris, D.; Pekmestzi, K. Improving Power of DSP and CNN Hardware Accelerators Using Approximate Floating-Point Multipliers. ACM Trans. Embed. Comput. Syst. 2021, 20, 1–21. [Google Scholar] [CrossRef]
  8. Yin, P.; Wang, C.; Waris, H.; Liu, W.; Han, Y.; Lombardi, F. Design and Analysis of Energy-Efficient Dynamic Range Approximate Logarithmic Multipliers for Machine Learning. IEEE Trans. Sustain. Comput. 2020, 6, 612–625. [Google Scholar] [CrossRef]
  9. Mileiko, S.; Bunnam, T.; Xia, F.; Shafik, R.; Yakovlev, A.; Das, S. Neural Network Design for Energy-Autonomous Artificial Intelligence Applications Using Temporal Encoding. Philos. Trans. R. Soc. A 2020, 378, 20190166. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, T.; Niu, Z.; Han, J. A Brief Review of Logarithmic Multiplier Designs. In IEEE Latin American Test Symposium (LATS); IEEE: Piscataway, NJ, USA, 2022; pp. 1–4. [Google Scholar]
  11. Patel, S.K.; Singhal, S.K. An Area–Delay Efficient Single-Precision Floating-Point Multiplier for VLSI Systems. Microproc. Microsyst. 2023, 98, 104798. [Google Scholar]
  12. Masadeh, M.; Hasan, O.; Tahar, S. Machine-Learning-Based Self-Tunable Design of Approximate Computing. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 800–813. [Google Scholar] [CrossRef]
  13. Nakhaee, F.; Kamal, M.; Afzali-Kusha, A.; Pedram, M.; Fakhraie, S.M.; Dorosti, H. Lifetime Improvement by Exploiting Aggressive Voltage Scaling During Runtime of Error-Resilient Applications. Integration 2018, 61, 29–38. [Google Scholar] [CrossRef]
  14. Venkatachalam, S.; Adams, E.; Lee, H.J.; Ko, S.-B. Design and Analysis of Area and Power Efficient Approximate Booth Multipliers. IEEE Trans. Comput. 2019, 68, 1697–1703. [Google Scholar] [CrossRef]
  15. Strollo, A.G.M.; De Caro, D.; Napoli, E.; Petra, N.; Di Meo, G. Low-Power Approximate Multiplier with Error Recovery Using a New Approximate 4–2 Compressor. In IEEE International Symposium on Circuits and Systems (ISCAS); IEEE: Piscataway, NJ, USA, 2020; pp. 1–4. [Google Scholar]
  16. Burke, D.; Jenkus, D.; Qiqieh, I.; Shafik, R.; Das, S.; Yakovlev, A. Significance-Driven Adaptive Approximate Computing for Energy-Efficient Image Processing Applications. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS); IEEE: Piscataway, NJ, USA, 2017; pp. 1–2. [Google Scholar]
  17. Qaseem, Q. Investigation of Magnetic Bistability for a Wider Bandwidth in Vibro-Impact Triboelectric Energy Harvesters. Master’s Thesis, University of Texas at Tyler, Tyler, TX, USA, 2023. [Google Scholar]
  18. Qiqieh, I.; Shafik, R.; Tarawneh, G.; Sokolov, D.; Yakovlev, A. Energy-Efficient Approximate Multiplier Design Using Bit Significance-Driven Logic Compression. In Design, Automation & Test in Europe; IEEE: Piscataway, NJ, USA, 2017; pp. 7–12. [Google Scholar]
  19. Al-Maaitah, K.; Qiqieh, I.; Soltan, A.; Yakovlev, A. Configurable-Accuracy Approximate Adder Design with Lightweight Fast Convergence Error Recovery Circuit. In IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT); IEEE: Piscataway, NJ, USA, 2017; pp. 1–6. [Google Scholar]
  20. Balasubramanian, P.; Mastorakis, N. Performance Comparison of Carry-Lookahead and Carry-Select Adders Based on Accurate and Approximate Additions. Electronics 2018, 7, 369. [Google Scholar] [CrossRef]
  21. Ellaithy, D.M.; El-Moursy, M.A.; Zaki, A.; Zekry, A. Dual-Channel Multiplier for Piecewise-Polynomial Function Evaluation for Low-Power 3-D Graphics. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 790–798. [Google Scholar] [CrossRef]
  22. Ghabeli, H.; Molahosseini, A.S.; Zarandi, A.A.E. New Multiply–Accumulate Circuits Based on Variable Latency Speculative Architectures with Asynchronous Data Paths. Majlesi J. Electr. Eng. 2022, 16, 41–53. [Google Scholar]
  23. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  24. Mrazek, V.; Sarwar, S.S.; Sekanina, L.; Vasicek, Z.; Roy, K. Design of Power-Efficient Approximate Multipliers for Approximate Artificial Neural Networks. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD); IEEE: Piscataway, NJ, USA, 2016; pp. 1–7. [Google Scholar]
  25. Ullah, S.; Rehman, S.; Shafique, M.; Kumar, A. High-Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2021, 41, 211–224. [Google Scholar] [CrossRef]
  26. Ultra96-V2 Development Board. Avnet Inc., Phoenix, AZ, USA, 2021. Available online: https://www.avnet.com/americas/products/avnet-boards/avnet-board-families/ultra96-v2/ (accessed on 13 December 2025).
  27. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 8026–8037. [Google Scholar]
  28. Mitcheson, P.D.; Yeatman, E.M.; Rao, G.K.; Holmes, A.S.; Green, T.C. Energy Harvesting from Human and Machine Motion for Wireless Electronic Devices. Proc. IEEE 2008, 96, 1457–1486. [Google Scholar] [CrossRef]
  29. Vakili, S.; Vaziri, M.; Zarei, A.; Langlois, J.-P. DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs Using Dynamic Reconfiguration. ACM Trans. Reconfigurable Technol. Syst. 2024, 18, 1–22. [Google Scholar] [CrossRef]
  30. Ma, J.; Reda, S. RUCA: Runtime Configurable Approximate Circuits with Self-Correcting Capability. In Asia and South Pacific Design Automation Conference (ASP-DAC); ACM: Kowloon, Hong Kong, China, 2023; pp. 140–145. [Google Scholar]
  31. Luo, H.; Cho, Y.; Demmel, J.W.; Kozachenko, I.; Li, X.S.; Liu, Y. Non-Smooth Bayesian Optimization in Tuning Scientific Applications. Int. J. High Perform. Comput. Appl. 2024, 38, 633–657. [Google Scholar] [CrossRef]
Figure 1. Bit-level logic compression in the configurable multiplier. Computation proceeds from left to right. HP-Mode enables full-precision operation, while MP-Mode and LP-Mode progressively suppress low-significance logic prior to final carry-propagate accumulation.
Figure 2. Alternative representation of the run-time configurable MAC architecture. Operands are applied at full precision, whereas internal multiplier stages are selectively activated or suppressed according to HP, MP, and LP operating modes. Energy-aware configuration control determines the active logic stages, while accumulation and activation remain exact and mode-independent.
Figure 3. Training accuracy versus training epochs for ANN implementations using HP-Mode, MP-Mode, and LP-Mode configurations on (a) Noisy XOR, (b) Binary IRIS, (c) MNIST, and (d) UCI Breast Cancer datasets.
Figure 4. Inference accuracy across datasets under different MAC operating modes. HP denotes full-precision operation, MP and LP apply moderate and aggressive internal logic compression, and MP→HP preserves output-layer precision.
Figure 5. Inference accuracy comparison on the MNIST dataset between the proposed configurable MAC architecture and the approximate multiplier approach of Mrazek et al. [24].
Table 1. Absolute ASIC synthesis results of the proposed configurable MAC under different operating modes (90 nm CMOS technology).
| MAC Mode | Area (μm²) | Delay (ns) | Power (μW) | PDP (fJ) |
|---|---|---|---|---|
| HP-Mode | 1644.4 | 2.66 | 70.2 | 186.7 |
| MP-Mode | 1098.6 | 2.88 | 43.6 | 125.6 |
| LP-Mode | 695.4 | 1.54 | 29.9 | 46.0 |
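The PDP column follows directly from the delay and power columns (ns × μW = fJ), and the abstract's roughly fourfold PDP improvement corresponds to the HP-to-LP ratio. A quick arithmetic check on the Table 1 values:

```python
# Table 1 values: (area um^2, delay ns, power uW, reported PDP fJ)
modes = {
    "HP-Mode": (1644.4, 2.66, 70.2, 186.7),
    "MP-Mode": (1098.6, 2.88, 43.6, 125.6),
    "LP-Mode": (695.4, 1.54, 29.9, 46.0),
}

for name, (_, delay_ns, power_uw, pdp_fj) in modes.items():
    computed = delay_ns * power_uw          # ns * uW = fJ
    # Reported PDP matches the product to within rounding.
    assert abs(computed - pdp_fj) < 0.5, name

# HP-to-LP PDP ratio, the ~4x improvement quoted in the abstract:
print(modes["HP-Mode"][3] / modes["LP-Mode"][3])  # ~4.06
```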
Table 2. ANN training configuration parameters. H denotes the number of hidden neurons (identical across HP-, MP-, and LP-Modes for fair comparison).
| Dataset | Network Topology | Epochs | Activation |
|---|---|---|---|
| Noisy XOR | 2–H–1 | 20 | Sigmoid |
| Binary IRIS | 4–H–1 | 100 | Sigmoid |
| UCI Breast Cancer | 9–H–1 | 100 | Sigmoid |
| MNIST | 784–H–10 | 100 | ReLU (hidden) / Softmax (output) |
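As a concrete reading of the 784–H–10 MNIST row, the forward pass can be sketched as follows. The paper leaves H symbolic, so H = 16 and the random initialization below are purely illustrative assumptions:

```python
import numpy as np

H = 16  # hypothetical hidden-layer width; the paper keeps H symbolic

rng = np.random.default_rng(0)
W1 = rng.standard_normal((784, H)) * 0.01   # input -> hidden weights
W2 = rng.standard_normal((H, 10)) * 0.01    # hidden -> output weights

def forward(x: np.ndarray) -> np.ndarray:
    """Forward pass for the 784-H-10 MNIST topology of Table 2."""
    h = np.maximum(x @ W1, 0.0)             # ReLU hidden activation
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # softmax output

probs = forward(rng.random((1, 784)))
assert probs.shape == (1, 10) and np.isclose(probs.sum(), 1.0)
```

Every `x @ W` product in this pass maps onto the MAC array, which is why the multiplier's operating mode directly sets the energy cost per inference.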
Table 3. Normalized ASIC results for different MAC operating modes (relative to HP-Mode baseline).
DatasetMode Δ A (%) Δ P (%) Δ PDP (%)
MP→HP Hybrid−34−38−50
Noisy XORMP-Mode−37−41−52
LP-Mode−65−61−81
MP→HP Hybrid−36−36−48
Binary IRISMP-Mode−37−38−51
LP-Mode−65−61−80
MP→HP Hybrid−36−37−49
MNISTMP-Mode−37−38−52
LP-Mode−65−60−80
MP→HP Hybrid−35−37−50
UCI Breast CancerMP-Mode−38−40−53
LP-Mode−66−62−81
Table 4. Comparison between the proposed work and [25] regarding FPGA inference accuracy and logic area reduction on the MNIST dataset.

| MAC | Area Reduction (%) | Inference Accuracy (%) |
|---|---|---|
| Proposed HP-Mode | – | 98.2 |
| Acc_app [25] | 4.02 | 64.1 |
| Proposed LP-Mode | 45.2 | 91.0 |
Table 5. Comparing the proposed configurable MAC architecture with existing configurable approximation MAC designs.
| Work | Platform | Approximation Mechanism | Run-Time Configurable | Energy-Aware Control | Interface Stability |
|---|---|---|---|---|---|
| Mrazek et al. [24] | ASIC | Truncation-based multiplier | No | No | No |
| Ullah et al. [25] | FPGA | Approximate adders / partial-product pruning | No | No | No |
| DyRecMul [29] | FPGA | LUT-based reconfigurable multiplier | Yes | No | Yes |
| RUCA [30] | ASIC | Quality-controlled circuit selection | Yes | No | Partial |
| AMG [31] | ASIC | Optimized approximate arithmetic generation | Limited | No | Partial |
| This work | ASIC/FPGA | Significance-aware internal logic suppression | Yes | Yes (EAMCA) | Yes |
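The energy-aware control distinguishing this work in Table 5 amounts to picking the most accurate mode whose cost fits the instantaneous energy budget. The policy below is a hypothetical sketch of such a controller, reusing the per-MAC PDP values from Table 1; it is not the paper's EAMCA logic:

```python
def select_mode(energy_budget_fj: float) -> str:
    """Pick the most accurate mode whose per-MAC PDP fits the budget.

    PDP costs come from Table 1; the greedy threshold policy itself is an
    assumed illustration of energy-aware control, not the paper's EAMCA.
    """
    pdp = [("HP-Mode", 186.7), ("MP-Mode", 125.6), ("LP-Mode", 46.0)]
    for mode, cost in pdp:              # ordered most- to least-accurate
        if cost <= energy_budget_fj:
            return mode
    return "LP-Mode"                    # degrade gracefully below all thresholds

assert select_mode(200.0) == "HP-Mode"
assert select_mode(130.0) == "MP-Mode"
assert select_mode(30.0) == "LP-Mode"
```

Because the operand interface is fixed across modes, a controller like this can switch precision between accumulations without reformatting operands or stalling the datapath.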