1. Introduction
The field of neuroengineering is experiencing significant growth, and its activities are characterized by several challenging tasks. One important area of focus within this field is the creation of neuromorphic hardware systems: electronic circuits specifically designed to implement artificial neural networks (ANNs) directly. The primary aim of developing such systems is to facilitate the application of neural computing models across a broad range of scenarios, each with its own set of requirements. One critical need addressed by neuromorphic hardware is the acceleration of ANN processing. For instance, many applications require high data throughput to support fast training and/or recall cycles, such as video processing [1]. Furthermore, some applications require real-time responses from ANNs, such as cryptography/encryption processing [2] and Internet of Things (IoT) environments, where data from sensor networks must be handled instantly [3]. In addition to processing speed, neuromorphic systems also meet the requirements of mobile applications. For example, robotic platforms can benefit greatly from their reduced physical size and energy consumption [4]. Telecommunications is another area where minimizing both hardware size and power requirements is important [5]. In this context, neuromorphic systems generally make it feasible to deploy neural network capabilities in compact, energy-efficient devices.
ASIC (Application-Specific Integrated Circuit) and FPGA (Field-Programmable Gate Array) technologies allow the creation of highly parallel computing architectures for implementing ANNs, which can achieve greater processing speeds than traditional Central Processing Units (CPUs) [3]. In addition, the low energy consumption of ASIC and FPGA circuits makes them well suited for developing embedded neuromorphic systems, offering an advantage over Graphics Processing Units (GPUs) and CPUs in this regard [6]. While ASIC chips are ideal for ultra-low-power applications like brain–computer interfaces, FPGA technology is attractive due to its shorter development cycles and lower costs for small- and medium-scale production [7]. FPGAs can thus achieve high data throughput while minimizing space and energy use, making them a suitable choice for the types of ANN applications described above.
However, implementing ANNs directly on hardware, particularly on FPGAs, also presents several challenges. These difficulties arise from the inherent complexity of neural computations and the constraints of hardware resources, and include the limited capacity of the chips and the numerical precision of calculations. The challenge of limited hardware resources stems from FPGAs having a finite number of logic blocks (commonly implemented as Look-Up Tables (LUTs)), memory, and Digital Signal Processing (DSP) units, which can restrict the size and complexity of the networks that can be implemented [8]. The second challenge involves numerical precision, as ANN hardware implementations often use fixed-point arithmetic instead of floating-point [9]. While this strategy saves resources, it can lead to precision loss. Therefore, numerical precision should be monitored in ANN hardware implementations to ensure it does not degrade model accuracy. One specific computation in which chip resource and precision issues arise is the implementation of neural activation functions like the sigmoid function [10]. The sigmoid function, σ(x), is defined as follows:

σ(x) = 1 / ( 1 + e^(−a(x − b)) ),

where a represents the slope of the curve and b represents the horizontal translation that adjusts the location of the curve's inflection point. As a neural activation function, the sigmoid is often used with a = 1 and b = 0 [11].
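For quick reference, the function can be evaluated in GNU Octave (the language used later for the boundary search); this is a minimal sketch in which a and b are the slope and translation parameters just defined:

```octave
% Parameterized sigmoid; a = 1 and b = 0 give the standard
% neural activation form used throughout this work.
sigma = @(x, a, b) 1 ./ (1 + exp(-a .* (x - b)));
sigma([-8 0 8], 1, 0)   % => approx. [0.00034 0.50000 0.99966]
```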
Directly implementing the sigmoid function in hardware is impractical due to the costly operations involved, such as exponentiation and division. The present work therefore focuses on developing a hardware-based approximation of the sigmoid function using a combination of first- and second-degree polynomial functions, depending on the region of the function domain. Based on the motivation outlined above, the proposal is intended to be resource-efficient, conserving chip area while maintaining numerical precision comparable to previous publications.
In this work, the approximation of the function by low-degree polynomials is expressed at the software level, using a hardware description language. It is then up to the FPGA vendor's hardware synthesis software, based on that description, to find the corresponding hardware implementation compatible with the characteristics of each specific chip model.
Some related works [12,13,14,15,16] restricted the values of operands or the boundaries between approximation regions to powers of two, expecting this to save hardware resources. However, this restriction may result in a loss of accuracy. Previous works, such as [16,17], have also adopted the strategy of approximating regions of the function domain using low-degree polynomials, but they adopted arbitrary values for the boundaries between those regions. In the present work, instead, a systematic search for the best values of those boundaries was carried out, in the hope of better identifying which regions of the function are most suitable for approximation by polynomials of which degrees.
A methodological challenge encountered in works on FPGA algorithm implementations is the wide variety of devices, with different types of internal hardware modules, produced by different companies, each with its own synthesis software. For instance, the experimental results in [17] were obtained for the Xilinx FPGA XC7V2000 with the Vivado design suite, while in [14] the authors used Quartus II for the FPGA device EP3C16F484C6 from the Cyclone III family. This often makes direct comparisons between results published in different studies impractical. The representation of operands and coefficients used in calculations also varies from study to study, in terms of the total number of bits and the numbers of bits in the integer and fractional parts (when fixed-point representation is used), potentially rendering comparisons meaningless.
The main contributions of this work are as follows:
- In the sigmoid approximation, it systematically searches for the best boundaries between the approximation regions of its domain, rather than establishing them arbitrarily. This strategy provided a low mean absolute approximation error (Table 3) when compared to related works. 
- It relaxes the common restriction of operands to powers of two, allowing them to assume values freely. Experimental results showed that this had no systematic negative impact on hardware resource usage and probably contributed to the quality of the results. 
- As a methodological contribution, the results are validated on multiple chip models from two major companies, making the comparisons with other works fairer and more general. Furthermore, all related works considered here were implemented in the VHDL language according to the details provided in the original articles, and their codes were made publicly available for reproduction by the scientific community and for the continuation of this research. 
- Also seeking fairness and a methodologically sound procedure, we advocate that comparisons with the results of related works be made by standardizing the number of bits of the operands, the FPGA synthesis programs, and the chip models used for implementation. 
The remainder of the paper is organized as follows: the next section presents the justification for this study and a review of the literature. Section 3 details the proposed method, and the final two sections present the results, discussion, comparisons with published works, and conclusions.
  2. Justification and Related Works
In the field of machine learning, the sigmoid function is a widely used nonlinear component, commonly serving as a neural activation function. It is frequently employed in shallow networks, such as Multilayer Perceptrons (MLPs), both during the operation phase (inference) and the training phase, where its derivative is used for error backpropagation and weight adjustment [11]. However, the sigmoid function is also essential in various other neural computing models beyond the MLP. For example, it regulates the retention, forgetting, and passing of information in recurrent neural networks, acting as the activation function for the gates in Long Short-Term Memory (LSTM) networks [18] and in Gated Recurrent Units (GRUs) [19,20]. While the Rectified Linear Unit (ReLU) and its variants, collectively known as the ReLU family, have gained popularity as activation functions in deep learning networks [21], the sigmoid function remains in use within deep models as well. Examples include modulating the flow of information in Structured Attention Networks (SANs), which incorporate structured attention mechanisms to focus on specific parts of the input data [22]. The sigmoid function is also employed to map internal values to a range between 0 and 1 in deep models, particularly for binary decisions or probabilistic outputs, as seen in convolutional neural networks like LeNet [23] and in autoencoders [24].
The continued interest in research on the sigmoid function is also reflected in the numerous studies published on its hardware implementation. Common approaches include piecewise approximation with linear or nonlinear segments, LUTs for direct storage of output values, Taylor series expansion, and Coordinate Rotation Digital Computer (CORDIC) methods. The work in [25] aims to achieve a low maximum approximation error using CORDIC-based techniques to iteratively calculate trigonometric and hyperbolic functions. However, this approach demands significant chip resources. Similarly, the authors in [26] employed Taylor's theorem and the Lagrange form for the approximation. That work also proposes reusing neuron circuitry in the approximation calculation to improve chip area efficiency.
For piecewise approximation of the sigmoid curve, the authors in [27] utilized exponential function approximation, which involves complex division operations. A number of studies propose multiple approximation schemes. For instance, the authors in [13] introduced an approach using three independent modules (Piecewise Linear Function (PWL), Taylor series, and the Newton–Raphson method) that can be combined to achieve the desired balance between accuracy and resource efficiency. The study in [28] presents two methods for approximation, one centralized and the other distributed, using reconfigurable computing environments to optimize the implementation.
Meanwhile, the authors in [16] proposed different schemes for sigmoid approximation, utilizing first- or second-order polynomial functions depending on the required accuracy and resource usage. Nevertheless, most works focus on different PWL approximation strategies. Reference [17] uses curvature analysis to segment the sigmoid function for PWL approximation, adjusting each segment based on maximum and average error metrics to balance precision and resource consumption. In another approach, the authors in [29] employed a first-order Taylor series expansion to define the intersection points of the PWL segments, while [30] leveraged a statistical probability distribution to determine the number of segments, fragmenting the function into unequal intervals to minimize the approximation error. Other approaches aim to reduce computational complexity. For example, the authors in [31] explored a technique that combines base-2 logarithmic functions with bit-shift operations for the approximation, resulting in reduced chip area.
The use of LUTs for direct storage of sigmoid output values is explored in [32], providing a straightforward method for FPGA implementation. On the other hand, the authors in [33] achieved high precision by using floating-point arithmetic, directly computing the sigmoid function with exponential function interpolation via the Maclaurin series and Padé polynomials. However, this implementation requires a large chip area.
A restriction adopted in several related works [12,13,14,15,16] was to only use multiplications by powers of 2, or operations with base-2 logarithms, instead of coefficients with generic values, so that shifts could replace ordinary multiplications and save hardware resources. However, this risks sacrificing accuracy.
  3. Materials and Methods
As in some of the related works, the scheme proposed here consists of splitting the sigmoid function domain into a small number of subdomains (Figure 1) and approximating the function in each subdomain by a low-degree polynomial, expecting in this way a low hardware resource usage. The polynomial degrees are chosen based on the shape of the function graph in each subdomain. For x in the function domain, if x < xinf, σ(x) is approximated by the constant zero. For x > xsup, σ(x) is approximated by the constant one. For xinf ≤ x < xmin and for xmax < x ≤ xsup, σ(x) is approximated by first-degree polynomials. For xmin ≤ x < xmed and for xmed ≤ x ≤ xmax, σ(x) is approximated by second-degree polynomials. However, unlike those works, the boundaries between the approximation regions are not defined arbitrarily, but are systematically sought, in an approximately continuous way, in order to minimize the average absolute error in the range [−8, 8]. This range was chosen here, as in [17], because outside it σ(x) is approximately constant, 0 or 1. As an alternative, the same algorithm structure searches for the boundaries that minimize the maximum (the peak) of the absolute error in the same domain.
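To make the region scheme concrete, the following GNU Octave sketch evaluates the piecewise approximation for a single input; the boundary and coefficient values here are hypothetical placeholders, while the actual constants are those produced by the search described below (Table 1):

```octave
1;  % script marker so Octave treats this file as a script

% Hypothetical boundaries and coefficients, for illustration only.
xinf = -6.0; xmin = -2.0; xmax = 2.0; xsup = 6.0;
p2 = [0.01 0.08];  p3 = [0.05 0.30 0.55];
p4 = [-0.05 0.30 0.45];  p5 = [0.01 0.92];

function y = sigmoid_pw(x, xinf, xmin, xmax, xsup, p2, p3, p4, p5)
  if x < xinf
    y = 0;                % constant region
  elseif x < xmin
    y = polyval(p2, x);   % first-degree region
  elseif x < 0
    y = polyval(p3, x);   % second-degree region (xmed = 0)
  elseif x <= xmax
    y = polyval(p4, x);   % second-degree region
  elseif x <= xsup
    y = polyval(p5, x);   % first-degree region
  else
    y = 1;                % constant region
  end
end

sigmoid_pw(0.5, xinf, xmin, xmax, xsup, p2, p3, p4, p5)  % example call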
Algorithm 1 was used to search for the best approximation boundaries. It was implemented in the GNU Octave software [34]. Initially, the search step dx is set to 0.005. Due to the symmetry of the sigmoid function about the point (0, 0.5), xmed is set to 0. The variables minAvgAbsError and minMaxAbsError register, respectively, the smallest average absolute error and the smallest peak absolute error over the whole domain known up to each iteration. Then, two nested for loops generate many combinations of values for xinf and xmin. The particular ranges used in those for loops in Algorithm 1 were chosen after experimentation with wider ranges and larger steps in a preliminary coarse experiment; those ranges are used in a final refinement. In lines 7 and 8, values symmetric about the origin are set for xmax and xsup with respect to xmin and xinf, respectively. Lines 9 to 14 build sequences of values for x in each of the approximation regions. Lines 15 to 20 build sequences containing the standard values of the sigmoid in each region. In lines 21 to 24, the Octave function polyfit is applied to the sequences belonging to each of the regions where polynomial approximations are adopted, using the least-squares error criterion. The third polyfit argument is the degree of the desired polynomial. The variables p2, p3, p4, and p5 receive the polynomial coefficients of the corresponding regions. Lines 25 to 31 calculate and concatenate the errors for each domain value. Line 32 calculates the average absolute error (avgAbsError) and line 33 the maximum absolute error (maxAbsError) for the current (xinf, xmin) combination. Lines 34 to 42 check whether the newly calculated avgAbsError is smaller than the smallest average absolute error known up to that moment, in which case the corresponding optimal conditions are registered. The same kind of check is performed for maxAbsError in lines 43 to 51. Thus, at the end of the execution, the best boundaries and the corresponding polynomial coefficients are known for both criteria. There was no concern about optimizing this algorithm for execution time, because it only needs to be executed once. We do not adopt here the restriction or preference for operations involving powers of 2, thus leaving the possible values of the polynomial coefficients and of the boundaries freer.
The application of Algorithm 1 to any other function (for example, the hyperbolic tangent, also a common neural activation function [11]) is straightforward: it suffices to replace the sigmoid calculation in lines 15 to 20 with that function. Depending on the function's characteristics, one may change the number of regions in the domain and the polynomial degrees, but the strategy remains the same.
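In the Octave sketch above, for instance, this adaptation amounts to one changed line (the variable name is ours):

```octave
% Swap the target function used to build the standard values.
sig = @(x) tanh(x);   % instead of @(x) 1 ./ (1 + exp(-x))
```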
The results obtained by executing Algorithm 1 are shown in Table 1. As expected, the average absolute error was slightly lower in the first implementation than in the second, while the maximum absolute error was slightly lower in the second implementation than in the first.
      
| Algorithm 1 Search for the best boundaries between approximation regions |

1:  dx = 0.005
2:  xmed = 0
3:  minAvgAbsError = ∞
4:  minMaxAbsError = ∞
5:  for xinf over its preselected range, in 0.01 steps do
6:      for xmin over its preselected range, in 0.01 steps do
7:          xmax = −xmin        ▹ xmax and xmin symmetrical about the origin
8:          xsup = −xinf        ▹ xsup and xinf symmetrical about the origin
        Build the ranges of x (sigmoid domain):
9:          x1 = sequence from −8 to xinf, with increments of dx
10:         x2 = sequence from xinf to xmin, with increments of dx
11:         x3 = sequence from xmin to xmed, with increments of dx
12:         x4 = sequence from xmed to xmax, with increments of dx
13:         x5 = sequence from xmax to xsup, with increments of dx
14:         x6 = sequence from xsup to 8, with increments of dx
        Build y standard per range:
15:         y1 = sequence built from 1 / ( 1 + exp( −x1 ) )
16:         y2 = sequence built from 1 / ( 1 + exp( −x2 ) )
17:         y3 = sequence built from 1 / ( 1 + exp( −x3 ) )
18:         y4 = sequence built from 1 / ( 1 + exp( −x4 ) )
19:         y5 = sequence built from 1 / ( 1 + exp( −x5 ) )
20:         y6 = sequence built from 1 / ( 1 + exp( −x6 ) )
        Polynomial coefficients calculation per region:
21:         p2 = polyfit( x2, y2, 1 )        ▹ first degree
22:         p3 = polyfit( x3, y3, 2 )        ▹ second degree
23:         p4 = polyfit( x4, y4, 2 )        ▹ second degree
24:         p5 = polyfit( x5, y5, 1 )        ▹ first degree
        Error calculation:
25:         e1 = zeros( 1, size( x1, 2 ) ) − y1
26:         e2 = polyval( p2, x2 ) − y2
27:         e3 = polyval( p3, x3 ) − y3
28:         e4 = polyval( p4, x4 ) − y4
29:         e5 = polyval( p5, x5 ) − y5
30:         e6 = ones( 1, size( x6, 2 ) ) − y6
31:         e = concatenation( e1, e2, e3, e4, e5, e6 )
32:         avgAbsError = mean( abs( e ) )
33:         maxAbsError = max( abs( e ) )
34:         if avgAbsError < minAvgAbsError then
35:             minAvgAbsError = avgAbsError
36:             bestAvgXinf = xinf
37:             bestAvgXmin = xmin
38:             bestAvgP2 = p2
39:             bestAvgP3 = p3
40:             bestAvgP4 = p4
41:             bestAvgP5 = p5
42:         end if
43:         if maxAbsError < minMaxAbsError then
44:             minMaxAbsError = maxAbsError
45:             bestMaxXinf = xinf
46:             bestMaxXmin = xmin
47:             bestMaxP2 = p2
48:             bestMaxP3 = p3
49:             bestMaxP4 = p4
50:             bestMaxP5 = p5
51:         end if
52:     end for
53: end for
The two sigmoid modules corresponding to Table 1 were implemented in the Very High Speed Integrated Circuits Hardware Description Language (VHDL) [35], in a behavioral style. Each implementation consists of two VHDL files: a package file, called pck_abs_avg_error.vhdl for the first module (in Appendix A.1), containing the constant definitions, and a sigmoid.vhdl file (in Appendix A.2), containing the approximation algorithm. Those constants are the polynomial coefficients and the boundary values. The package file for the second module is omitted here because only the constant values change between the two implementations. The sigmoid.vhdl file is the same for both.
The input x, the output y, the boundary values, and the polynomial coefficients are all represented as 16-bit fixed-point numbers, with 4 bits for the integer part and 12 bits for the fractional part, as in [17]. Thus, the real number 1 is represented as the binary number 0001000000000000, which corresponds to the decimal number 4096 (without considering the radix point position). Accordingly, each constant in Table 1 appears multiplied by 4096 in the VHDL file pck_abs_avg_error.vhdl. Two's complement representation was used to allow calculations with negative numbers, by using the type signed from the ieee.numeric_std VHDL package.
The sigmoid implementation in the file sigmoid.vhdl follows directly from Table 1 and from Figure 1. An x value is read at the rising edge of the clock (ck), and the output y is then calculated according to the approximation region in which x falls. The result of each multiplication contains a number of bits equal to the sum of the numbers of bits of its operands: 8 bits in the integer part and 24 bits in the fractional part. Thus, following each multiplication, the result is truncated by discarding the 4 leftmost and the 12 rightmost bits, producing a number that again respects the 16-bit convention. Due to the particular values of the multiplication operands used here (Table 1), the 4 leftmost bits of the result are always zeros for positive results and ones for negative results (in two's complement), allowing them to be discarded. Discarding the 12 least significant bits affects the results very little, because of their very small weights.
The proposed sigmoid calculation is not iterative but single-pass: the input argument x is presented, the approximation region is identified, the polynomial corresponding to that region is evaluated, and the result is output in a single pass (a single clock cycle), without further iterations.
To compare the hardware resource usage and accuracy of our implementations with those of the related works cited in Section 2, we attempted to reimplement them all in a uniform way, not necessarily identical to the originals, but seeking to reproduce their algorithms and sigmoid approximations as faithfully as possible, using a uniform fixed-point representation with the same number of bits (16 bits, of which 4 for the integer part and 12 for the fractional part), on uniform chip models, and using uniform synthesis programs. This was possible only for some of the related works [12,13,14,15,16,17], because only their implementations were described in sufficient detail. Each proposed sigmoid approximation was described in VHDL. The synthesis programs were Quartus Prime Version 24.1 std, Build 1077 SC Lite Edition, for the FPGA devices 5CGXFC7C7F23C8 and MAX-10 10M02DCU324A6G, and Vivado v2025.1 for the FPGA devices Virtex-7 xc7v585tffg1157-3, Spartan-7 xc7s75fgga484-1, and Zynq-7000 xc7z045iffg900-2L. Thus, programs were chosen from companies that are among the leaders in the FPGA field and, for each company, chips of varying levels of complexity were chosen. The block diagram obtained by Quartus is shown in Figure 2.
Regarding accuracy, the criterion most used in the cited works was the average absolute error of the function over the domain considered, which justifies its adoption here.
Each VHDL description used in the comparisons was simulated using the GHDL program [36], because it executes quickly and can record results in text files for later analysis. Each simulation run consisted of calculating and comparing the standard value of the sigmoid function (obtained with the GNU Octave program [34]) with the value generated by the system described in VHDL, for each of the 2^16 = 65,536 possible values in the x domain, and obtaining, at the end, the average of the absolute error. In other words, the tests were exhaustive, performed for all input values representable with 16 bits.
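The structure of this exhaustive check can be sketched in GNU Octave; approx_fn is a placeholder for the approximation under test (here set to the exact sigmoid, which trivially yields zero error):

```octave
% Every 16-bit Q4.12 input is compared against the double-precision sigmoid.
approx_fn = @(x) 1 ./ (1 + exp(-x));  % placeholder for the VHDL output
xs = (-32768:32767) / 4096;           % all 65,536 representable inputs
ref = 1 ./ (1 + exp(-xs));            % standard values (GNU Octave)
avgAbsError = mean(abs(approx_fn(xs) - ref))
```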
  4. Results
Table 2 shows the synthesis results regarding the use of hardware resources in the FPGAs, for each sigmoid approximation: six approximations from related works and our two approximations, one minimizing the average error ("ours avg.") and one minimizing the error peak ("ours max."). The table has five parts, one for each software–device combination, and shows the FPGA resources used by each implementation as reported by the corresponding synthesis software. The kinds of resources vary depending on the device type; for instance, only some devices have DSP blocks. The last column shows the total available number of each resource type in each FPGA device, and the percentage of each resource type used relative to its availability is also indicated. The reports from all implementations indicated that 16 registers (due to the 16-bit operands) and 33 pins (16 input bits, 16 output bits, and a clock pin) were used; these results were therefore omitted from Table 2, keeping only those relevant for comparison purposes.
Comparing resource usage between implementations is not straightforward, because the resources are of different types and available in different quantities. However, at first glance, one can see in Table 2 that our implementations are neither systematically better nor systematically worse than the others in terms of resource usage, even though this criterion did not guide their creation. Specifically, only the implementation in [12] always matches ours or is advantageous.
Table 3 shows the average absolute error for all sigmoid implementations, obtained for 65,536 equally spaced values in the domain. It should be noted that, unlike those in Table 2, these results do not depend on the chip model on which the implementation was made, but only on the VHDL descriptions.
In Table 3, one can see that only the implementation from [13] achieved a lower error than ours. However, Table 2 shows that it used more resources of most types compared to our implementations. Another interesting result is that our second implementation, which aimed only at the minimum error peak, surpassed almost all the others in the average absolute error criterion. The implementation in [12], which is advantageous compared to ours concerning resource usage, has a much larger average absolute error.
The absolute error over the function domain is shown in Figure 3 for our first implementation and in Figure 4 for the second.
The maximum absolute errors of our implementations were 0.0068181 for the first and 0.0071184 for the second. These are larger than those obtained before the VHDL implementation (Table 1), surely due to the use of the 16-bit fixed-point representation. Another effect is that the maximum error of the first implementation is now lower than that of the second. However, in some cases, the second implementation is advantageous relative to the first in resource utilization (Table 2).
After obtaining these results, new experiments were carried out to observe the effect of the number of bits in the representation of the operands on the average absolute error and on the use of hardware resources.
We then also implemented our approximation using 8-bit and 32-bit fixed-point representations, always keeping 4 bits for the integer part and the rest for the fractional part. We kept 4 bits in the integer part because they were enough to cover the range of values from −8 to 8 used in the experiments. The average absolute error values for these cases and for the previous case are given in Table 4.
It can be seen that the mean absolute error for both the 16-bit and 32-bit fixed-point implementations was equal to that obtained when approximating the sigmoid by low-degree polynomials on a general-purpose computer with 64-bit floating point, prior to the VHDL description (Table 3). Therefore, for both 16 and 32 bits, the mean error value is likely due solely to the approximation of the sigmoid by low-degree polynomials, not to those bit widths or to the fixed-point implementation.
The hardware resource usage as a function of the number of bits in the operands, for two chip models, is given in Table 5. It can be seen that resource usage grew much more than linearly with the number of bits. This must be taken into account in a design, especially in a case like this, where the error value saturates beyond a certain number of bits.
Table 6 presents the data path delay results for some chip models, for the related works and for our implementation aimed at minimizing the mean absolute error. One can see that the delay of our implementation does not differ much from most of the others, but is much lower than that of the proposal in [13], which presented the lowest average absolute error.
   5. Discussion and Conclusions
Approximating the sigmoid for FPGA implementation by dividing its domain into a few subdomains, approximating the function in each subdomain by a low-degree polynomial, and systematically searching for the best boundaries between the subdomains allowed us to obtain a good compromise between hardware resource usage and the average absolute approximation error over the interval from −8 to 8 of the function domain (Table 3). The method is thus a good alternative to preexisting solutions.
The fact that we did not restrict operands or boundaries between approximation regions to powers of two did not prevent relatively good results from being obtained, not only in error value but also in resource usage, when compared to related works. Therefore, the usefulness of such restrictions should be reconsidered.
The method proposed here aimed at a low value for the average absolute error. On this criterion, the work in [13] obtained better results than ours (see Table 3), using a more elaborate method that includes a Newton–Raphson approximation, but requiring greater use of hardware resources in most cases (for instance, in Logic Utilization, 91 versus 25 for the Cyclone V device 5CGXFC7C7F23C8). It may therefore be preferred when the importance of the error far outweighs that of resource usage.
A limitation of this work is that it attempted to reproduce the implementations described in other studies, and this reproduction might have differed from the originals in ways that could have influenced the results. Furthermore, some of those proposals allowed for parameter adjustments, whereas in this reproduction, to limit the length of the work, we used only a typical version of what was described in each article.
As future work, a complete neural network can be implemented in FPGA using the sigmoid implementation proposed here in each neuron, comparing its performance with a version implemented on a general-purpose computer.
One can also investigate the implementation of other activation functions, such as the hyperbolic tangent, following the principles applied here to the sigmoid.