1. Introduction
Due to the active use of artificial neural networks (ANNs) in real-time systems and the availability of a wide class of typical nonlinearities as activation functions, there is ongoing and relevant research interest in accelerating information processing. In this regard, various methods are used to speed up the computation of activation functions. For example, [1,2] are devoted to the special organization of “fast” neurons and neural networks. A detailed review of modern approaches to accelerating activation functions is provided in [3].
The CORDIC (Coordinate Rotation Digital Computer) family reduces the computation of complex functions to a set of simple iterations of algebraic addition and shift. It is often used as the basic algorithm for solving problems in onboard aerospace systems, providing a tradeoff between the accuracy and speed of computation with low memory costs [4,5]. The advantage of this implementation is that its accuracy is related to the number of iterations used, each of which adds one correct digit of the result. Modern applications of CORDIC algorithms that are useful on board UAVs mainly include fast and discrete Fourier transforms, discrete sine and cosine transforms, linear algebra algorithms, digital filtering, and image-processing algorithms. Despite their popularity, CORDIC algorithms have some disadvantages that prevent their effective use in parallel systems, including the iterative nature of calculations and the necessity of correcting the result. The solution may lie in constructing neurons with activation functions based on bit-parallel information-processing circuits using specialized computers with limited operand digit capacity [6].
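As a concrete illustration of the add-and-shift principle, the rotation mode of CORDIC can be sketched as follows. This is a minimal floating-point sketch; a hardware version would use fixed-point operands and a stored table of arctangent constants, and the function name is illustrative.

```python
import math

def cordic_sin_cos(theta, n_iter=32):
    """Rotation-mode CORDIC: each iteration needs only add/subtract and a shift."""
    angles = [math.atan(2.0 ** -i) for i in range(n_iter)]
    # Cumulative scale factor K = prod 1/sqrt(1 + 2^-2i), applied once at the end.
    K = 1.0
    for i in range(n_iter):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0, 0.0, theta
    for i in range(n_iter):
        d = 1.0 if z >= 0.0 else -1.0         # direction of the micro-rotation
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y * K, x * K                       # (sin(theta), cos(theta))

s, c = cordic_sin_cos(0.5)
```

Each of the `n_iter` iterations resolves roughly one more binary digit of the result, which is the accuracy/speed tradeoff mentioned above.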
Bit-parallel computing is a group of methods that calculate mathematical functions with high accuracy in floating-point formats. The problem with such calculations is multifaceted due to the diversity and specificity of the tasks solved on their basis. Bit-parallel computing is used, for example, in the onboard navigation and control systems of autonomous aerial vehicles. All existing approaches aim to increase the length of the processed binary operand, thus improving the computation speed. Of considerable interest is a series of works aimed at increasing the speed of solving various comparison problems, establishing correspondence between streams, and performing arithmetic operations on bit vectors, including the simultaneous processing of matrix columns [7,8,9,10].
The problem of achieving a high level of instruction parallelism in string-matching algorithms is considered in [11]. Two variants of the bit-parallel wide-window algorithm are presented, using parallel computation relying on the SIMD (Single Instruction Multiple Data) paradigm. As the experimental results show, this technique provides increased speed by doubling the size of window shifts.
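For intuition, the underlying bit-parallel matching idea, of which wide-window SIMD algorithms are refinements, can be sketched with the classic Shift-Or scheme. This is a generic illustration, not the specific algorithm of [11]:

```python
def shift_or_search(text, pattern):
    """Classic Shift-Or matching: the whole search state lives in one bit vector,
    so a single integer operation advances the automaton for every pattern position."""
    m = len(pattern)
    # mask[c] has bit i cleared iff pattern[i] == c
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, ~0) & ~(1 << i)
    state, hits = ~0, []
    for j, ch in enumerate(text):
        state = (state << 1) | mask.get(ch, ~0)
        if state & (1 << (m - 1)) == 0:       # bit m-1 clear => match ending at j
            hits.append(j - m + 1)
    return hits
```

One machine word thus tracks all partial matches simultaneously, which is the essence of bit-level parallelism exploited by the SIMD variants.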
Various architectural solutions have been proposed to speed up operations under computational resource constraints. In [12], a bit-parallel computing-in-memory architecture based on SRAM 6T cells is presented to support reconfigurable bit precision, in which bit-parallel complex computations become possible by iterating low-latency operations. The proposed architecture provides 0.68 and 8.09 TOPS/W performance for parallel addition and multiplication, respectively. A specialized architecture designed for the parallel computation of bit patterns is proposed in [13]. That research assumes that all operations on multi-bit data values are performed bit-serially, while each operation is SIMD-parallel across the data, so multi-bit operations complete in a number of clock cycles equal to the operand bit-width while using only a small number of gates per clock cycle.
Bit-analog computations of mathematical and logical functions are of interest. Thus, in [14], an alternative approach to a parallel solution is considered. Here, a bit stream is processed through digital logic to satisfy the required algorithm. Using elementary digital gates, classical elements such as integrators and differentiators required to implement a PID controller on an FPGA can be constructed. It is also worth noting the research on the bit-parallel modeling of combinational logic gates [15]. These results allow us to estimate the probability of the logical masking of a random circuit fault. The methods are compared in terms of accuracy and time cost using testing results for benchmark circuits.
The analog computation technique proposed by G.E. Pukhov is also of great interest. In [16], a method based on the differential Taylor transform (DT) was proposed and applied to the basic concepts and solutions of various mathematical problems, as well as to the creation of physical models. The results obtained using this method can be presented both numerically, as spectra, and analytically, as an approximate sequential or functional dependence. The DT method is universal, allowing one to recover the original function in forms other than the Taylor expansion. In [17], differential transformations of functions and equations for the construction of linear and nonlinear circuits are presented. The authors of [18] consider the differential transform method for solving various types of differential equations in fluid mechanics and heat transfer. A method of approximate differential transformation and its combination with Laplace transforms and Padé approximation for finding exact solutions of linear and nonlinear differential and integro-differential equations is presented in [19].
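To make the DT idea concrete, the following sketch (our illustration, not Pukhov's original notation) computes the discrete spectrum of the solution of y' = y, y(0) = 1 and then restores the original function from that spectrum:

```python
def dt_solve_exponential(n_terms=16):
    """Differential transform of the test equation y' = y, y(0) = 1.

    The DT spectrum Y(k) = y^(k)(0)/k! turns the ODE into the algebraic
    recurrence (k + 1) * Y(k + 1) = Y(k)."""
    Y = [1.0]                      # Y(0) = y(0)
    for k in range(n_terms - 1):
        Y.append(Y[k] / (k + 1))
    return Y

def dt_evaluate(Y, x):
    """Inverse transform: restore y(x) as the truncated series sum of Y(k) * x^k."""
    return sum(c * x ** k for k, c in enumerate(Y))

y1 = dt_evaluate(dt_solve_exponential(), 1.0)   # approximates e = y(1)
```

The spectrum is obtained purely algebraically, and the numerical/analytical duality mentioned above corresponds to using `Y` directly or summing the series.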
Let us proceed to a direct analysis of the work of G.E. Pukhov with respect to bit-parallel calculus [6]. The basis for constructing bit-parallel computing circuits is the concepts of a bit vector and a bit matrix used in bit-analog schemes for inverse function calculation and square root extraction. For high-precision bit-parallel computing in the floating-point format, the modular–positional data representation format is used. It enables parallelizing arithmetic operations down to the level of individual digits of multi-digit mantissas. Mapping the result of a functional transformation into a bit-parallel circuit involves representing each digit of a number in the form of a bit formula. Analog and digital bit-parallel computing is used, for example, in onboard navigation and control systems, where it is important to ensure real-time operation under limited resources.
Due to the limited number of mathematical functions representable by G.E. Pukhov’s method, we turn to another theory for constructing a variety of mathematical functions based on the CORDIC family of algorithms proposed by Jack Volder [20]. The algorithms efficiently calculate arithmetic, trigonometric, and hyperbolic functions and are currently used in various applications [21], such as image processing and communication. Detailed studies of CORDIC functions were performed by V.D. Baykov [22,23]. CORDIC algorithms reduce the computation of complex functions to iterative procedures containing simple addition and shift operations [24,25,26]. A comparison and evaluation of methods for calculating the basic functions “square root” and “inverse function” is provided in [27]. The transition to bit-parallel computational schemes can be carried out formally by expanding the iterations. This study addresses the problem of the simultaneous derivation of Volder operators by comparing the obtained bitwise representations with similar ones proposed by G.E. Pukhov.
In [28], a low-latency CORDIC algorithm is proposed to accelerate the computation of arctangent functions. The authors state that the novel method can effectively reduce the number of iterations through dedicated pre-rotation and comparison processes. The CORDIC IP Core User’s Guide [29] provides a block diagram of the CORDIC arithmetic unit, which is configurable and supports several functions, including “rotation”. The device supports a parallel mode, in which the output data are processed in one clock cycle, and a serial mode, in which the output data are calculated over several clock cycles. The IP core supports variable precision and several rounding algorithms.
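The standard vectoring mode that such arctangent accelerators optimize can be sketched as follows (a floating-point sketch without the pre-rotation step of [28]; valid for x0 > 0):

```python
import math

def cordic_atan(y0, x0, n_iter=32):
    """Vectoring-mode CORDIC for x0 > 0: rotate the vector (x0, y0) toward the
    x-axis; the accumulated micro-rotation angles converge to atan(y0/x0)."""
    x, y, z = x0, y0, 0.0
    for i in range(n_iter):
        d = -1.0 if y >= 0.0 else 1.0         # rotate so that y is driven to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return z
```

Because the angle accumulator z is unaffected by the CORDIC gain, no final scaling is needed for the arctangent.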
In the present study, we solve the problem of constructing bit-parallel computing circuits for the efficient implementation of neuron activation functions based on universal CORDIC algorithms. Such circuits can be constructed by extending the iterations and, on this basis, implementing computationally efficient activation functions of the “sigmoid” [30] and “s-parabola” [31] types. The circuits provide the necessary parallelism and expansion for the functional capabilities of onboard computers and can be obtained within the framework of the approaches proposed in [32,33]. An example of the construction of a “fast neuron” based on the calculation of the activation function using a tabular-algorithmic method with a focus on performing parallel representation and group summation operations is provided in [34].
To speed up computations in neural networks, specialized devices (ANN accelerators) are used [35]. ANN accelerators speed up two basic operations: multiplication and accumulation (MAC) and matrix–vector multiplication (MVM). The devices can be built using field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). The authors of [36] propose a scalable convolutional neural network (CNN) accelerator architecture for devices with limited computing resources. It can be deployed on several small FPGAs instead of one large FPGA for better control over parallel processing. A low-power CNN accelerator using an FPGA designed for mobile and resource-constrained devices is proposed in [37]. Accelerators are used in various areas where neural network-based algorithms are employed: image and video processing, autonomous systems, and robotics [38].
Let us now turn to some modern development directions in the subject area. The problems in constructing fast onboard computers are of great interest and are considered, for example, in [39]. An analysis of the current state of the CubeSat Command and Data Handling (C&DH) subsystem is provided, covering both hardware components and flight software (FSW) development frameworks. The problems of applying modern neural network models in resource-constrained environments and accelerating inference time are considered in [40]. That study provides a comprehensive review of current advances in pruning techniques as a popular research direction in neural network compression.
A promising direction for further research is the construction of fuzzy neural networks (FNNs), which can be achieved using neural architecture search (NAS). Such networks provide high classification accuracy and good performance in the presence of uncertainties [41].
In this study, we rely on the creation of specialized computing structures without being tied to specific technologies.
Subsequent sections are organized as follows:
Section 2 formulates and solves the problem of the bit-parallel computation of the sigmoid activation function of a neural network using the CORDIC and Pukhov methods. This function is constructed based on the corresponding representations of the inverse function (Section 2.1, Section 2.2 and Section 2.3) and the exponent (Section 2.4). Two statements on the parallelism of the Volder operators for these functions are proven.
Section 3 is devoted to constructing a bit-parallel scheme to compute a new activation function called “s-parabola”. For its implementation, bit-parallel schemes for extracting a square root using the CORDIC method (Section 3.1) and Pukhov’s method (Section 3.2) are presented. A statement on the parallel computability of the Volder operators for implementing this function is proved. A parallel scheme for calculating the Volder operators is also provided (Section 3.3).
Section 4 presents theoretical estimates of the computational complexity of activation functions used in neural networks. Estimates of the time complexity of bit-parallel schemes for computing the sigmoid and s-parabola activation functions for different operand bit-widths are provided. The estimates consider the number of operations and the algebraic summation method used. Performance studies of the software implementation of the bit-parallel scheme using the example of the s-parabola are carried out.
The final section contains the main conclusions and prospects for the practical application of bit-parallel computational circuits for the fast calculation of activation functions as a part of convolutional neural networks. We recommend that the algorithms be included in the mathematical support of onboard computing systems, taking into account the limitations of their computing resources.
2. Bit-Parallel Computation of the Activation Function of the “Sigmoid” Type
In practical applications, the limited set of constants used in the CORDIC “rotate” and “vector” procedures and in the “logarithm” operation can be calculated in advance and stored in a database/knowledge base. Having simultaneous access to all of the Volder operators in the “rotate”, “vector”, and “logarithm” operations provides broad opportunities for organizing bit-parallel computations. A typical approach is to move away from the sequential nature of computations by expanding the recurrent relations in time. To this, one should add the possibilities of bit-parallel circuits for finding the inverse function and calculating the square root, obtained using G.E. Pukhov’s methodology.
The sigmoid is a nonlinear function with a parameter that outputs a real number in the range of 0 to 1: as the argument tends to plus infinity, the output tends to 1, and as the argument tends to minus infinity, the output tends to 0 [6]. The sigmoid is characterized by the saturation effect, which means that the network is poorly trained at the boundaries. One disadvantage of such neurons and ANNs is low computation speed. Let us consider the sigmoid (logistic) function of the following form:
σ(x) = 1/(1 + e^(−ax)).  (1)
In binary positional notation, the result can be represented as a convergent sum. In practice, the upper limit of the sum is limited by the specific bit-width of the operands and the length of the digit grid of the computing device. We will consider a function whose result can be written in the usual positional form.
The CORDIC family of computational algorithms can find one correct digit of the result at each iteration. Using the shift operation, addition, subtraction, and limited memory for storing constants, the algorithms implement the necessary functions [23]. The technology for performing CORDIC operations is based on the fundamental possibility of representing the result as a sum of terms formed by Volder operators (hereinafter referred to as operators) for cases of alternating iterations. The number of terms in the sum is associated with the number of iterations performed, which is determined by the bit-width of the operands. The operators are obtained as a result of the corresponding iterative procedure, which forms the first stage of function computation. By analogy with [6], if we express the operators through the digits of the argument, we obtain a sum of the following form:
Since, in the general case, these quantities are not Volder operators, we will call them pseudo-operators.
The advantage of implementing these activation functions is that accuracy is tied to the number of iterations, each of which adds one correct digit of the result. The adjustable accuracy of the calculations, related to the number of iterations performed, allows us to choose a strategy depending on the time resource. The disadvantage of Formula (2) is the impossibility of calculating the Volder operators in parallel, which predetermines the sequential nature of the algorithms.
Let us consider the possibility of overcoming this disadvantage based on bit calculus [6]. A vector whose components are the digits of a binary number is called a normalized bit vector. The task of mapping the result of a functional transformation into a bit-parallel circuit involves representing each digit of the result in the form of a formula over the digits of the argument:
In several cases typical of CORDIC algorithms, the digits of a mathematical function can be represented as a function of variables associated with the digits of the argument, for example, by the Volder operators. Such schemes are also bit-parallel, provided that all operators are simultaneously available. Implementing the sigmoid in accordance with (1) is based on schemes for the bit-parallel representation of the inverse function and the exponential function.
2.1. Bitwise Representation of the Inverse Function Using the CORDIC Method
The argument of the function must be presented in the form of the corresponding product. Here, the quantities called “Volder operators” are calculated by the following formula:
Given the operators (3), the result of the operation is obtained as follows:
By successively expanding the product (4), we obtain
Scheme (5) expresses the result of the operation through Volder operators, which can only be calculated sequentially. Of interest is the possibility of the parallel calculation of these operators.
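To illustrate the iterative scheme that (5) expands, a multiplicative-normalization sketch of the inverse function is given below. The sign rule used here is a plausible stand-in for condition (3), not necessarily Volder's exact formulation, and the argument is assumed normalized to [0.5, 1):

```python
def cordic_reciprocal(a, n_iter=32):
    """Multiplicative normalization for 1/a with a normalized to [0.5, 1):
    drive p = a * prod(1 + s_i * 2^-i) toward 1; the same product, accumulated
    in r, then approximates 1/a.  Each factor costs one shift and one add."""
    p, r = a, 1.0
    for i in range(1, n_iter + 1):
        s = 1.0 if p < 1.0 else -1.0          # assumed sign rule for the operators
        f = 1.0 + s * 2.0 ** -i
        p *= f
        r *= f
    return r
```

The signs chosen at each step correspond to the operators whose sequential generation the bit-parallel schemes below aim to avoid.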
2.2. Bitwise Representation of the Inverse Function Using G.E. Pukhov’s Method
Let us consider the inverse function of a positive binary number represented in normalized form. The number consists of a mantissa and an order; each digit of the mantissa has its own binary weight, and the number of digits is determined by the length of the digit grid.
The result of finding the reciprocal value is likewise presented in normalized form with its own mantissa. The relationship between the mantissas of the argument and the result is set using a bit expression involving a bit vector, a square bit matrix, and the inverse bit matrix [6].
The bit matrix is formed by bit vectors, and all known matrix inversion methods can be used to obtain the inverse bit matrix. Thus, we can obtain the following formulas:
Without loss of generality, let us consider the order of the calculations for a small, fixed bit-width. As a result of successive substitutions, we obtain
Here, each digit of the result is expressed through the corresponding digit coefficients of the argument. To make the form of system (7) suitable for performing the group summation of operands, the corresponding elements should be moved from one digit to another, taking into account their weights. Thus, we have
The computing scheme is bit-parallel. In the general case, calculating each digit through the digits of the argument leads to values that do not belong to the set {0, 1} and, therefore, requires inter-bit alignment to form the binary code of the result. The summation of all elements with respect to their weights must then be performed to obtain the final result.
From the system of bit coefficients (8), two groups of binary operands can be formed in accordance with their signs. The prepared binary numbers must be summed to obtain the final result.
2.3. Bit-Parallel Computation of Volder Operators for the Inverse Function
Statement 1. The Volder operators for the inverse function can be calculated in parallel based on the equivalence of bit circuits (5) and (8).
Proof of Statement 1. Computational Schemes (5) and (8) are equivalent since they lead to the same values of the inverse function. The validity of the statement follows from the identity obtained by bitwise equating the two representations.
By bitwise equating Relations (5) and (8), we can obtain the Volder operators:
In accordance with Statement 1, scheme (9) demonstrates the fundamental possibility of obtaining the values of the pseudo-operators directly from the digits of the binary argument. Substituting the values of these operators into (8) provides the correct solution to the problem of calculating the inverse function in bit-parallel form.
2.4. Bitwise Representation of the Exponential Function
When calculating the exponential function with a given base, we can assume that its argument is a multi-digit binary number. Therefore, the following representation of the result as a product is valid:
Considering that the argument is a vector consisting of zeros and ones, we obtain
For definiteness, let us fix the base and the bit-width of the argument. Substituting the set of constants, we obtain the formulas
From Expression (11), we derive the bit vectors:
In accordance with (10), we obtain the result in the form of the following scheme:
The accuracy can be increased by moving to a higher bit-width. This approach can be applied to any other base.
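The product scheme above can be sketched as follows. This is our illustration with base e and a 3-digit fractional argument; the constants c_i = base^(2^-i) play the role of the precomputed set substituted above:

```python
import math

def exp_by_bits(x_bits, base=math.e):
    """base**x for a binary fraction x = sum x_i * 2^-(i+1), x_i in {0, 1}:
    the result is the product of the precomputed constants
    c_i = base**(2^-(i+1)) selected by the 1-digits of the argument."""
    consts = [base ** (2.0 ** -(i + 1)) for i in range(len(x_bits))]
    r = 1.0
    for bit, c in zip(x_bits, consts):
        if bit:
            r *= c
    return r

y = exp_by_bits([1, 0, 1])        # x = 0.101 in binary = 0.625
```

Since the constants are stored rather than computed, only the selection and multiplication remain at run time, and all digit selections are independent of one another.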
According to the CORDIC approach, the function argument is represented by
where the signs of the operators are found from the condition
and the result is obtained in the following form:
Expanding this expression, we obtain the following scheme for the bitwise representation:
By implementing the calculation scheme for an 8-bit argument, we obtain the results presented in Table 1.
In CORDIC, the simultaneous retrieval of operator values is a problem that can be eliminated based on the equivalence of bit schemes.
2.5. Parallel Computation of Volder Operators for the Exponential Function
Statement 2. The Volder operators for the exponential function can be calculated in parallel based on the equivalence of bit circuits (14) and (16).
Proof of Statement 2. By equating the expressions with the corresponding coefficients in Formulas (14) and (16), we express the operators through the digits of the argument:
This result is useful only from the perspective of demonstrating the fundamental possibility of obtaining bit-parallel circuits based on CORDIC.
Let us consider the calculation of the function whose result digits form a binary number. To obtain this function in bitwise form, we use the Volder relations; for a function of this form, the following recurrence relations are valid [23]:
Input values:
Result:
However, in accordance with [6], we have
Considering the given relations, the result of calculating the function in bit-parallel form has the structure shown in Table 2.
Thus, to perform the bit-parallel representation of the function, it is necessary to do the following:
For each digit of the argument, calculate the corresponding set of operators for the given base;
Create tables of logarithms, in which the sign within each bracket is determined by the sign of the corresponding operator;
Calculate the function using an adder that implements the summation operation.
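The three steps above can be sketched as follows. This is a simplified illustration for an argument normalized to [0.5, 1); the sign rule and the two log tables are assumptions consistent with the general CORDIC logarithm scheme, not the paper's exact formulas:

```python
import math

def cordic_ln(a, n_iter=40):
    """ln(a) for a normalized to [0.5, 1): normalize a * prod(1 + s_i * 2^-i)
    toward 1 while summing the stored table values ln(1 + s_i * 2^-i); then
    ln(a) = -sum.  A general argument m * 2^e reduces to ln(m) + e * ln(2)."""
    # Step 2: precomputed tables of logarithms for both operator signs.
    table_p = [math.log(1.0 + 2.0 ** -i) for i in range(1, n_iter + 1)]
    table_m = [math.log(1.0 - 2.0 ** -i) for i in range(1, n_iter + 1)]
    p, acc = a, 0.0
    for i in range(1, n_iter + 1):
        if p < 1.0:                       # Step 1: choose the operator sign
            p *= 1.0 + 2.0 ** -i
            acc += table_p[i - 1]         # Step 3: accumulate via the adder
        else:
            p *= 1.0 - 2.0 ** -i
            acc += table_m[i - 1]
    return -acc
```

Once the operator signs are known in parallel, the loop body reduces to selecting table entries and performing one group summation.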
The inverse function is implemented as follows: the bit-parallel form (see Table 1) is modified by adding one. For this, we use an approximate bitwise representation of the digit one (see Table 3).
In accordance with Table 3, the digits for bitwise representation (8) are formed. As noted above, to move to the normalized binary representation, it is necessary to perform the summation operation.
Finally, the inverse function is calculated according to scheme (8), and then expression (1) is evaluated.
4. Time Complexity Estimation for the Implementation of Bit-Parallel Circuits
Based on the developed bit-parallel circuits, the structure of a specialized processor unit is proposed, the operational part of which is shown in Figure 1.
The main emphasis is placed on tabular algorithmic calculations, with a significant role given to logic arrays that store pre-prepared information. Each operation is implemented in the processing unit by appropriately adjusting its structure. The structure is programmed by switching (assigning) individual blocks using control signals, which are formed after the control device decodes the command codes. The use of pipelined and bit-parallel algorithms allows us to build calculators on a unified methodological basis, with a simple and convenient mathematical apparatus that combines information-processing cycles involving the group summation of operands with effective hardware support. The basic set of commands is presented in Table 8.
To select the required function, it is necessary to assign the path of information flow in the specified structure using control signals. The processor element hardware supports a specialized language, with the help of which the user can perform structured programming of algorithms.
The calculation of mathematical functions based on the bit-parallel representation is reduced to the algebraic summation of binary operands in accordance with their signs and weights. The number of terms to be summed depends on the function being computed and grows with the bit-width of the operands.
For bit-parallel circuits, a special data representation is used in the form of a table, which contains operands (columns) prepared for algebraic addition. An example of such a logical table for calculating the inverse function is presented in Table 9.
The table allows us to implement bit-parallel circuits based on FPGA and ASIC technologies. At this stage of the study, we limit ourselves to assessing the possible performance of such an approach.
Estimates of the computational complexity of various activation functions depending on the operand bit-width, but without using bit-parallel computations, are presented in Table 10.
Table 10 shows that the theoretical estimation of computational acceleration using the s-parabola function compared with using sigmoid and swish can be more than 5.6-fold. At the same time, the s-parabola is significantly inferior to the ReLU function in terms of average performance (approximately 8-fold).
In [44], a theorem is formulated proving that the highest possible length of the sum of n m-bit integer numbers is equal to m + ⌈log₂n⌉.
Using the cascade scheme of connecting parallel adders, the algebraic summation of n operands can theoretically be performed in ⌈log₂n⌉ clock cycles, whereas sequential summation takes n − 1 cycles. The clock cycle here is equal to the time of one summation operation. Estimates of the complexity of calculating various functions according to Formulas (22) and (25) are provided in Table 11.
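The cascade estimate can be checked with a short simulation of a balanced tree of two-input adders (an idealized model that ignores the growth of operand width from carries):

```python
def tree_sum_depth(n):
    """Clock cycles for summing n operands on a balanced cascade of two-input
    parallel adders: each level halves the number of partial sums, so the
    depth equals ceil(log2(n)); a purely sequential adder needs n - 1 cycles."""
    depth = 0
    while n > 1:
        n = (n + 1) // 2        # one cascade level merges operands pairwise
        depth += 1
    return depth
```

For example, 16 operands collapse in 4 levels instead of 15 sequential additions, which is the source of the speedup claimed for the cascade scheme.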
Implementing bit-parallel circuits requires an increase in hardware costs. Of interest is a summation technology based on fast multi-input adders [45,46]. Its performance is determined by a formula in which one clock corresponds to the time of reading an operand from permanent memory and which depends on the number of summands.
We simulated the computation of several mathematical functions for different input data in accordance with the CORDIC iteration formulas [29]. The average performance indicators obtained are given in Table 12.
The values are used to evaluate the performance indicators of the software implementations of the activation functions. The corresponding computation time values are provided in Table 13. Testing was carried out on an Intel Core i7-7820X (3.6 GHz) processor using assembly language.
Table 13 shows that the s-parabola activation function outperforms the sigmoid, SiLU, and swish activation functions by about 1.6-fold based on the test results. The transition to bit-parallel formulas for implementing activation functions makes it possible to substantially equalize the computation times of the various functions. Thus, the program implementations of the bit-parallel circuits provide an average acceleration of 1.5-fold in comparison with the CORDIC implementations. Better results can only be obtained with the hardware implementation of activation functions, but such an investigation is beyond the scope of this study.
5. Discussion
We analyzed works devoted to the problem of implementing fast activation functions for artificial neural networks. We propose using “s-parabola” as an activation function along with the sigmoidal function. Computing the sigmoid as one of the main activation functions requires resource-intensive operations, such as raising to a power, division, or series expansion. Compared with the sigmoid, the proposed activation function is significantly simpler to implement and has certain advantages. To speed up calculations in systems with limited capabilities, we propose using Pukhov and Volder’s computational schemes with a fixed bit-width. The simultaneous use of these approaches ensures the parallel implementation of activation functions, demonstrating the possibility of constructing bit-parallel computational circuits that provide high performance.
The use of bit-parallel circuits and multiprocessor (multicore) devices allows us to significantly increase performance compared with implementation on a universal processor. The considered functions can be used in various multilayer ANNs, taking into account the specifics of the tasks being solved and the computing capabilities of the onboard computers. We recommend including the proposed algorithms in the mathematical software of onboard computers in the form of a library or implementing them as a universal computing module. The adjustable accuracy of calculations associated with the number of iterations performed allows us to choose a strategy depending on the time resource.
In the general case, studies have shown that applying the s-parabola function and its variations provides theoretical superiority in speed over the sigmoid, swish, hyperbolic tangent, and softsign functions, and that it is inferior only to the ReLU function. Software implementation with the swish, sigmoid, and s-parabola functions using CORDIC algorithms and bit-parallel circuits maintains this trend.
The results presented herein are subject to certain limitations, primarily attributable to the algorithmic and theoretical orientation of the study. The proposed bit-parallel computing circuits have not yet been directly implemented in hardware accelerators. The objective of this work was not to achieve direct hardware realization but rather to focus on algorithmic transformations and formal restructuring aimed at facilitating future hardware implementation. Due to this focus, no direct comparisons were conducted between the performance of neural networks using the newly proposed activation functions and those employing standard functions.
A distinguishing feature of the study lies in the development of a method for organizing bit-parallel algorithms as tabular structures defined by logical expressions. This representation provides a foundation for leveraging programmable logic arrays to accelerate the computation of activation functions in artificial neural networks. Additionally, an architecture of a specialized processing unit is proposed, featuring an instruction set capable of executing mathematical operations involved in fast activation functions. At this stage, a theoretical analysis of the computational complexity for both sequential and bit-parallel implementations of key activation functions has been carried out. Comparative software evaluations using assembly language confirm the potential and efficiency of the proposed approach for integration into hardware-based computing systems.
We plan to conduct experiments using hardware accelerators and specialized computers in subsequent studies. In addition, new activation functions will be used to solve many practical problems.