Article

High-Precision and Efficiency Hardware Implementation for GELU via Its Internal Symmetry

School of Electronic Science and Engineering, Xiamen University, Xiamen 361000, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(9), 1825; https://doi.org/10.3390/electronics14091825
Submission received: 14 April 2025 / Revised: 28 April 2025 / Accepted: 28 April 2025 / Published: 29 April 2025
(This article belongs to the Section Circuit and Signal Processing)

Abstract
The Gaussian Error Linear Unit (GELU), a crucial component of the transformer model, poses a significant challenge for hardware implementation. To address this issue, this paper proposes an internal symmetry piecewise approximation (ISPA) and an error peak search strategy (EPSS) for high-precision and high-efficiency implementation of the GELU activation function. ISPA approximates only the positive axis of the erf inside GELU and then leverages its internal symmetry to compute the negative axis. With ISPA, the mean square error (MSE) between the fitted result and the true value reaches 4.29 × 10⁻⁹ with 16 approximation segments, outperforming the regular method, which achieves 1.19 × 10⁻⁶ with 16 segments. Furthermore, EPSS automatically finds suitable, high-precision intervals for our piecewise approximation method. To evaluate the effectiveness of ISPA and EPSS, we conducted experiments on three different ViT models and observed negligible loss of prediction accuracy. The hardware is implemented on an XCZU9EG FPGA running at 450 MHz. Experimental results indicate that ISPA outperforms existing methods.

1. Introduction

The paradigm-shifting advancements in deep learning architectures have fundamentally reshaped artificial intelligence applications across modern technological ecosystems. As the computational cornerstone of neural networks, activation functions serve as critical nonlinear transformations that enable hierarchical feature extraction—a capability indispensable for solving real-world problems with intricate nonlinear patterns. Among contemporary activation paradigms, the Gaussian Error Linear Unit (GELU) [1] has emerged as a seminal innovation demonstrating exceptional versatility across diverse artificial intelligence domains.
Distinct from conventional rectified linear units, GELU introduces a probabilistic interpretation through its unique integration with Gaussian statistics, enabling adaptive input modulation based on relative magnitudes within normal distribution parameters [2]. This sophisticated mechanism inherently incorporates stochastic regularization properties during activation, effectively balancing deterministic computation with noise-induced robustness—a dual functionality that enhances model generalization without explicit regularization terms. Empirical studies [1,2,3] have quantitatively demonstrated GELU’s superior performance in mitigating overfitting while maintaining gradient stability across deep architectures.
The practical significance of GELU manifests through its pervasive adoption in state-of-the-art models spanning multiple AI disciplines. In natural language processing, it powers transformer-based giants including BERT [4], T5 [5], and GPT-2 [6], where its non-monotonic nature proves crucial for contextual representation learning. Computer vision systems equally benefit from GELU’s smooth gradient transitions, with benchmark implementations in Vision Transformers [7] and Swin Transformers [8] demonstrating consistent performance gains over traditional activation functions. Cross-domain analyses reveal GELU’s architectural agnosticism, showing competitive results in multimodal architectures (CLIP [9]) and speech recognition systems (Whisper [10]).
Currently, numerous accelerator circuit implementations have incorporated GELU computing modules, particularly in dedicated Vision Transformer (ViT) accelerator circuits [11,12,13,14,15,16,17,18] and general-purpose Transformer accelerator designs [19,20,21,22]. In these architectures, a direct hardware implementation of the GELU function is challenging; most designs, therefore, rely on approximation methods to realize GELU in circuitry. This prevalence shows that GELU computation blocks are used extensively in current hardware-accelerator research, and that designing highly accurate GELU-approximation algorithms with minimal resource overhead has considerable practical value.
However, the mathematical expression of the GELU activation function is relatively complex, involving several hardware-unfriendly operations such as exponentiation, division, and multiplication. When designing domain-specific accelerators for networks that incorporate the GELU activation function, these operations can significantly degrade the circuit’s performance, power consumption, and area efficiency.
Consequently, many prior works have employed approximation techniques to convert hardware-unfriendly operations into hardware-friendly ones, such as linear and bit-shift operations. Within an acceptable error range, these approaches simplify circuit design and enhance performance. The primary approximation methods include Taylor expansion [23,24], lookup tables [25,26], and piecewise linear functions [27,28]. Among these, the piecewise linear function method achieves the desired approximation with minimal storage and computational resources. Despite existing advancements, there remains significant room for improvement in accelerating GELU computation.
We propose a hardware-friendly algorithm capable of automatically identifying segmentation points. Unlike prior studies that approximate the function directly without leveraging its internal properties, our approach exploits the characteristics of sub-functions for both algorithm and circuit design. Furthermore, most piecewise linear function methods [29,30,31,32] rely on manually defined segments without providing a reliable segmentation algorithm. Additionally, many existing approaches adopt fixed-point data formats [29,30,31,32,33], leading to substantial quantization errors and compatibility issues. In contrast, this study employs the BF16 data format for circuit design, which reduces precision loss and enhances circuit versatility.
To accelerate the computation of the GELU function, we propose a novel algorithm and implement a corresponding hardware circuit utilizing the widely adopted BF16 data format in the deep learning field. Our main contributions are as follows:
  • We propose the internal symmetry piecewise approximation (ISPA). Instead of using the symmetry of the entire GELU activation function, we use the symmetry of the GELU’s internal Gauss error function (erf) to achieve a piecewise approximation of the positive and negative parts.
  • We propose an Error Peak Search Strategy (EPSS), an automated framework for determining optimal segmentation schemes in piecewise approximation tasks. Extensive experimental results demonstrate that EPSS achieves superior performance compared to conventional optimization methods, including but not limited to the Nelder-Mead simplex algorithm and Newton-CG (Newton Conjugate Gradient) method.
  • The proposed method is verified on three ViT models (Res-ViT, ViT-B, and ViT-L) with different configurations and demonstrates lossless precision.
  • Hardware implementation on an FPGA platform achieves lower resource costs (LUTs, registers, and BRAM) and a higher operating frequency than existing advanced methods.
The rest of this paper is organized as follows: Section 2 presents prior research efforts on hardware-accelerated computation of the GELU function. Section 3 elaborates on the algorithmic principles underlying the ISPA and EPSS methods. Section 4 details the hardware circuit implementations derived from the ISPA and EPSS methodologies. Section 5 evaluates the proposed algorithms and circuits through comprehensive performance analysis. Finally, Section 6 concludes the paper.

2. Background and Prior Research

While the GELU function can be computationally approximated, prevailing approximation methodologies necessitate the computation of nonlinear functions such as the hyperbolic tangent (tanh), making these approaches hardware-unfriendly. Therefore, to enhance the operational efficiency of neural networks within specialized architectures, there is an urgent need for a hardware-friendly approximation method for the GELU activation function.
Some prior work has addressed the approximation and hardware implementation of the GELU activation function. For instance, ref. [29] directly approximates GELU using a piecewise linear function, but this method exhibits low computational accuracy and inefficient circuit design, resulting in high hardware resource consumption. In another approach, ref. [34] designs a circuit that supports both softmax and GELU activation function calculations by leveraging their shared computational properties and optimizing logarithmic and exponential calculations through mathematical transformations. However, this design requires a significant amount of LUT and register resources, limiting its efficiency. Ref. [25] employs a lookup table method for nonlinear function calculations and introduces a new LUT structure (t-LUT), though the design ultimately consumes considerable memory resources and fails to achieve high accuracy.
Considering the balance between hardware resource consumption and computational accuracy, we chose to use the BF16 data format. BF16, introduced by Google, is a floating-point format optimized for deep learning applications. Compared to the traditional FP16 format, BF16 increases the exponent to eight bits while reducing the mantissa to seven bits. This gives BF16 a dynamic range comparable to FP32, effectively avoiding the reduced representable range often seen with lower bit-width formats. By using BF16 instead of a fixed-point format, we can significantly reduce errors associated with lower bit-width data formats and improve the accuracy of GELU activation function approximations using piecewise linear functions.
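To make the format concrete, the following minimal Python sketch (illustrative only, not part of the original design flow) shows that BF16 is simply the top 16 bits of an FP32 bit pattern, which is why it inherits FP32's dynamic range; truncation is used here for brevity, whereas practical converters often round to nearest even:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate an FP32 value to BF16 by keeping its top 16 bits."""
    fp32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return fp32_bits >> 16

def bf16_bits_to_fp32(bits16: int) -> float:
    """Expand BF16 bits back to FP32 by zero-padding the mantissa."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]

# BF16 keeps FP32's 8 exponent bits, so its dynamic range matches FP32
# (~1e-38 .. ~3e38), while FP16's 5-bit exponent saturates near 65504.
for v in (3.0e38, 1.0e-20, 0.7071):
    print(v, "->", bf16_bits_to_fp32(fp32_to_bf16_bits(v)))
```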

3. Algorithm Design

This section introduces ISPA and EPSS, a novel piecewise approximation method for GELU and an automatic interval search strategy.

3.1. Internal Symmetry Piecewise Approximation Method

The GELU activation function is a Gaussian-based activation function, and its mathematical representation is denoted as follows:
$$\mathrm{GELU}(x) = x \cdot \Phi(x) \tag{1}$$
where Φ (x) represents the standard normal cumulative distribution function (CDF) of the input x. The CDF is written as
$$\Phi(x) = 0.5 \cdot \left(1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right) \tag{2}$$
where erf denotes the error function, which can be calculated according to the following equation:
$$\mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right) = \frac{2}{\sqrt{\pi}} \int_{0}^{x/\sqrt{2}} e^{-t^{2}}\, dt \tag{3}$$
The erf is an odd function, symmetric about the zero point, and this property leads to the following transformation of the CDF:
$$\Phi(x^{-}) = 1 - 0.5 \cdot \left(1 + \mathrm{erf}\!\left(\frac{x^{+}}{\sqrt{2}}\right)\right) = 1 - \Phi(x^{+}) \tag{4}$$
where $x^{+} > 0$, $x^{-} < 0$, and $|x^{+}| = |x^{-}|$.
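The identity in Equation (4) can be checked numerically; the short Python sketch below (illustrative only) verifies it using the standard-library erf:

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF via the error function, Equation (2)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Oddness of erf gives Phi(x-) = 1 - Phi(x+), Equation (4).
for xp in (0.5, 1.3, 2.75):
    assert abs(phi(-xp) - (1.0 - phi(xp))) < 1e-12
```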
To leverage the symmetry of the erf function within the GELU activation to reduce computational complexity, we propose a symmetric transformation method called the ISPA. We directly apply a piecewise approximation of its internal erf function. Our method can be described as
$$\mathrm{GELU}(x^{+}) = x^{+} \cdot \Phi(x^{+}) = x^{+} \cdot 0.5 \cdot \left(1 + \mathrm{erf}\!\left(\frac{x^{+}}{\sqrt{2}}\right)\right) = 0.5 \cdot x^{+} \cdot \bigg(1 + \underbrace{\mathrm{erf}\!\left(\frac{x^{+}}{\sqrt{2}}\right)}_{\text{Piecewise Approximation}}\bigg) \tag{5}$$
After defining the symmetric transformation and approximation method of GELU, we implement the piecewise approximation on the erf in Equation (5). This constitutes a distinctive component of our algorithm compared to other existing works, as we implement piecewise approximation solely on a specific segment of the entire formula rather than approximating the complete GELU computation formula.
The erf is an odd function with zero-point symmetry. We divide it into n segmentation intervals, each characterized by distinct coefficients $a_i$ and $b_i$, where $i \in \{1, 2, 3, \ldots, n\}$. As Equation (6) indicates, erf is approximated by computing $a_i \cdot x + b_i$ on each interval. The approximation result for erf can be seen in Figure 1a.
$$\mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right) \approx a_{i} \cdot x + b_{i} \tag{6}$$
After approximating the erf function, the GELU formulation can be represented through the derived analytical expression in Equation (7).
$$\mathrm{GELU}(x^{+}) = 0.5 \cdot x^{+} \cdot (a_{i} \cdot x^{+} + b_{i} + 1) \tag{7}$$
Utilizing the approximation parameters obtained from the erf analysis, we establish an accurate functional representation of GELU. The resulting approximation of the GELU function is presented in Figure 1b, demonstrating high agreement with the original function through visual inspection. Quantitative evaluation of the approximation accuracy will be systematically examined in Section 5.
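To make Equations (5)–(7) concrete, the following Python sketch evaluates an ISPA-style approximation. The breakpoints and per-segment least-squares coefficients here are illustrative placeholders, not the EPSS-optimized values reported later in the paper:

```python
import math
import numpy as np

# Illustrative breakpoints on [0, 3]; the paper's EPSS-selected points differ.
BREAKS = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

def fit_segments(breaks):
    """Least-squares line a_i*x + b_i per segment for erf(x/sqrt(2)), Eq. (6)."""
    coeffs = []
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        xs = np.linspace(lo, hi, 64)
        ys = np.array([math.erf(x / math.sqrt(2.0)) for x in xs])
        a, b = np.polyfit(xs, ys, 1)
        coeffs.append((a, b))
    return coeffs

COEFFS = fit_segments(BREAKS)

def gelu_ispa(x: float) -> float:
    """Approximate GELU via internal symmetry (Equations (5) and (7))."""
    xp = abs(x)
    if xp >= BREAKS[-1]:
        erf_approx = 1.0                 # erf saturates for inputs beyond 3
    else:
        i = np.searchsorted(BREAKS, xp, side="right") - 1
        a, b = COEFFS[i]
        erf_approx = a * xp + b
    pos = 0.5 * xp * (1.0 + erf_approx)  # GELU(x+), Equation (7)
    return pos if x >= 0 else x + pos    # GELU(x-) = x- + GELU(x+), from Eq. (4)
```

The negative branch follows directly from Equation (4): $\mathrm{GELU}(x^{-}) = x^{-}\Phi(x^{-}) = x^{-}(1 - \Phi(x^{+})) = x^{-} + \mathrm{GELU}(x^{+})$, which is exactly the adder path implemented in hardware in Section 4.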

3.2. Error Peak Search Strategy

We utilize piecewise linear approximation for the erf function to facilitate simplified circuit implementation. In piecewise linear approximations, breakpoint selection constitutes a critical factor, as optimal positioning achieves enhanced approximation accuracy with reduced segment counts while controlling hardware complexity. To address this challenge, we develop an automated breakpoint search framework—the Error Peak Search Strategy (EPSS)—specifically designed for high-precision approximation, enabling efficient identification of optimal segmentation points.
The EPSS determines new breakpoints by analyzing the approximation errors generated during piecewise linear fitting of nonlinear functions. We analyze this approach through a case study in which piecewise linear approximations first fit the erf function and subsequently implement the GELU approximation. As illustrated in Figure 2a, the absolute error between the ISPA-based piecewise linear approximation and the original erf function is symmetric about zero, a consequence of exploiting the erf function's intrinsic symmetry during fitting. This symmetry permits EPSS optimization to concentrate exclusively on the [0, 8] interval, with optimized breakpoints automatically mirrored to [−8, 0]. Analysis of the erf curve reveals that outputs asymptotically approach 1 for inputs exceeding 3. Therefore, we truncate the fitting domain at x = 3 and approximate all values on $(3, \infty)$ as the constant 1, establishing [0, 3] as the initial segmentation interval.
The initial interval is divided into six segments with lengths constrained to integer multiples of $2^{-n}$, a design choice motivated by hardware implementation requirements. This quantization scheme ensures that breakpoint coordinates are exactly representable in the BF16 format, minimizing parameter storage errors. EPSS identifies the dominant error peak ($\alpha$) and compares the adjacent peaks in the positive ($\beta$) and negative ($\gamma$) directions. The interval containing the higher-magnitude peak undergoes refinement through midpoint insertion. Figure 2b demonstrates this process: with six initial breakpoints, the $\beta$ peak dominates, prompting refinement of ($\alpha$, $\beta$). Subsequent error analysis (Figure 2c) shows a significant error reduction in the modified region when progressing to seven breakpoints. Further optimization to eight breakpoints (Figure 2d) eliminates the $\alpha$ peak while achieving comprehensive error suppression, validating EPSS's interval optimization efficacy. The iterative breakpoint identification process is formalized in Algorithm 1.
Algorithm 1 Error Peak Search Strategy

Require: InitSegments, MaxSegNums
Ensure: NewSegments
 1: NewSegments = InitSegments
 2: n = Sizeof(NewSegments)
 3: while n < MaxSegNums do
 4:     α, β, γ ← AbsoluteErrorCal(GELU_origin, GELU_Fitted(NewSegments))
 5:     if γ < β then
 6:         NewBreakpoint = (α + β) / 2
 7:     else
 8:         NewBreakpoint = (α + γ) / 2
 9:     end if
10:     NewSegments.insert(NewBreakpoint)
11:     n = n + 1
12: end while
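For reference, a runnable Python sketch of Algorithm 1 follows. The peak-extraction logic stands in for the paper's AbsoluteErrorCal step, and the dense sampling grid and per-segment least-squares fitting are our assumptions; a hardware-oriented variant would additionally snap new breakpoints onto the $2^{-n}$ grid described in Section 4.1:

```python
import math
import numpy as np

SQRT2 = math.sqrt(2.0)

def erf_fit(breaks, xs):
    """Per-segment least-squares linear fit of erf(x/sqrt(2)), evaluated at xs."""
    ys = np.empty_like(xs)
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        t = np.linspace(lo, hi, 64)
        a, b = np.polyfit(t, [math.erf(v / SQRT2) for v in t], 1)
        m = (xs >= lo) & (xs <= hi)
        ys[m] = a * xs[m] + b
    return ys

def epss(init_breaks, max_segments):
    """Error Peak Search Strategy (Algorithm 1): insert the midpoint between
    the dominant error peak (alpha) and its larger neighbour (beta or gamma)."""
    breaks = sorted(init_breaks)
    xs = np.linspace(breaks[0], breaks[-1], 4001)
    gelu_exact = np.array([0.5 * v * (1.0 + math.erf(v / SQRT2)) for v in xs])
    while len(breaks) - 1 < max_segments:
        gelu_fit = 0.5 * xs * (1.0 + erf_fit(breaks, xs))
        err = np.abs(gelu_fit - gelu_exact)
        # local maxima of the absolute error curve
        peaks = [i for i in range(1, len(xs) - 1)
                 if err[i] >= err[i - 1] and err[i] >= err[i + 1]]
        k = max(peaks, key=lambda i: err[i])                  # alpha
        left = [i for i in peaks if i < k]                    # gamma side
        right = [i for i in peaks if i > k]                   # beta side
        gamma = max(left, key=lambda i: err[i], default=None)
        beta = max(right, key=lambda i: err[i], default=None)
        if beta is not None and (gamma is None or err[gamma] < err[beta]):
            breaks.append(0.5 * (xs[k] + xs[beta]))           # refine (alpha, beta)
        else:
            breaks.append(0.5 * (xs[k] + xs[gamma]))          # refine (alpha, gamma)
        breaks.sort()
    return breaks

# Six initial segments on [0, 3], grown to 16 segments as in ISPA-16:
print(epss([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0], 16))
```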

4. Hardware Architecture and Implementation Details

This section presents the hardware circuit implementing ISPA, which we proposed in Section 3, utilizing the BF16 data format. The discussion covers the overall circuit framework and the internal structures of the multiplier and adder specifically designed to support the BF16 data format.

4.1. Overall Architecture

The overall block diagram of the accelerator is shown in Figure 3. Each GELU function computation requires two clock cycles.
Firstly, the calculation of Equation (6) is executed. The input x is evaluated to determine the interval to which it belongs, and the corresponding coefficients $a_i$ and $b_i$ are retrieved from the LUT and sent to the multiplier and adder, respectively. The final output of the adder is temporarily stored in a register.
Based on Equation (7), the subsequent steps are executed using the result from stage one. The value 0.5x and the stage-one result are passed to the multiplier. By leveraging the BF16 format, the multiplication by 0.5 does not require a multiplier; instead, the exponent of x is decremented by 1 to compute 0.5x, reducing the computational load. The value of x and the output of the multiplier are then passed to the adder. Finally, the MUX unit selects either the multiplier or the adder output to pass to the register, based on the sign of the input x.
$$\mathrm{GELU}(x^{+}) = 0.5 \cdot x^{+} \cdot (a_{i} \cdot x^{+} + b_{i} + 1) = 0.5 \cdot x^{+} \cdot (a_{i} \cdot x^{+} + b_{i}') \tag{8}$$
Additionally, as Equation (8) shows, the constant 1 can be incorporated into the coefficient $b_i$, which is stored in the circuit as $b_i' = b_i + 1$. Figure 4 presents this approach, which reduces the number of addition operations required during computation.
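The resulting two-stage datapath can be summarized with a small behavioral sketch; the `lut_lookup` helper is a hypothetical stand-in for the coefficient LUT described above:

```python
def gelu_datapath(x: float, lut_lookup) -> float:
    """Cycle-level sketch of the datapath in Figure 3.
    lut_lookup(|x|) returns (a_i, b_i') for the interval containing |x|."""
    xp = abs(x)
    a, b1 = lut_lookup(xp)
    stage1 = a * xp + b1          # cycle 1: a_i*x + b_i'  (approx. 1 + erf)
    half_x = 0.5 * xp             # exponent decrement in BF16, no multiplier
    mul_out = half_x * stage1     # cycle 2: GELU(|x|), Equation (8)
    add_out = x + mul_out         # adder path: GELU(x-) = x- + GELU(x+)
    return mul_out if x >= 0 else add_out   # MUX on sign(x)
```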
Furthermore, the initial interval lengths received by EPSS are, by construction, integer multiples of $2^{-n}$, and newly generated partition points are always positioned at the midpoints of existing intervals, so the resulting sub-intervals retain lengths that are integer multiples of $2^{-n}$. This mathematical property ensures that all partition-point values in the circuit design can be represented exactly in the BF16 data format, incurring no rounding errors during storage.
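Both hardware tricks described above, halving by exponent decrement and exact breakpoint storage, can be illustrated in a few lines of Python (a sketch assuming truncation-based FP32-to-BF16 conversion and normal, non-zero operands):

```python
import struct

def bf16_bits(x: float) -> int:
    """FP32 -> BF16 by truncation (top 16 bits of the FP32 pattern)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def bf16_value(bits: int) -> float:
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

def halve_bf16(bits: int) -> int:
    """Multiply a normal, non-zero BF16 value by 0.5 without a multiplier:
    decrement the 8-bit exponent field (bits 14..7)."""
    return bits - (1 << 7)

assert bf16_value(halve_bf16(bf16_bits(1.5))) == 0.75

# Breakpoints that are multiples of 2**-n (e.g. 0.375 = 3 * 2**-3) have
# short mantissas and round-trip through BF16 exactly.
for bp in (0.375, 1.25, 2.5, 3.0):
    assert bf16_value(bf16_bits(bp)) == bp
```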

4.2. Basic Calculate Unit

To perform multiplication and addition operations in the BF16 data format, we designed the corresponding multiplier and adder circuits. We analyzed the computation process of BF16 data, handling the sign bit, exponent, and mantissa separately to design the multiplier and adder. To improve the operating frequency of the circuit, the internal structure of both the multiplier and adder was designed using two-stage pipelining techniques.
The arithmetic units for BF16 floating-point operations employ a unified two-stage processing pipeline with tailored computational steps for multiplication and addition, as illustrated in Figure 5a and Figure 5b, respectively. Both implementations share fundamental normalization and overflow handling mechanisms while differing in their initial computational approaches.
For multiplication, the first stage combines the exponents through addition and computes the mantissa product through binary multiplication, generating a sixteen-bit intermediate result with seven higher bits preserved for rounding precision. Conversely, the adder’s initial phase aligns exponents by shifting the smaller-magnitude operand’s mantissa based on exponent differences, followed by mantissa addition/subtraction.
The second stage demonstrates architectural convergence through three essential operations: normalization, rounding, and overflow management. The multiplier performs normalization through bit-shifting and subtractive exponent adjustment to maintain the leading one convention, while the adder resolves carry propagation and mantissa realignment through similar shift operations. Both units incorporate overflow detection mechanisms—the multiplier limits output within representable ranges, whereas the adder employs a fail-safe zero-output strategy for overflow conditions.
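As an illustration of the multiplier's two stages, the following bit-level Python model processes 16-bit BF16 patterns. It is a simplified sketch: it truncates rather than rounds, flushes underflow to zero, and clamps overflow to the largest finite value, so it only approximates the hardware behavior described above:

```python
def bf16_mul(a: int, b: int) -> int:
    """Two-stage BF16 multiplier sketch; a, b, result are 16-bit patterns."""
    # --- Stage 1: sign, exponent sum, 8x8 mantissa product ---
    sign = ((a >> 15) ^ (b >> 15)) & 1
    ea, eb = (a >> 7) & 0xFF, (b >> 7) & 0xFF
    if ea == 0 or eb == 0:                   # zero/subnormal inputs -> signed zero
        return sign << 15
    ma, mb = (a & 0x7F) | 0x80, (b & 0x7F) | 0x80   # restore the hidden one
    prod = ma * mb                           # 15- or 16-bit intermediate product
    exp = ea + eb - 127
    # --- Stage 2: normalization, truncation, overflow handling ---
    if prod & 0x8000:                        # product in [2, 4): shift right once
        mant = (prod >> 8) & 0x7F
        exp += 1
    else:                                    # product in [1, 2)
        mant = (prod >> 7) & 0x7F
    if exp >= 0xFF:                          # overflow: clamp to max finite value
        return (sign << 15) | (0xFE << 7) | 0x7F
    if exp <= 0:                             # underflow: flush to zero
        return sign << 15
    return (sign << 15) | (exp << 7) | mant

# 1.5 * 2.5 = 3.75: 0x3FC0 * 0x4020 -> 0x4070
assert bf16_mul(0x3FC0, 0x4020) == 0x4070
```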

5. Experiments and Performance Evaluation

This section evaluates ISPA, EPSS, and the designed circuit from several aspects, including algorithmic error, the actual inference accuracy of a deep neural network (DNN), and the implementation results of the hardware circuit. To evaluate the fitting accuracy of the piecewise functions obtained by EPSS under different segmentation counts and the area consumption of the ISPA computational circuit, assessments were conducted for the cases with eight segments (ISPA-8) and sixteen segments (ISPA-16).

5.1. Quantitative Error Characterization

As shown in Figure 6, we compared the approximation results obtained by directly applying EPSS to the GELU function versus applying ISPA. Due to the order-of-magnitude difference in accuracy between the two fitting methods, a logarithmic axis is used in the figure. The results demonstrate that piecewise linear fitting of the internal erf yields higher accuracy. We attribute this to the relatively simpler curve structure of the erf compared to the GELU. Additionally, retaining the 0.5× multiplication after approximating the erf preserves part of the GELU calculation process, which further contributes to accuracy improvement.
As shown in Table 1, the mean square error (MSE) and max absolute error (MAE) between the approximated results and the exact results are presented for comparison. The segment number column denotes the number of segments in the piecewise approximation method. The results indicate that our fitted GELU function achieved higher accuracy than other methods with fewer segments. Table 1 also presents a comparison of the fitting accuracy between ISPA-8 and ISPA-16. It is evident that after using EPSS to identify new segmentation points, increasing the number of segments in the piecewise function from 8 to 16 leads to a significant improvement in fitting accuracy.
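Metrics of this kind can be computed along the following lines; the dense uniform sampling grid here is our assumption, as the paper does not state its evaluation grid:

```python
import math
import numpy as np

def exact_gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def error_metrics(approx_fn, lo=-8.0, hi=8.0, samples=100_001):
    """Return (MSE, max absolute error) of approx_fn over a uniform grid."""
    xs = np.linspace(lo, hi, samples)
    errs = np.array([approx_fn(x) - exact_gelu(x) for x in xs])
    return float(np.mean(errs ** 2)), float(np.max(np.abs(errs)))

# Example: mse, mae = error_metrics(gelu_ispa)   # gelu_ispa from Section 3.1 sketch
```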

5.2. DNN Accuracy Test

To test the potential impact of our ISPA on the actual application of DNNs, we selected the ViT (Vision Transformer) [35] for evaluation. As illustrated in Figure 7, the architectural framework of ViT primarily consists of Transformer encoder modules. The GELU activation function is implemented within the MLP module of the encoder and is invoked multiple times throughout the entire computational process of ViT. We utilized Google’s pre-trained ViT model based on ImageNet21K and fine-tuned it on the CIFAR-100 dataset. After completing the training, we replaced the GELU function in the ViT network with our proposed fitted function and then performed inference to test whether the inference accuracy was affected by the fitted GELU function.
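A typical way to perform such a replacement in a PyTorch ViT is sketched below. The module name, breakpoints, and coefficients are illustrative placeholders, and the bucketize-based interval lookup is our choice, not necessarily the authors' implementation:

```python
import torch
import torch.nn as nn

class FittedGELU(nn.Module):
    """Drop-in replacement for nn.GELU using the ISPA piecewise fit.
    `breaks` has n+1 points for n segments; `coeffs` holds (a_i, b_i) pairs."""
    def __init__(self, breaks, coeffs):
        super().__init__()
        self.register_buffer("breaks", torch.tensor(breaks))
        self.register_buffer("a", torch.tensor([c[0] for c in coeffs]))
        self.register_buffer("b", torch.tensor([c[1] for c in coeffs]))

    def forward(self, x):
        xp = x.abs()
        idx = torch.bucketize(xp, self.breaks[1:-1])   # segment index per element
        erf_approx = self.a[idx] * xp + self.b[idx]
        erf_approx = torch.where(xp >= self.breaks[-1],
                                 torch.ones_like(xp), erf_approx)  # saturation
        pos = 0.5 * xp * (1.0 + erf_approx)            # GELU(x+), Equation (7)
        return torch.where(x >= 0, pos, x + pos)       # GELU(x-) = x- + GELU(x+)

def swap_gelu(model: nn.Module, breaks, coeffs):
    """Recursively replace every nn.GELU in `model` with FittedGELU."""
    for name, child in model.named_children():
        if isinstance(child, nn.GELU):
            setattr(model, name, FittedGELU(breaks, coeffs))
        else:
            swap_gelu(child, breaks, coeffs)
```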
Table 2 and Table 3 show the test results. We evaluated three different ViT configurations: Base (ViT-B), Large (ViT-L), and ResNet Backbone (Res-ViT) with ISPA-8 and ISPA-16. The results indicate that using ISPA-8 for ViT model inference achieves minimal accuracy loss while using ISPA-16 does not result in any accuracy loss. This demonstrates that EPSS can identify the segmentation points required for high-precision fitting and validate the effectiveness of ISPA.

5.3. Hardware Resource Evaluation

We used Vivado to perform synthesis and implementation of the hardware circuit, targeting the XCZU9EG device. Table 4 shows the resources of the XCZU9EG. Table 5 presents the resource consumption of the circuit and compares it with that of other GELU accelerator circuits. Compared to other designs, our design consumes fewer logic resources and registers, uses no digital signal processing (DSP) blocks for computation, and achieves a higher operating frequency. Table 5 also indicates that there is no significant difference in the resource consumption of the ISPA computational circuit across different segmentation counts. Therefore, under varying application scenarios, the choice of circuit configuration can be determined primarily by the required computational accuracy.

6. Conclusions

In this study, we present a systematic investigation of activation function approximation through a novel methodology named ISPA. The core innovation lies in exploiting the inherent odd function property of the error function to construct a piecewise linear approximation for Gaussian Error Linear Unit activation, effectively combining analytical approximation with an automated piecewise segmentation strategy.
Furthermore, we implement a hardware-efficient architecture on an FPGA platform. The proposed design demonstrates superior resource efficiency, requiring only 337 LUTs and 185 FFs for the ISPA-16 implementation. The implementation demonstrates that using the internal symmetry of erf to approximate GELU achieves higher fitting accuracy and saves more resources than existing approximation methods. Compared with [25,29,30,33], our work achieves lower hardware resource utilization and a higher operating frequency without employing any DSPs. The proposed techniques establish a new paradigm for activation function implementation that harmonizes mathematical precision with hardware pragmatism.

Author Contributions

Conceptualization, J.H.; methodology, J.H.; software, J.H.; validation, J.H.; writing—original draft, J.H.; writing—review and editing, J.H., Y.W., M.Z. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly supported by the Fujian Provincial Department of Science and Technology (Grant No. 2022I0001).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  2. Lee, M. Mathematical analysis and performance evaluation of the GELU activation function in deep learning. J. Math. 2023, 2023, 4229924. [Google Scholar] [CrossRef]
  3. Dubey, S.R.; Singh, S.K.; Chaudhuri, B.B. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing 2022, 503, 92–108. [Google Scholar]
  4. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  5. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 140:1–140:67. [Google Scholar]
  6. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  7. Zhang, P.; Dai, X.; Yang, J.; Xiao, B.; Yuan, L.; Zhang, L.; Gao, J. Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2978–2988. [Google Scholar]
  8. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  9. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Westminster, UK, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  10. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the International Conference on Machine Learning, ICML 2023, Honolulu, HI, USA, 23–29 July 2023; PMLR: Westminster, UK, 2023; Volume 202, pp. 28492–28518. [Google Scholar]
  11. Wang, T.; Gong, L.; Wang, C.; Yang, Y.; Gao, Y.; Zhou, X.; Chen, H. ViA: A Novel Vision-Transformer Accelerator Based on FPGA. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2022, 41, 4088–4099. [Google Scholar] [CrossRef]
  12. Nag, S.; Datta, G.; Kundu, S.; Chandrachoodan, N.; Beerel, P.A. ViTA: A Vision Transformer Inference Accelerator for Edge Applications. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5. [Google Scholar]
  13. You, H.; Sun, Z.; Shi, H.; Yu, Z.; Zhao, Y.; Zhang, Y.; Li, C.; Li, B.; Lin, Y. ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Montreal, QC, Canada, 25 February–1 March 2023; pp. 273–286. [Google Scholar]
  14. Dumoulin, J.; Houshmand, P.; Jain, V.; Verhelst, M. Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge. In Proceedings of the 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, 19–22 May 2024; pp. 1–5. [Google Scholar]
  15. Marino, K.; Zhang, P.; Prasanna, V.K. ME-ViT: A Single-Load Memory-Efficient FPGA Accelerator for Vision Transformers. In Proceedings of the 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), Goa, India, 18–21 December 2023; pp. 213–223. [Google Scholar]
  16. Dong, P.; Zhuang, J.; Yang, Z.; Ji, S.; Li, Y.; Xu, D.; Huang, H.; Hu, J.; Jones, A.K.; Shi, Y.; et al. EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2024, 43, 3949–3960. [Google Scholar] [CrossRef]
  17. Parikh, D.; Li, S.; Zhang, B.; Kannan, R.; Busart, C.; Prasanna, V. Accelerating ViT Inference on FPGA through Static and Dynamic Pruning. In Proceedings of the 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Orlando, FL, USA, 5–8 May 2024; pp. 78–89. [Google Scholar]
  18. Tian, S.; Szafranski, C.; Zheng, C.; Yao, F.; Louri, A.; Chen, C.; Zheng, H. VITA: ViT Acceleration for Efficient 3D Human Mesh Recovery via Hardware-Algorithm Co-Design. In Proceedings of the 61st ACM/IEEE Design Automation Conference, DAC ’24, San Francisco, CA, USA, 23–27 June 2024. [Google Scholar]
  19. Han, Y.; Liu, Q. HPTA: A High Performance Transformer Accelerator Based on FPGA. In Proceedings of the 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 4–8 September 2023; pp. 27–33. [Google Scholar]
  20. Zhou, M.; Xu, W.; Kang, J.; Rosing, T. TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer. In Proceedings of the 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Republic of Korea, 2–6 April 2022; pp. 1071–1085. [Google Scholar]
  21. Luo, Y.; Yu, S. H3D-Transformer: A Heterogeneous 3D (H3D) Computing Platform for Transformer Model Acceleration on Edge Devices. ACM Trans. Des. Autom. Electron. Syst. 2024, 29, 1–19. [Google Scholar] [CrossRef]
  22. Wang, H.Y.; Chang, T.S. Row-wise Accelerator for Vision Transformer. In Proceedings of the 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), Incheon, Republic of Korea, 13–15 June 2022; pp. 399–402. [Google Scholar]
  23. Nilsson, P.; Shaik, A.U.R.; Gangarajaiah, R.; Hertz, E. Hardware implementation of the exponential function using Taylor series. In Proceedings of the 2014 NORCHIP, Tampere, Finland, 27–28 October 2014; pp. 1–4. [Google Scholar]
  24. Qin, Z.; Qiu, Y.; Sun, H.; Lu, Z.; Wang, Z.; Shen, Q.; Pan, H. A Novel Approximation Methodology and Its Efficient VLSI Implementation for the Sigmoid Function. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 3422–3426. [Google Scholar] [CrossRef]
  25. Xie, Y.; Joseph Raj, A.N.; Hu, Z.; Huang, S.; Fan, Z.; Joler, M. A Twofold Lookup Table Architecture for Efficient Approximation of Activation Functions. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 2540–2550. [Google Scholar] [CrossRef]
  26. Leboeuf, K.; Namin, A.H.; Muscedere, R.; Wu, H.; Ahmadi, M. High Speed VLSI Implementation of the Hyperbolic Tangent Sigmoid Function. In Proceedings of the 2008 Third International Conference on Convergence and Hybrid Information Technology, Busan, Republic of Korea, 11–13 November 2008; Volume 1, pp. 1070–1073. [Google Scholar]
  27. Chiluveru, S.R.; Gyanendra; Chunarkar, S.; Tripathy, M.; Kaushik, B.K. Efficient Hardware Implementation of DNN-Based Speech Enhancement Algorithm With Precise Sigmoid Activation Function. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 3461–3465. [Google Scholar] [CrossRef]
  28. Choi, K.; Kim, S.; Kim, J.; Park, I.C. Hardware-Friendly Approximation for Swish Activation and Its Implementation. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 4516–4520. [Google Scholar] [CrossRef]
  29. Sadeghi, M.E.; Fayyazi, A.; Azizi, S.; Pedram, M. PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers. In Proceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design, Newport Beach, CA, USA, 5–7 August 2024; pp. 1–6. [Google Scholar]
  30. Hong, Q.; Liu, Z.; Long, Q.; Tong, H.; Zhang, T.; Zhu, X.; Zhao, Y.; Ru, H.; Zha, Y.; Zhou, Z.; et al. A reconfigurable multi-precision quantization-aware nonlinear activation function hardware module for DNNs. Microelectron. J. 2024, 151, 106346. [Google Scholar] [CrossRef]
  31. Li, L.; Zhang, S.; Wu, J. An Efficient Hardware Architecture for Activation Function in Deep Learning Processor. In Proceedings of the 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC), Chongqing, China, 27–29 June 2018; pp. 911–918. [Google Scholar]
  32. Liu, K.; Shi, W.; Huang, C.; Zeng, D. Cost effective Tanh activation function circuits based on fast piecewise linear logic. Microelectron. J. 2023, 138, 105821. [Google Scholar] [CrossRef]
  33. Li, Y.; Cao, W.; Zhou, X.; Wang, L. A Low-Cost Reconfigurable Nonlinear Core for Embedded DNN Applications. In Proceedings of the 2020 International Conference on Field-Programmable Technology (ICFPT), Maui, HI, USA, 9–11 December 2020; pp. 35–38. [Google Scholar]
  34. Li, T.; Zhang, F.; Xie, G.; Fan, X.; Gao, Y.; Sun, M. A high speed reconfigurable architecture for softmax and GELU in vision transformer. Electron. Lett. 2023, 59, e12751. [Google Scholar] [CrossRef]
  35. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR, Vienna, Austria, 4 May 2021. [Google Scholar]
Figure 1. (a) Approximation of erf(x/√2). (b) The resulting approximation of the GELU function.
Figure 2. EPSS Execution Process. (a) The absolute error between the piecewise linear fitting function with six segmentation points and the original function over the interval [−8, 8]. (b) The absolute error obtained by inputting six segmentation points during EPSS initialization. (c) Comparison of absolute errors obtained after EPSS added a new segmentation point. (d) After two runs of EPSS, the absolute error in the highlighted blue region significantly decreased.
Figure 3. Overall structure of the GELU accelerator.
Figure 4. The optimized circuit eliminates the area consumption of one BF16 adder.
Figure 5. The internal structure of the basic computation unit. (a) BF16Mul. (b) BF16Add.
Figure 6. Comparison of accuracy of fitting GELU and erf.
Figure 7. Structure of Vision Transformer.
Table 1. Comparison of Algorithm Error.

| Method | Input Interval | Segment Number | MSE | MAE |
|---|---|---|---|---|
| [25] | [−8, 8] | 16 | 1.19 × 10⁻⁶ | 1.95 × 10⁻³ |
| [29] | [−4, 4] | 10 | 8.31 × 10⁻⁵ | N/A |
| [33] | [−4, 4] | N/A | 7.10 × 10⁻³ | 1.13 × 10⁻³ |
| [30] | [−8, 8] | 8 | 1.54 × 10⁻⁶ | 4.06 × 10⁻³ |
| ISPA-8 | [−8, 8] | 8 | 3.97 × 10⁻⁸ | 2.74 × 10⁻⁵ |
| ISPA-16 | [−8, 8] | 16 | 4.29 × 10⁻⁹ | 1.07 × 10⁻⁵ |
Table 2. Accuracy evaluation of ViT with ISPA-8.

| | Res-ViT TOP-1 | Res-ViT TOP-5 | ViT-B TOP-1 | ViT-B TOP-5 | ViT-L TOP-1 | ViT-L TOP-5 |
|---|---|---|---|---|---|---|
| Baseline | 90.97 | 99.03 | 92.17 | 99.10 | 93.32 | 99.30 |
| Fitted NN | 90.94 | 99.03 | 92.16 | 99.10 | 93.29 | 99.30 |
| Acc. Loss | −0.03 | 0.00 | −0.03 | 0.00 | −0.03 | 0.00 |

The unit of accuracy is a percentage.
Table 3. Accuracy evaluation of ViT with ISPA-16.

| | Res-ViT TOP-1 | Res-ViT TOP-5 | ViT-B TOP-1 | ViT-B TOP-5 | ViT-L TOP-1 | ViT-L TOP-5 |
|---|---|---|---|---|---|---|
| Baseline | 90.97 | 99.03 | 92.17 | 99.10 | 93.32 | 99.30 |
| Fitted NN | 90.97 | 99.03 | 92.17 | 99.10 | 93.32 | 99.30 |
| Acc. Loss | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

The unit of accuracy is a percentage.
Table 4. XCZU9EG Resources.

| Device | LUT | Slice Register | DSP | BRAM (Mb) |
|---|---|---|---|---|
| XCZU9EG | 274,080 | 548,160 | 2520 | 32.1 |
Table 5. Comparison of hardware resources.

| Method | Device | LUT | Slice Register | DSP | BRAM (Bits) | Frequency (MHz) |
|---|---|---|---|---|---|---|
| [25] | XC7S50 | 176 * | 0 | 0 | 11,264 | 50 |
| [29] | XCVU9P | 2940 | 2951 | 16 | 0 | 250 |
| [33] | XC7Z045 | 324 | 318 | 1 | 0 | 410 |
| [30] | XC7Z010 | 219 | 247 | 0.5 | 0 | 312.5 |
| ISPA-8 | XCZU9EG | 295 | 194 | 0 | 0 | 450 |
| ISPA-16 | XCZU9EG | 337 | 185 | 0 | 0 | 450 |

* The original design used 11,264 LUT bits, equal to 176 LUTs.
