1. Introduction
Artificial Intelligence (AI) has fundamentally reshaped the landscape of modern technology, becoming deeply embedded across diverse sectors. In the medical field, AI algorithms are now pivotal for diagnostics and treatment planning, whereas in the academic sphere, they facilitate personalized education. Furthermore, the corporate and entertainment industries rely heavily on intelligent systems to streamline workflows and generate creative content. At the forefront of this evolution are Large Language Models (LLMs), specifically Generative Pretrained Transformers (GPTs).
While these models offer unprecedented proficiency in processing and generating human language, their reliance on complex, data-intensive computations creates a critical challenge: the deployment of such high-performance systems necessitates immense computational power, resulting in a rapidly escalating energy demand. This growing energy demand has raised concerns regarding the sustainability and environmental impact of large-scale AI infrastructure. Recent projections highlight the magnitude of this issue: under high-demand scenarios, the annual energy consumption of US-based AI servers is projected to reach approximately 537 terawatt-hours (TWh) by 2030, representing a nearly 18-fold increase compared to 2024 levels and more than doubling previous market outlooks [1]. This dramatic escalation underscores the urgent need for research into energy-efficient AI architectures, optimization strategies, and sustainable practices to mitigate the environmental footprint of emerging AI technologies. A recent paper from Patterson et al. [2] estimated the energy use of several recent large models, such as T5, Meena, GShard, Switch Transformer, and GPT-3.
The electricity required to operate an LLM depends on several factors, including the underlying algorithm, the software implementation, the number of processors utilized, and the speed and power efficiency of those processors. In contemporary practice, LLMs are predominantly deployed on high-performance CPUs and GPUs housed within large-scale, energy-intensive data centers. While these traditional processing units deliver high computational throughput, they are not inherently optimized for energy efficiency.
In contrast, reconfigurable hardware such as Field-Programmable Gate Arrays (FPGAs) has demonstrated superior energy efficiency across various application domains when compared to conventional CPUs and GPUs. This improvement in energy efficiency mainly comes from the fact that FPGAs allow the customization of hardware circuits to execute a specific function. By tailoring the circuit to the target task, FPGAs eliminate many unnecessary operations that are typically required in general-purpose computing. As a result, computation is more efficient, faster, and consumes significantly less energy. A CPU, in turn, is usually the best compromise for controlling the operations performed by the FPGA. These complementary strengths have led to increasing interest in heterogeneous computing architectures, which integrate CPUs, GPUs, and FPGAs to form sustainable and reconfigurable computing platforms [3]. Such systems have the potential to significantly reduce the energy footprint of AI workloads, including those associated with LLMs.
1.1. LLM Inference Implementations
LLM inference implementations are commonly developed in low-level programming languages such as C and C++, which offer direct access to system resources and efficient memory management. These implementations can be deployed on local platforms with significantly lower memory requirements compared to cloud-based solutions. Deploying LLMs locally not only reduces memory and computational overhead but also provides developers with greater control and flexibility for analyzing and optimizing performance-critical code paths. Furthermore, C and C++ are widely supported by high-level synthesis (HLS) tools, making them well-suited for developing hardware-accelerated solutions. This compatibility facilitates the integration of LLM workloads with reconfigurable computing platforms, such as FPGAs, enabling efficient hardware-software co-design for energy optimization and performance improvements.
The drive for greater efficiency and lower power consumption is critical to reducing the cost and environmental impact of LLMs. Training state-of-the-art models is estimated to consume significant energy, and the energy consumed by inference over the lifetime of a model can be 2 to 25 times greater than that of training [4,5,6,7,8]. This highlights a substantial opportunity for optimization: improvements in inference power efficiency can lead to dramatic cost reductions and environmental benefits.
1.2. Heterogeneous Computing for LLM Inference
Heterogeneous computing platforms combine different types of processing units such as CPUs, GPUs, FPGAs, and specialized AI accelerators within a single system to leverage their unique strengths for specific computational tasks. This architecture enables LLMs to run more efficiently by assigning workloads to the most suitable hardware component: GPUs or TPUs handle dense matrix operations for training and inference, CPUs manage control and data orchestration, and FPGAs can accelerate customized operations with lower power consumption. By distributing computation intelligently across heterogeneous resources, these platforms improve throughput, reduce latency, and optimize energy efficiency, making the deployment of LLMs more scalable and cost-effective.
Nevertheless, the compilation flow for heterogeneous devices is still complex, requiring very specific skills and complex toolchains. HLS offers a promising approach for designing custom hardware accelerators on FPGAs by enabling the use of C-like languages for hardware development [
9]. This approach allows for a more productive design process compared to traditional hardware description languages (HDLs) such as Verilog.
In this paper, we leverage novel design tools, notably HLS, to reduce the energy impact of LLMs.
The remainder of this paper is structured as follows.
Section 2 provides an analysis of related work.
Section 3 details the proposed methodology for improving the energy efficiency of LLMs.
Section 4 presents a detailed evaluation of the HLS optimization process and compares the final performance, resource usage, and power consumption of the proposed architecture.
Section 5 discusses the experimental results in the context of prior work and outlines future directions. Finally,
Section 6 presents the conclusions.
2. Analysis of Related Works
Prior work on FPGA-based acceleration for modern deep learning models reveals a clear emphasis on approaches that reduce arithmetic complexity, improve data movement efficiency, and enable practical deployment through repeatable design flows. In addition to low-bitwidth model families such as BitNet [
10] and matrix multiplication (MatMul)-reduced language modeling approaches [
11], there is an active body of work on mapping quantized neural networks to reconfigurable hardware through reusable compilation frameworks and HLS-based flows. These contributions provide context for the method proposed in this paper, which focuses on ternary weights and digital signal processor (DSP) reduction for power-efficient embedded inference.
2.1. Toolflows for Building and Deploying Quantized Neural Networks on FPGAs
Beyond single-purpose accelerators, a key direction in FPGA-based AI is the development of toolflows that automate the translation of trained models into efficient hardware implementations. FINN and FINN-R exemplify end-to-end design automation for highly quantized neural networks on FPGAs, emphasizing logic-oriented compute and design space exploration, while HLS-centered flows such as hls4ml emphasize developer productivity and systematic exploration of throughput and resource trade-offs [12,13,14]. These works help motivate the focus of this paper on a deployable embedded design point, where timing closure, DSP availability, and power are primary constraints.
2.2. Multiplier-Minimized and Multiplierless Compute on FPGAs
Another relevant line of work focuses on eliminating or minimizing multiplications in the dominant compute kernels by restricting weights and activations to extremely low precision formats. In particular, binarized neural networks mapped to FPGAs can replace multipliers with lightweight bitwise operations and population counts in the core dot-product datapaths [
15]. While these approaches are most commonly demonstrated for convolutional and fully connected layers, they provide evidence that more aggressive multiplier elimination can be feasible on FPGAs for specific model families. For Transformer and LLM inference, achieving end-to-end multiplier elimination is more challenging because the workload includes additional scaling, normalization, and nonlinear operations beyond the main dot products. As a result, we position fully multiplier-free LLM inference as a promising direction for future work rather than an assumption in the present design.
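As an illustration of the bitwise operations and population counts mentioned above, the following is a minimal C++ sketch of a binarized dot product. The function name `binary_dot64`, the 64-element width, and the bit encoding (1 for +1, 0 for −1) are assumptions made for this example and are not taken from the cited works.

```cpp
#include <bitset>
#include <cstdint>

// Dot product of two 64-element vectors with entries in {-1, +1}.
// Encoding (an assumption of this sketch): bit = 1 means +1, bit = 0 means -1.
// Equal bits contribute +1 and differing bits contribute -1, so
// dot = matches - mismatches = 2 * matches - 64.
inline int binary_dot64(std::uint64_t a_bits, std::uint64_t w_bits) {
    const std::uint64_t xnor = ~(a_bits ^ w_bits);  // 1 wherever the signs agree
    const int matches = static_cast<int>(std::bitset<64>(xnor).count());
    return 2 * matches - 64;
}
```

No multiplier appears anywhere in the datapath: the whole inner product reduces to one XNOR, one population count, and a shift-and-subtract.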
Several recent works, such as BitNet [
10], have explored the use of low-bit weights in neural network architectures to achieve improved efficiency in memory usage and performance [
11]. This work builds on that foundation but distinguishes itself by focusing specifically on leveraging ternary weights to optimize for speed and, most critically, power consumption in resource-constrained embedded systems. While BitNet demonstrates the feasibility of maintaining accuracy with low-bit weights, our work translates this concept into a practical hardware implementation that demonstrates substantial power and resource benefits on an embedded FPGA platform, focusing on DSP reduction as a critical metric.
2.3. Hardware Accelerated Transformers on FPGAs
Although significant research has been conducted on the acceleration of transformers on FPGAs [16], existing approaches often rely on specialized techniques that limit their broader applicability. Here, we explore the key differences between our proposed method and these previous works.
Sparse Matrix Multiplication: Frameworks such as Column Balanced Block Pruning [
17] and FTrans [
16] achieve performance improvements by using sparse MatMul techniques with weight pruning. Although effective, these methods require significant modifications to the original transformer model and its training pipeline to induce sparsity, potentially hindering their use with pre-trained, dense architectures. In contrast, our approach maintains dense matrix multiplication, ensuring compatibility with a wider range of existing transformer models without requiring architectural changes or retraining for sparsity.
Function Approximations: Another approach, exemplified by NPE [
18], utilizes piecewise linear approximations for non-linear functions (e.g., softmax and Gaussian Error Linear Unit (GELU)) to achieve speedups. While this method improves performance, it introduces approximation errors that may require model retraining to recover lost accuracy. Our methodology prioritizes model integrity by computing the exact values for these non-linear functions on the host processor, focusing hardware acceleration solely on the MatMul bottleneck. This eliminates the need for hardware-specific model retraining and avoids any potential reduction in accuracy.
2.4. Hardware Accelerated Llama 2 on FPGAs Using High-Level Synthesis
This paper builds upon the recent work of HLSTransform [
19], which introduced a method for accelerating transformer inference on FPGAs using HLS based on a Llama 2 architecture. However, key distinctions exist that highlight our unique contribution to power-efficient embedded inference.
Weight Representation and Hardware Impact: HLSTransform uses the common 8-bit integer (INT8) format for weights [
20]. INT8 is widely regarded as the standard for deploying quantized models on edge devices, balancing accuracy and memory usage. However, this format still requires traditional multiplication operations implemented in DSP blocks, which become a resource bottleneck. In contrast, our method’s core contribution is the use of 2-bit ternary weights. This enables the complete elimination of DSPs for matrix multiplication, replacing them with far simpler and more power-efficient logic. This architectural shift is directly responsible for the dramatic reduction in DSP utilization and the significant decrease in total energy consumption when compared to the standard INT8 equivalent.
Hardware Target and Optimization Focus: HLSTransform targeted a large, data center-class FPGA, focusing primarily on achieving high throughput to be competitive with CPUs and GPUs. Our work specifically targets the AMD-Xilinx ZCU102 (AMD, Inc., Santa Clara, CA, USA) system-on-chip (SoC) [
21], a representative resource-constrained embedded platform. Consequently, our primary focus is on minimizing resource utilization (especially DSPs) and optimizing for total power and energy consumption—metrics that are critical for embedded, battery-powered, or thermally limited applications. By leveraging ternary weights, our proposed method offers a complementary approach to HLSTransform, tailored for high-performance and superior power efficiency in embedded environments rather than raw throughput in a data center.
Many existing projects have already demonstrated the potential of using FPGAs for LLM workloads, with most efforts primarily emphasizing performance rather than energy efficiency within embedded systems. Examples include projects such as GLITCHES [22] and Terafly [23], which focus on maximizing system-level throughput through heterogeneous GPU-FPGA collaboration or multi-node scaling. HLS-Eval [24], conversely, addresses the design process itself by benchmarking the capability of LLMs to generate HLS code. While TerEffic [25] shares our use of ternary quantization, it primarily targets high throughput on larger FPGA fabrics.
In summary, prior work demonstrates both practical FPGA toolflows for creating and deploying quantized AI models and a wide range of accelerator strategies for Transformer and LLM inference. These findings highlight that achieving high efficiency on embedded FPGAs requires co-design across quantization, architecture, and implementation. Motivated by this analysis, the next section presents our proposed methodology for improving LLM inference efficiency on the target embedded platform.
3. Proposed Methodology for Improving Energy Efficiency of LLMs
Developing a power-efficient FPGA-based LLM inference engine requires a systematic approach. Our methodology consists of three main phases: (1) a bottleneck analysis of existing systems running on an FPGA, (2) architectural modifications and hardware-software co-design targeting energy efficiency, and (3) quality-of-results (QoR) optimization through advanced HLS techniques.
3.1. Phase 1: System Bottleneck Analysis
LLMs, particularly transformer architectures like Llama 2, spend the majority of their computational time performing MatMul operations. These operations are fundamental to the attention mechanism and feed-forward networks that form the core of transformer models. In our target 220-million-parameter Llama 2 model, MatMul operations account for over 90% of the total computational workload during inference.
Traditional FPGA implementations of MatMul rely heavily on DSP blocks for multiplication. However, DSPs are among the most power-hungry and area-intensive resources on an FPGA. The number of available DSPs is a primary constraint that limits the degree of parallelism achievable in a design. For example, a mid-range SoC platform (i.e., the AMD-Xilinx ZCU102) contains 2520 DSP slices, a number that can be quickly exhausted by highly parallel MatMul implementations, creating a significant performance bottleneck.
Our preliminary analysis revealed that standard INT8 MatMul implementations on FPGA platforms face two critical limitations:
Resource Exhaustion: Conventional approaches that require one multiplier per operation quickly exceed the available DSP resources, limiting the parallelism needed for high-throughput inference.
Power Inefficiency: DSP-intensive designs consume significant dynamic power, making them less suitable for embedded applications where power budgets are tight.
The next section explores architectural modifications that improve the energy efficiency of this operation.
3.2. Phase 2: Architectural Modifications for Power Efficiency
To address the identified bottlenecks, we adopted a ternary weight quantization approach, restricting all model weights to the set {−1, 0, 1}. This choice is motivated by recent advances in quantized neural networks, particularly BitNet [
10], which demonstrated that ternary weights can maintain competitive accuracy compared to full-precision models.
The quantization process converts model weights from 32-bit floating-point numbers to 8-bit signed integers representing the ternary values. This transformation significantly reduces memory footprint while enabling the elimination of multiplication hardware.
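The text does not state which rounding rule maps floating-point weights to ternary values. As one plausible illustration, the sketch below uses absmean scaling followed by round-and-clip, the scheme described for ternary BitNet variants; the function name `ternarize_absmean`, the epsilon value, and the per-tensor scale are assumptions of this example rather than details of the authors' pipeline.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Absmean ternarization sketch: divide by the mean absolute weight,
// round to the nearest integer, and clip to [-1, 1].
// Returns the ternary codes stored as int8_t; `scale_out` receives the
// scale that maps the codes back to approximate real-valued weights.
inline std::vector<std::int8_t> ternarize_absmean(const std::vector<float>& w,
                                                  float& scale_out) {
    double sum_abs = 0.0;
    for (float v : w) sum_abs += std::fabs(v);
    const double eps = 1e-8;  // avoids division by zero for empty or all-zero input
    const double gamma =
        w.empty() ? eps : sum_abs / static_cast<double>(w.size()) + eps;
    scale_out = static_cast<float>(gamma);

    std::vector<std::int8_t> q;
    q.reserve(w.size());
    for (float v : w) {
        long r = std::lround(v / gamma);
        if (r > 1) r = 1;
        if (r < -1) r = -1;
        q.push_back(static_cast<std::int8_t>(r));
    }
    return q;
}
```

For example, the weights {0.9, −0.05, −1.2, 0.1} have a mean absolute value of 0.5625 and map to the codes {1, 0, −1, 0}.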
The core innovation of our approach lies in completely eliminating multiplication hardware. With ternary weights, the fundamental MatMul operation transforms from a DSP-intensive computation to simple conditional logic.
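Given a weight matrix $W$ and an input matrix $X$, a single element of the output vector $Y$ is calculated as:
$$Y_i = \sum_{j} W_{ij} X_j$$
With ternary weights $W_{ij} \in \{-1, 0, 1\}$, the logic for the inner product term $W_{ij} X_j$ simplifies to:
$$W_{ij} X_j = \begin{cases} X_j, & W_{ij} = 1 \\ 0, & W_{ij} = 0 \\ -X_j, & W_{ij} = -1 \end{cases}$$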
This transformation replaces expensive DSP operations with multiplexers and adders, dramatically reducing both resource requirements and power consumption.
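The select-and-add logic described above can be sketched in plain C++ as follows. This is a software reference under stated assumptions, not the authors' kernel: the names `ternary_term` and `ternary_matvec`, the row-major layout, and the 8-bit activation type are all illustrative choices.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One inner-product term with a ternary weight: select +x, -x, or 0.
// Only a comparison and a negation are needed, no multiplication.
inline std::int32_t ternary_term(std::int8_t w, std::int8_t x) {
    if (w > 0) return x;
    if (w < 0) return -static_cast<std::int32_t>(x);
    return 0;
}

// y = W x for a row-major ternary matrix W with `rows` x `cols` entries.
inline std::vector<std::int32_t> ternary_matvec(const std::vector<std::int8_t>& W,
                                                const std::vector<std::int8_t>& x,
                                                std::size_t rows, std::size_t cols) {
    std::vector<std::int32_t> y(rows, 0);
    for (std::size_t i = 0; i < rows; ++i) {
        std::int32_t acc = 0;
        for (std::size_t j = 0; j < cols; ++j) {
            acc += ternary_term(W[i * cols + j], x[j]);
        }
        y[i] = acc;
    }
    return y;
}
```

With W = [[1, 0, −1], [−1, −1, 1]] and x = [3, 5, −2], the result is y = [5, −10], which matches the ordinary matrix-vector product. In hardware, the `ternary_term` selection corresponds to a multiplexer feeding an adder.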
3.3. Phase 3: Quality-of-Results Optimization Through HLS
HLS pragmas play a crucial role in shaping the performance, resource utilization, and overall quality of hardware designs. Pragmas such as #pragma HLS pipeline, #pragma HLS unroll, and #pragma HLS array_partition directly influence how computations are scheduled, executed, and mapped onto FPGA resources.
Pipelining improves throughput by allowing different stages of computation to operate concurrently on separate data items, effectively increasing the number of operations performed per clock cycle. However, achieving optimal pipelining depends on managing data dependencies and timing constraints: over-aggressive pipelining (e.g., targeting an initiation interval of 1) can lead to timing violations, reducing achievable clock frequency and overall performance.
Loop unrolling, on the other hand, increases parallelism by replicating loop bodies, thereby decreasing loop overhead and allowing multiple iterations to execute simultaneously. While higher unrolling factors can yield performance gains, they also increase resource consumption, and beyond a certain point (such as a factor of 4 in our ZCU102 experiments), they may exhaust available hardware resources without proportional performance benefits.
Memory partitioning pragmas complement these techniques by alleviating bandwidth bottlenecks; partitioning large arrays into smaller, independent blocks enables simultaneous data access by multiple processing elements, a prerequisite for efficient pipelining and unrolling. Overall, HLS pragmas act as fine-grained optimization levers that, when applied judiciously, balance performance, resource usage, and timing closure, ultimately determining the quality of synthesized hardware designs.
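To make the three pragmas concrete, the following is a small sketch of how they might be combined on a ternary accumulation loop. The function name `ternary_dot_hls`, the vector length `N`, the local buffers, and the cyclic partitioning with factor 4 are assumptions chosen for illustration; only the unroll factor of 4 is taken from the text, and the authors' actual kernel may be organized differently.

```cpp
#include <cstdint>

constexpr int N = 256;  // hypothetical vector length for this sketch

// Accumulates N ternary terms. A regular C++ compiler ignores the HLS
// pragmas, so the same source also serves as a software reference.
std::int32_t ternary_dot_hls(const std::int8_t w[N], const std::int8_t x[N]) {
    // Local buffers split into 4 banks so that 4 elements can be read per cycle.
    std::int8_t w_buf[N];
    std::int8_t x_buf[N];
#pragma HLS array_partition variable=w_buf type=cyclic factor=4 dim=1
#pragma HLS array_partition variable=x_buf type=cyclic factor=4 dim=1

    // Copy the inputs into the partitioned buffers, one element per cycle.
    for (int i = 0; i < N; ++i) {
#pragma HLS pipeline II=1
        w_buf[i] = w[i];
        x_buf[i] = x[i];
    }

    // Select-and-add loop: partially unrolled by 4, then pipelined.
    std::int32_t acc = 0;
    for (int i = 0; i < N; ++i) {
#pragma HLS pipeline II=1
#pragma HLS unroll factor=4
        const std::int32_t xi = x_buf[i];
        acc += (w_buf[i] > 0) ? xi : (w_buf[i] < 0 ? -xi : 0);
    }
    return acc;
}
```

The partition factor is matched to the unroll factor on purpose: without four independent memory banks, the four replicated loop bodies would compete for the same memory ports and the requested initiation interval could not be met.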
Standard HLS compilation with default settings is insufficient to achieve optimal performance on FPGAs. We employed a systematic optimization methodology focusing on three key areas: pipelining, parallelization, and memory access optimization.
Our optimization process followed a systematic design space exploration:
1. Baseline establishment: Implement minimal area optimization as performance baseline.
2. Individual optimization assessment: Apply each optimization technique stacking on top of the previous until resource exhaustion.
3. Resource constraint validation: Ensure all designs fit within ZCU102 resource limits.
4. Elimination of unsuccessful paths: If resource exhaustion is reached before all optimizations are explored, the last optimization applied is removed and we continue exploration.
5. Performance-power trade-off evaluation: Select configurations that optimize the performance-per-watt metric.
This methodology, partially inspired by previous work [
26], enabled us to identify the optimal configuration while avoiding resource over-utilization and timing violations that plague aggressive optimization attempts.
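The five exploration steps can be read as a greedy stacking procedure. The sketch below expresses that reading in C++; the types `BuildResult`, `Config`, and `Evaluator`, the function `explore`, and the scoring formula are hypothetical, and in practice the evaluator would stand for a full synthesis, implementation, and measurement run.

```cpp
#include <functional>
#include <string>
#include <vector>

// Outcome of building one candidate configuration (all fields hypothetical).
struct BuildResult {
    bool fits;       // true if the design fits the device and meets timing
    double seconds;  // wall clock time of the kernel
    double watts;    // total power draw
};

using Config = std::vector<std::string>;  // names of the enabled optimizations
using Evaluator = std::function<BuildResult(const Config&)>;

// Greedy stacking: add each optimization on top of those kept so far, drop it
// if the design no longer fits, and remember the configuration with the best
// performance per watt, computed here as 1 / (seconds * watts).
Config explore(const std::vector<std::string>& candidates, const Evaluator& evaluate) {
    Config current;  // step 1: the baseline has no optimizations enabled
    Config best = current;
    const BuildResult base = evaluate(current);
    double best_score = base.fits ? 1.0 / (base.seconds * base.watts) : 0.0;

    for (const std::string& opt : candidates) {
        current.push_back(opt);  // step 2: stack the next optimization
        const BuildResult r = evaluate(current);
        if (!r.fits) {           // steps 3 and 4: over budget, remove and continue
            current.pop_back();
            continue;
        }
        const double score = 1.0 / (r.seconds * r.watts);  // step 5
        if (score > best_score) {
            best_score = score;
            best = current;
        }
    }
    return best;
}
```

Because each optimization is only tried on top of the configuration kept so far, the search cost grows linearly with the number of candidates, at the price of not exploring every combination.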
4. Experimental Results
A comprehensive evaluation of the proposed method follows, detailing the HLS optimization process and comparing the final ternary design against a standard implementation on the AMD-Xilinx ZCU102 platform. Finally, we conclude the experimental section with a performance comparison with a GPU architecture, more specifically an NVIDIA RTX 3090.
Our implementation follows an architectural flow similar to established methodologies like HLSTransform [
19] and the fully quantized transformer acceleration presented in [
27]. The design is partitioned into a C++ host application and a specialized FPGA kernel, bridged by the AMD-Xilinx Runtime (XRT). The host code is compiled via g++ to manage orchestration and token sampling, while the kernel is synthesized using Vitis HLS to generate the hardware binary. This decoupled approach is standard for FPGA-based machine learning (ML) inference, allowing the host to handle high-level logic while the programmable logic executes the optimized ternary MatMul operations. Detailed build scripts and step-by-step instructions to reproduce the solution and verify the reported resource utilization metrics are provided in the associated repository.
4.1. Experimental Setup
All experiments were conducted on an AMD-Xilinx ZCU102 evaluation board, which features a Zynq UltraScale+ multiprocessor system-on-chip (MPSoC). This platform was chosen as it represents a common target for embedded vision and AI applications, providing a realistic testbed for evaluating power and resource constraints. The software portion of the application runs on the board’s ARM processor, leveraging the XRT for communication with the FPGA kernel. The designs were synthesized and implemented using Vitis HLS, XRT, and Vivado version 2024.2. All FPGA designs reported in this work achieved a consistent maximum clock frequency of 192.4 MHz after implementation, and all performance and energy results are reported at this achieved clock. The non-ternary baseline, which exceeds device capacity in DSP consumption, is reported based on post-synthesis estimates at the same frequency.
Our FPGA design employs a host-kernel architecture optimized for the ZCU102 platform:
Host Application: Runs on the ARM Cortex-A53 processor, managing data flow, token sampling, and kernel orchestration.
FPGA Kernel: Implements the optimized ternary MatMul engine using custom HLS code synthesized to programmable logic.
Communication Interface: Utilizes XRT and direct memory access (DMA) for efficient data transfer between host and kernel.
The host application sends input parameters (current token and position) to the FPGA via DMA. The FPGA executes the forward pass computations and writes results to a shared buffer, which the host retrieves to sample the next token in the sequence.
The LLM used for testing is a 220-million-parameter Llama 2-based architecture.
To strictly validate the experimental results, we implemented a deterministic verification protocol. During the testing phase, we temporarily fixed the seed of the random number generator (rng_seed) while keeping all other hyperparameters constant. This ensured a consistent execution path, allowing us to verify that the FPGA kernel produced identical outputs to the reference software implementation for every token generated. This step confirmed that the architectural optimizations provided the reported performance gains without compromising the functional correctness of the model. Since the model uses integer quantization with exact arithmetic (no floating-point operations), the results are deterministic and reproducible across different implementations. The source code for the final implementation is available as indicated in the Data Availability Statement.
4.2. HLS Optimization Progression
We systematically applied HLS optimizations and recorded the impact on performance (wall clock time) and resource utilization.
Table 1 summarizes this design space exploration. The baseline “Min Area Optimization”, later indicated as V1, represents a non-pipelined, sequential implementation. Subsequent steps introduce pipelining (V2), loop unrolling (V3), and array partitioning (V4).
The most significant performance leap came from combining pipelining with array partitioning, which reduced the wall clock time from 0.39 s to just 0.033 s, a speedup of 11.8×. This highlights the critical importance of alleviating memory bottlenecks to unlock the potential of pipelined architectures. An unrolling factor of 4 provided a good balance before hitting the resource limits of the ZCU102. Furthermore, we also evaluate the latency-area product (LAP), which shows how our results converge to the best compromise between performance and resource utilization.
4.3. Comparison with Non-Ternary MatMul
To validate the practical impact of our proposed architecture, it is essential to compare it against the prevailing industry standard for edge inference. Currently, INT8 quantization is the dominant format utilized in production mobile inference frameworks such as TensorFlow Lite, PyTorch Quantization, and ONNX Runtime, which support a wide variety of hardware architectures [28]. Consequently, we compared our final ternary MatMul implementation against a conventional non-ternary version that uses standard 8-bit integer multiplication.
It is important to note that this “non-ternary” baseline represents a realistic, optimized HLS design utilizing standard DSP slices for 8-bit multiplication, reflecting the typical resource cost of running quantized LLMs on current FPGA hardware without the proposed ternary architectural changes. This direct comparison isolates the efficiency gains derived specifically from the shift to multiplier-minimized ternary logic in the dominant MatMul kernels.
Table 2 presents the results for performance, resource utilization, and power consumption. The subsequent figures visually summarize these findings;
Figure 1 illustrates the dramatic reduction in wall clock time, achieved while keeping hardware resource utilization very limited.
Figure 2 highlights the corresponding improvements in power and energy efficiency.
The results clearly demonstrate the significant advantages of the ternary approach. By minimizing general-purpose multiplications in the dominant MatMul kernels, the design sees a 96.1% reduction in DSP block usage. In fact, the non-ternary implementation exceeded the available DSPs on the ZCU102, making it infeasible without further modification, whereas the ternary version uses only 4%. This DSP reduction also leads to substantial savings in general logic resources (flip-flops (FFs) and look-up tables (LUTs)) and a significant decrease in power consumption.
The ternary implementation is not only more resource-efficient but also 23.3% faster. This speedup, combined with a 37.3% reduction in total power draw, results in a 51.9% reduction in the total energy required to perform the matrix multiplication. This makes the ternary approach exceptionally well-suited for power-sensitive embedded applications.
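The reported energy saving can be cross-checked from the other two figures. Assuming that "23.3% faster" denotes a 23.3% shorter wall clock time $t$ and that energy is the product of average total power $P$ and time, $E = P \cdot t$, the ratio between the ternary and INT8 designs is:

$$\frac{E_{\text{ternary}}}{E_{\text{INT8}}} = \frac{P_{\text{ternary}}}{P_{\text{INT8}}} \cdot \frac{t_{\text{ternary}}}{t_{\text{INT8}}} = (1 - 0.373)(1 - 0.233) = 0.627 \times 0.767 \approx 0.481$$

This corresponds to an energy reduction of about $1 - 0.481 = 51.9\%$, consistent with the stated value.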
4.4. Comparison with GPU Architectures
While our evaluation focuses on the benefits of the ternary approach on an FPGA, it is useful to contextualize this performance against other common hardware accelerators like GPUs. A direct comparison of raw throughput (e.g., tokens per second) can be misleading due to fundamental architectural differences.
High-end GPUs, such as an NVIDIA RTX 3090, possess large amounts of high-bandwidth memory (e.g., 24 GB of video random-access memory (VRAM)) and operate at very high core clock frequencies (over 1.4 GHz). In contrast, the ZCU102 FPGA has significantly less on-chip memory (under 40 MB of block RAM (BRAM)), and its operating clock frequency is determined by timing closure after implementation, achieving 192.4 MHz in our design. Consequently, a GPU will typically achieve higher raw inference speed.
However, the primary advantage of the FPGA architecture, particularly with our ternary approach, lies in power efficiency. Unlike a GPU’s fixed, general-purpose architecture, an FPGA can be configured to create a bespoke circuit perfectly tailored to the specific computation. The core advantage of our method is the ability to minimize the use of power-hungry DSP blocks for multiplication in the dominant MatMul kernels, replacing many multiplications with simple, efficient logic. A GPU cannot be fundamentally re-architected in this way and must rely on its existing, more complex arithmetic logic units. This architectural specialization is the key to the FPGA’s superior performance-per-watt.
For measuring power, the XRT [29] tool was used for the FPGA, and HWINFO [30] was used for the GPU. A high-performance GPU under a similar inferencing load consumes 130 W [19]. By contrast, our optimized ternary design on the ZCU102 has a total power draw of only 4.831 W, roughly 27 times lower. HLSTransform previously validated the superior energy efficiency of FPGAs over GPUs for this workload, and our ternary optimization further amplifies this advantage. This difference of more than an order of magnitude in power consumption is the critical advantage for embedded and edge applications. The FPGA’s ability to be customized to the exact computation allows it to achieve a much higher performance-per-watt, making it a superior choice when energy consumption, not raw speed, is the primary constraint.
5. Discussion
Experimental results are interpreted here in the context of prior FPGA acceleration and quantization research, with a focus on their implications for energy-constrained embedded inference. The results demonstrate that combining ternary weights with quantization-aware training and HLS-driven architectural optimization yields an efficient FPGA implementation that substantially reduces DSP utilization and energy consumption on the target platform. The discussion below motivates a set of next steps for further efficiency improvement.
5.1. Contextualization with Prior FPGA Acceleration Work
Prior work has demonstrated that FPGAs can accelerate Transformer and natural language processing (NLP) workloads through a range of strategies, including efficient linear algebra implementations and pruning to reduce compute and memory traffic (e.g., FTrans [
16] and column-balanced block pruning [
17]). Our results are consistent with the broader observation that efficiency for sequence models is often bounded by memory movement and dataflow choices rather than peak arithmetic throughput alone.
While the direct difference in clock frequencies (192.4 MHz versus 214 MHz) suggests that cross-paper comparisons should be treated as qualitative, our INT8 baseline remains highly relevant. In practice, it achieves latency and resource utilization for a full forward pass that are remarkably close to those reported in independent state-of-the-art implementations [
19,
20], establishing a competitive foundation for our subsequent optimizations.
5.2. Methods of Creating and Deploying AI on FPGAs
A recurring theme in FPGA-based AI research is the development of reusable toolflows that translate trained models into efficient hardware realizations. Frameworks such as FINN and FINN-R provide end-to-end compilation and design-space exploration for highly quantized (notably binary) neural networks on FPGAs, emphasizing logic-based compute and streamlined datapaths [12,13]. Similarly, HLS-centered flows such as hls4ml aim to improve developer productivity and enable systematic exploration of throughput/resource trade-offs for FPGA inference accelerators [14]. These works motivate the importance of bridging algorithmic choices (quantization, sparsity, and operator selection) with hardware-aware implementation decisions, which aligns with the optimization methodology used in this paper.
5.3. Binary vs. Ternary Quantization on FPGAs
Binary neural networks provide a well-known precedent for reducing multiplications in the dominant dot-product computations of convolutional and fully connected layers. In these settings, multiplications can often be replaced with bitwise operations and addition-heavy logic in the main compute datapaths, and some approaches further map operators into LUT-based structures [15]. As an example of the underlying binary arithmetic primitive, exclusive-NOR (XNOR)-Net demonstrates that binarized convolution can be expressed using XNOR and population-count style operations as the core compute mechanism [31]. Ternary quantization targets a related objective while providing additional representational capacity relative to purely binary weights. Prior work on ternary quantization shows that ternary weights can preserve accuracy more effectively than purely binary weights in many settings, while still substantially simplifying the dominant arithmetic [32,33]. In this work, ternary weights are used to substantially reduce multiplier usage in the dominant MatMul kernels while maintaining practical accuracy through quantization-aware training. A controlled, end-to-end hardware comparison between matched binary and ternary pipelines on the same platform, under identical accuracy and latency constraints, remains an important direction for future work.
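The two arithmetic primitives discussed above can be contrasted in a short Python sketch. It is illustrative only and is not the HLS implementation evaluated in this paper: the function names (`pack_signs`, `binary_dot_xnor`, `ternary_dot`) and the bit encoding (bit 1 for +1, bit 0 for -1) are assumptions chosen for the example. The binary case uses XNOR followed by a population count; the ternary case uses only additions and subtractions, with zero weights skipped.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an integer, bit i = 1 when values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("binary values must be +1 or -1")
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot_xnor(a_bits, w_bits, n):
    """Dot product of two length-n {-1, +1} vectors given in packed form."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask      # XNOR: 1 where the signs match
    matches = bin(agree).count("1")        # population count
    return 2 * matches - n                 # matches minus mismatches


def ternary_dot(activations, weights):
    """Dot product with weights in {-1, 0, +1}, using no multiplications."""
    acc = 0
    for x, w in zip(activations, weights):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        elif w != 0:
            raise ValueError("ternary weights must be -1, 0, or +1")
        # w == 0 contributes nothing and can be skipped entirely
    return acc


a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
print(binary_dot_xnor(pack_signs(a), pack_signs(w), len(a)))
print(ternary_dot([3, -2, 5, 7], [1, 0, -1, 1]))
```

In the binary case both operands must be constrained to two values, whereas the ternary routine accepts arbitrary integer activations, which is one way to see the additional representational capacity noted above.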
5.4. Transformer-Specific Opportunities for Further Multiplier Minimization
In Transformer and LLM inference, the arithmetic cost is not limited to the core dot products alone. Additional operations, including scaling, normalization, and nonlinear components within the attention and feed-forward blocks, can introduce residual multiplications and other expensive arithmetic even when the main MatMul datapaths are simplified. As a result, further multiplier minimization for LLM inference is likely to require Transformer-specific co-design that goes beyond weight quantization, such as reformulating scaling and normalization steps, exploring alternative attention formulations, and integrating hardware-aware approximations where accuracy permits [16,18,19]. This perspective is consistent with prior FPGA work on extremely low precision inference, where dot-product computation can be implemented using logic-centric primitives (e.g., bitwise operations and population count), but end-to-end efficiency still depends on how the full operator set is mapped and orchestrated [15].
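One simple instance of reformulating a scaling step is sketched below in Python. It is a hypothetical illustration, not a technique evaluated in this paper: it assumes integer attention scores and a head dimension that is a power of four, so that the usual division by the square root of the head dimension reduces to an arithmetic right shift. The function name `scale_scores_shift` is an assumption made for the example, and the remaining non-MatMul operations (softmax, normalization) are not addressed.

```python
import math


def scale_scores_shift(scores, head_dim):
    """Scale integer attention scores by 1/sqrt(head_dim) with a right shift.

    Only valid when head_dim is a power of four, so that sqrt(head_dim)
    is a power of two. The result equals floor division by sqrt(head_dim).
    """
    if head_dim <= 0:
        raise ValueError("head_dim must be positive")
    root = math.isqrt(head_dim)
    if root * root != head_dim or (root & (root - 1)) != 0:
        raise ValueError("head_dim must be a power of four")
    shift = root.bit_length() - 1            # log2(sqrt(head_dim))
    return [s >> shift for s in scores]      # arithmetic shift, rounds toward -inf


# Example with an assumed head dimension of 64, so sqrt(64) = 8 and shift = 3.
print(scale_scores_shift([64, -64, 100, -100], 64))
```

Because the shift rounds toward negative infinity, its output can differ by one unit from a rounded floating-point division for some inputs; whether that is acceptable is exactly the kind of accuracy question that such approximations would need to answer.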
5.5. Limitations and Future Work
While the proposed design achieves substantial reductions in DSP utilization and energy consumption, several limitations remain. First, the end-to-end inference pipeline is not strictly multiplier-free: even when the dominant MatMul datapaths are simplified, residual multiplications can still arise from scaling, normalization, and other non-MatMul components in the Transformer block. This motivates future work on Transformer-specific co-design that targets these remaining operations, for example through alternative formulations of scaling and normalization, and through carefully controlled approximations where accuracy permits.
Beyond these architectural developments, the provided open-source repository and deterministic verification protocol establish a reproducible foundation for ternary LLM inference on FPGAs. Because this work focuses on a single embedded design point, future studies could benchmark a broader suite of model configurations and sequence lengths to map the efficiency trends of ternary logic more fully. The complete design flow and build scripts are available to support these efforts, offering a consistent baseline for future empirical comparisons between competing quantization formats, such as binary and ternary pipelines, on a common hardware platform.
6. Conclusions
The findings of this research highlight a critical path toward reconciling the computational demands of state-of-the-art AI with the strict energy constraints of embedded and edge environments. As AI becomes increasingly pervasive, the transition from power-intensive, general-purpose hardware to specialized, bit-reconfigurable logic represents a vital shift in achieving sustainable large-scale deployment. By demonstrating that the massive energy footprint of LLMs is not an inherent trait of the models themselves, but rather a byproduct of the hardware used to run them, this work provides a framework for future energy-efficient AI systems.
This work demonstrates the design and implementation of a high-efficiency LLM inference engine on an AMD-Xilinx ZCU102 SoC using HLS and ternary weights. By replacing computationally expensive multiplications with simple logic, our proposed architecture achieves substantial improvements in performance, resource utilization, and power efficiency.
Our systematic HLS optimization process revealed that a combination of pipelining and cyclic array partitioning is critical for maximizing throughput. Compared with a conventional eight-bit integer (INT8) implementation, our final ternary design is 23% faster, uses 96% fewer DSP blocks and 63% fewer LUTs, and consumes 52% less energy per operation. These results validate the ternary MatMul approach as a highly effective strategy for deploying LLMs on resource- and power-constrained embedded platforms. Future work will explore applying these principles to the design of even more efficient application-specific integrated circuit (ASIC) accelerators.