Article

ROM-Less Co(Sine) Synthesizer

by Florentina-Giulia Stoica, Alex Calinescu and Marius Enachescu *
Department of Electronic Devices, Circuits and Architectures, Faculty of Electronics, Telecommunications and Information Technology of University Politehnica of Bucharest, 061071 Bucharest, Romania
* Author to whom correspondence should be addressed.
Electronics 2026, 15(5), 1093; https://doi.org/10.3390/electronics15051093
Submission received: 1 February 2026 / Revised: 1 March 2026 / Accepted: 3 March 2026 / Published: 5 March 2026
(This article belongs to the Section Circuit and Signal Processing)

Abstract

Sine and cosine wave synthesis is utilized for generating sinusoidal-like values in the digital domain. While this task is commonly handled through software, dedicated hardware like Direct Digital Synthesis (DDS) is also available. However, both methods rely on memory resources, such as look-up tables and Read-Only Memories (ROMs), which face latency limitations related to additional memory access times on top of additional silicon area. With the advent of real-time arithmetic for sine wave approximation, this paper presents a digital module that employs iterative multiply-accumulate (MAC) operations for sine and cosine synthesis. To support the integration of this module into Systems-on-Chip (SoCs), Field-Programmable Gate Arrays (FPGAs), and standalone Application-Specific Integrated Circuits (ASICs), a comprehensive figure of merit (FoM) comparison against various ROM-less methods is provided. When implemented on a Xilinx (AMD) XC7A100T-3CSG324 FPGA, the proposed architecture, compared to other ROM-less solutions such as the Taylor approximation, achieves 80.80% lower resource utilization, 80.89% reduced propagation delay, and 36.66% higher accuracy in sine and cosine wave approximation, with both operating as 32-bit systems producing one sample per clock cycle. Furthermore, the proposed sine accelerator, its accompanying control and communication IPs, and custom firmware were deployed on an FPGA-based function generator platform and experimentally validated.

1. Introduction

Sinusoidal waves can be found in applications such as touch screen controllers [1], electromagnetic interference (EMI) cancellation in power electronics [2], software-defined radio, inverters, radar systems, 3D graphics acceleration, instrumentation and signal processing [3,4]. A common application of sinusoidal wave generators is to provide excitation bursts in the kHz and MHz range to perform sensor characterization [5]. Other systems employ sinusoidal waves in motor control applications with Hall sensors [6].
Sine wave generation is a task that most basic applications are solving with software implementations of lookup tables in the memory of a microcontroller which feed data into digital-to-analog converters (DACs) or Pulse-Width Modulation (PWM) drivers through Direct Memory Access (DMA) peripherals [7]. The synthesis of sinusoidal functions can also be performed through hardware systems, which can be either fully analog [8] or digital-based circuits [1] to generate the desired sinusoidal waveform.
Commonly used software-based solutions, along with digital circuits based on the Direct Digital Synthesis (DDS) technique, require either lookup tables (LUTs) or Read-Only Memories (ROMs) [9]. The additional data storage circuitry significantly affects function generation latency due to the slow nature of memory access operations and incurs costs associated with the high silicon area required when handling large volumes of data. The data volume problem generated by LUTs and ROMs stems from two limitations—memory depth and memory width. First, achieving a high-fidelity signal requires storing more sample points in the memory, which increases the depth of the memory. For instance, 16,384 values provide better time resolution than 1024, but the required address bus width is 14 bits instead of 10. Secondly, improving the precision of each sample involves increasing the number of bits of the data stored in the memory, which expands the width of the memory. A 16-bit value yields a quantization error of 30.517 ppm, while a 24-bit value offers a quantization error below 0.12 ppm.
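The quantization figures quoted above can be reproduced with a short sketch; it assumes a signed representation (one sign bit), so the worst-case error is one part in 2^(N−1) of full scale, which matches the 30.517 ppm figure for 16 bits.

```python
# Quantization error of an N-bit signed sample, in ppm of full scale.
# Assumption: one sign bit leaves (bits - 1) magnitude bits, so the
# worst-case error is one part in 2**(bits - 1).
def quantization_error_ppm(bits: int) -> float:
    return 1e6 / (2 ** (bits - 1))

print(quantization_error_ppm(16))  # ~30.52 ppm
print(quantization_error_ppm(24))  # ~0.12 ppm
```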
To address the area and latency limitations of memory-based approaches, ROM-less digital circuits that utilize real-time computation through the implementation of various algorithms have emerged [10]. Recent literature extensively explores hardware-optimized algorithms such as CORDIC [4,11,12], Taylor series expansion [4,13], Bhaskara approximation [14], and parabolic synthesis [15]. While modern CORDIC implementations have seen advancements through heavily pipelined or scale-free architectures to improve throughput, they still suffer from variable latency or high logic resource overhead at higher bit-widths. Conversely, polynomial approximations avoid complex division but still demand multiple dedicated multiplication and addition blocks, incurring significant overhead in silicon area and propagation delay when high precision (e.g., 32 bits) is required.
This inherent trade-off motivates the need for an alternative real-time computation approach that minimizes design complexity without degrading approximation accuracy. To this end, we hypothesize that a generic sixth-degree polynomial approximation, evaluated iteratively using Horner’s scheme, can achieve precision superior or equivalent to standard methods while drastically reducing the required logic resources. This work extends [16] and presents a new real-time computation approach that reuses the same multiplier and adder in an iterative manner (thus achieving area reduction) in order to generate a high-precision sine wave approximation. Moreover, the accelerator was integrated into an FPGA-based function generation system, featuring additional digital IPs for communication (UART) and overall system control, i.e., an FSM and register access for configuration and accelerator output. Furthermore, a custom firmware framework was developed to manage the FPGA board, drive the 50 MSa/s DAC PCB, and interface with an oscilloscope for experimental validation. To assess the performance of the proposed work regarding area, latency, and accuracy, an eADP (error-area-delay product) figure of merit which integrates all three metrics was used. Compared to the best of the referenced implementations, the Taylor approximation, this work shows an improvement of 80.80% in area, 80.89% in propagation delay, and 36.66% in accuracy for a 32-bit sine and cosine wave approximation design.
Section 2 presents the numerical and mathematical background with respect to the approximation of the sine waves in hardware. Section 3 presents the architecture of the proposed accelerator. Section 4 presents the FPGA-based sine generator in a real-world application scenario. Section 5 presents the accelerator’s performance in terms of area, speed and precision, along with a comparison against other hardware implementations. To evaluate the accelerator’s performance inside the FPGA-based sine generator, real-world measurements are also presented. Finally, the paper ends with concluding remarks in Section 6.

2. Background

This section presents methods for synthesizing sine and cosine waveforms. It also analyzes the implications of the approximation scheme employed in the proposed hardware solution, focusing on reducing the complexity of binary multiplication circuitry and evaluating the precision of the approximation methods.

2.1. Digital Synthesis of Sinusoidal Waves

In applications that require ROM-less hardware acceleration for sine wave generation [1,2,3,4,5,6], various methods for synthesizing sinusoidal waves exist, such as Taylor series expansion [4,13], Bhaskara approximation [14], and parabolic synthesis [15]. These approaches use dedicated arithmetic circuits such as multipliers and adders, or algorithms such as CORDIC [4,12]. However, each method is limited by its latency and by the physical resources it requires, such as LUTs in FPGAs or die area in ASICs.
Bhaskara approximation relies on dividing two polynomials [14]. The division circuitry is known to be complex, hence requiring significant logic resources, while the two polynomials are constructed using multipliers and adders.
Taylor series and parabolic synthesis approximate the sine wave through multiplication, addition, and subtraction [4,10,13,15], thus avoiding division and becoming more efficient in terms of area and speed than Bhaskara. The Taylor series implementation can be scaled for precision by increasing the number of coefficients, or can employ pipelining to enhance throughput [4].
Parabolic synthesis seeks methods to achieve precise approximations across various sub-intervals. By exploiting waveform symmetry, this approach identifies a set of polynomials that accurately describe sub-intervals of the sine or cosine functions between 0 and π/2. Within this class of numerical approaches to approximating sine and cosine functions, and with the advent of hardware implementations of A.I. accelerators, spline interpolation has emerged as a common method of generating a set of polynomials over a fixed group of sub-intervals to approximate certain mathematical functions. Approaches such as [17] use spline interpolation to obtain the coefficients of a polynomial that approximates the desired function, such as sine or cosine, over a given interval of values.
However, all three methods, i.e., Taylor series, parabolic synthesis, and Bhaskara approximation, demand multiple blocks with dedicated multiplication circuitry, hence requiring significant hardware resources.

2.2. Polynomial Approximation of Cosine Using Horner’s Polynomial Scheme

When looking for a different solution outside of parabolic synthesis, Bhaskara or Taylor series for sinusoidal function approximations, one can take the approach of polynomial approximations through the Horner scheme, which is described in a generic form in Equation (1). By repeatedly factoring the polynomial until we reach the highest power, an iteratively-computable structure can be observed. This structure is particularly well-suited for software or resource-constrained hardware implementations.
P_N(x) = \sum_{k=0}^{N-1} c_k \cdot x^k = c_0 + c_1 \cdot x + c_2 \cdot x^2 + \cdots + c_{N-1} \cdot x^{N-1} = c_0 + x \bigl( c_1 + x \bigl( c_2 + x \bigl( c_3 + x ( c_4 + \cdots ) \bigr) \bigr) \bigr) \quad (1)
The Taylor series approach can be implemented using Equation (2), following the same method as in Equation (1). Horner’s polynomial scheme for the cosine function utilizes only even powers of x, unlike the sine function, which requires both x and x^2. To minimize the average error across the quadrant, a sixth-degree polynomial approximation is implemented using Horner’s method with the coefficients from Table 1, which were extracted using the MiniMax Polynomial Approximation from [18]. Table 1 also features the values used to represent these numbers in signed fixed-point format, with 5 integer bits and different lengths of fractional bits, in order to match 16-, 24-, or 32-bit data bus widths. If a sine wave is desired, one may simply adjust the phase of the data, since the cosine wave leads the sine wave by π/2.
f(x) = a_0 + x^2 \bigl( a_1 + x^2 ( a_2 + a_3 \cdot x^2 ) \bigr) \quad (2)
The function defined by Equations (1) and (2) is implemented iteratively using Algorithm 1. The algorithm first pre-calculates and stores the value of x^2. The computation is initialized with the highest-order coefficient, a_N. Each subsequent step multiplies the result of the previous iteration by x^2 and adds the next coefficient, continuing until all coefficients have been processed.
Algorithm 1 Iterative Horner approximation algorithm
Require: x ∈ [0, π/2], a_i with i ∈ {0, 1, 2, …, N}
Ensure: x^2 ← x · x (pre-computed)
c ← a_N
i ← N − 1
while i ≥ 0 do
    c ← a_i + x^2 · c
    i ← i − 1
end while
return c
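As a behavioral sketch of Algorithm 1 (not the RTL): since the MiniMax coefficients of Table 1 are not reproduced in the text, the Taylor coefficients of a sixth-degree cosine expansion stand in here as placeholders.

```python
import math

# Iterative Horner evaluation of a sixth-degree cosine approximation
# (Algorithm 1), one multiply-accumulate per iteration. Placeholder
# Taylor coefficients; the paper uses MiniMax coefficients (Table 1).
COEFFS = [1.0, -1.0 / 2, 1.0 / 24, -1.0 / 720]  # a_0 .. a_3, powers of x^2

def horner_cos(x: float) -> float:
    """Approximate cos(x) for x in [0, pi/2]."""
    x2 = x * x                 # pre-computed once, as in the hardware
    c = COEFFS[-1]             # start from the highest-order coefficient a_N
    for a in reversed(COEFFS[:-1]):
        c = a + x2 * c         # single MAC operation per step
    return c

for x in (0.0, math.pi / 6, math.pi / 4, math.pi / 3):
    print(f"{x:.4f}: {horner_cos(x):.6f} vs {math.cos(x):.6f}")
```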
Figure 1 compares the approximation errors of the Bhaskara (red), 7th-order Taylor series (blue), spline interpolation over four sub-intervals (magenta), and the proposed MiniMax Horner polynomial (green) methods for generating a sinusoidal wave from 0 to π/2, using 10^4 samples. The traces reveal that while the Taylor series is more accurate in the lower half of the quadrant, the proposed Horner-based method provides a more evenly distributed error. Given the logarithmic Y scale, the Horner-based method’s uniform error distribution results in a lower average absolute error across the entire quadrant, confirming the numerical analysis presented in [18]. Furthermore, the spline interpolation method shows its limitations near 0 and π/2, where its error is significantly larger than that of the proposed method, even though it achieved the best precision at critical points such as π/6, π/4, and π/3.
Using the results plotted in Figure 1, the RMS error, maximum absolute error, and average absolute error were extracted and are displayed in Table 2. The quantitative results confirm that the proposed approximation consistently achieves lower errors than the other methods used in the literature. Specifically, the proposed MiniMax coefficients for the Horner scheme yield an RMS error of 1.48 × 10^−5 and a maximum absolute error of 2.97 × 10^−5. When compared to the next best performing algorithm, the Taylor series, our approach reduces the maximum absolute error by a factor of approximately 5.5 (from 16.52 × 10^−5 down to 2.97 × 10^−5) and improves the RMS error by over 61% (from 3.81 × 10^−5 to 1.48 × 10^−5). Furthermore, it vastly outperforms both the Spline and Bhaskara approximations across all three metrics, demonstrating numerical stability across the evaluated domain.

2.3. Symmetry Around π 2 and Quadrant Orientation

The proposed accelerator is based on an algorithm that is fundamentally valid for angles within the first quadrant, [0, π/2]. This design choice is justified by the inherent periodicity and symmetry of sinusoidal functions, as shown in Equation (3). By mapping any given input angle to its equivalent in the first quadrant and applying the appropriate sign, the accelerator can compute the function over its entire domain. Although the core computation approximates the sine function, the system produces the cosine by incorporating the necessary π/2 phase shift.
\sin(x) = \begin{cases} \sin(x), & 0 \le x < \frac{\pi}{2} \\ \sin(\pi - x), & \frac{\pi}{2} \le x < \pi \\ -\sin(x - \pi), & \pi \le x < \frac{3\pi}{2} \\ -\sin(2\pi - x), & \frac{3\pi}{2} \le x < 2\pi \end{cases} \quad (3)
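A minimal model of this quadrant folding, returning the first-quadrant angle and the sign to apply at the output (a behavioral sketch, not the accelerator’s RTL):

```python
import math

# Quadrant folding per Equation (3): map any angle in [0, 2*pi) to the
# first quadrant and return the sign to apply to the synthesized sample.
def fold_to_first_quadrant(x: float) -> tuple[float, int]:
    x = math.fmod(x, 2 * math.pi)
    if x < math.pi / 2:
        return x, 1                     # first quadrant: unchanged
    if x < math.pi:
        return math.pi - x, 1           # second quadrant: mirror
    if x < 3 * math.pi / 2:
        return x - math.pi, -1          # third quadrant: mirror, negate
    return 2 * math.pi - x, -1          # fourth quadrant: mirror, negate

angle, sign = fold_to_first_quadrant(4.0)   # 4.0 rad lies in quadrant III
print(angle, sign, sign * math.sin(angle), math.sin(4.0))
```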

2.4. Number Format and Precision

The overall precision of the hardware accelerator is constrained by both the approximation method and the fixed-point numerical representation. As with methods such as Taylor, Bhaskara, or CORDIC, the width of the data bus and the allocation of fractional bits directly influence the Horner-based method’s accuracy, owing to truncation errors in hardware-level arithmetic such as addition, subtraction, and multiplication. The proposed cosine synthesizer addresses this by featuring scalable precision, with user-configurable parameters for data width and fractional bits that enable a direct trade-off between resource utilization and precision.
The fixed-point representation uses integers for arithmetic operations while interpreting them as fractions. The position of the fractional point determines the power of 2 represented by the fractional bits, as shown in Equation (4).
Q_{n,m} = \sum_{k=0}^{n+m-1} b_k \cdot 2^{k-m} \quad (4)
The fixed-point representation offers flexibility in selecting the number of fractional bits (m) and integer bits (n) within a fixed data bus size. Figure 2 illustrates the precision improvements achieved by increasing the number of fractional bits for the implementation described in this work and also shows that after 14 bits there is no further precision improvement.
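The Q(n, m) interpretation of Equation (4) can be sketched as follows; the saturation behavior on overflow is an assumption for illustration, not a documented property of the design.

```python
import math

# Signed fixed-point Q(n, m) helpers mirroring Equation (4): n integer
# bits, m fractional bits, raw integer interpreted as value * 2**(-m).
def to_fixed(value: float, n: int, m: int) -> int:
    raw = round(value * (1 << m))
    lo, hi = -(1 << (n + m - 1)), (1 << (n + m - 1)) - 1
    return max(lo, min(hi, raw))        # saturate (assumed behavior)

def from_fixed(raw: int, m: int) -> float:
    return raw / (1 << m)

# Example: pi/4 in Q(5, 11), i.e. a 16-bit word with 5 integer bits
# as in Table 1.
raw = to_fixed(math.pi / 4, 5, 11)
print(raw, from_fixed(raw, 11))
```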

3. Proposed Cosine Synthesizer Based on Horner Polynomial Scheme

Horner’s polynomial approximation not only delivers optimal accuracy with respect to the average approximation error but also enables the iterative use of a single multiply-accumulate (MAC) circuit, resulting in a latency of only four clock cycles for computing a cosine value [16]. Higher orders enhance precision but incur extra latency from the additional MAC operations needed for each new even power and coefficient. Thus, weighing the accuracy/latency trade-off, for this work, a sixth-degree polynomial approximation was selected. Furthermore, this section describes the design and operation of the proposed cosine wave synthesizer.

3.1. The Cosine Synthesizer Core

The proposed design utilizes a single MAC unit with multiplexed inputs to iteratively generate a new cos(x) approximation each clock cycle, managed by a finite state machine (FSM). In the first clock cycle, the x^2 value is calculated, followed by three MAC operations that leverage the result from the preceding cycle. Figure 3 depicts the internal structure of the cosine synthesizer, excluding the FSM, clock, enable, and reset signals. Initially, the multiplier computes x^2. The second step involves multiplying x^2 by the a_3 coefficient and adding a_2. The third and fourth (final) steps consist of multiplying the previous iteration’s result by x^2 and adding the respective coefficient. During the final cycle, the VALID flag is asserted, signaling downstream logic to capture the synthesizer’s output.

3.2. Interleaving Cosine Synthesizer Cores

If the system integrating the synthesizer requires a latency of only one clock cycle, four instances of the proposed design can operate in parallel with interleaved execution and multiplexed outputs, as illustrated in Figure 4. The signal behavior of these four parallel cores is shown in Figure 5.
Each clock cycle, one core outputs a cosine value determined by the input angle. With a latency of four clock cycles per core, the outputs of the four cores are multiplexed to ensure that only one core has a valid output at any given time. Consequently, the initial cosine value is produced after four clock cycles, followed by a new value generated at each subsequent cycle.
Using the aforementioned implementation, the quad-core system streams a new value after each cycle, improving the throughput from 4 cycles/sample to 1 cycle/sample. Input demultiplexing and output multiplexing do not incur significant latency penalties. The main limitation on the maximum clock speed is the multiplier architecture, which is a fixed constraint imposed by the DSP48E1 slices in the FPGA chip.
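A behavioral model (Python, not RTL) illustrates the throughput claim: with a four-cycle core latency and round-robin dispatch, one valid sample emerges per cycle after the initial fill.

```python
import math

CORE_LATENCY = 4  # cycles a single core needs per sample

def interleaved_stream(angles):
    """Round-robin dispatch to four cores; the sample dispatched four
    cycles ago becomes valid on the current cycle."""
    in_flight = {}      # dispatch cycle -> result waiting to become valid
    outputs = []
    for cycle, x in enumerate(angles):
        in_flight[cycle] = math.cos(x)   # core (cycle % 4) accepts an angle
        done = cycle - CORE_LATENCY      # dispatched 4 cycles earlier
        if done in in_flight:
            outputs.append(in_flight.pop(done))
    return outputs

vals = interleaved_stream([i * 0.1 for i in range(10)])
print(len(vals))  # 6: first output on cycle 4, then one per cycle
```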

4. Validation Environment

For experimental validation, the proposed accelerator core was integrated into a complete FPGA-based function generation platform, as depicted in Figure 6. The system is controlled by a host computer running custom scripts that issue commands to the FPGA via a serial interface. The accelerator generates digital samples, which are converted to an analog signal by a high-speed DAC on a custom PCB. The resulting analog waveforms are then captured and validated using an oscilloscope.
To enable a versatile and intuitive validation process, a custom communication protocol was developed to manage the interaction between a host computer and the FPGA-hosted accelerator. This protocol allows the computer to send commands over a USB-to-UART serial link, which is facilitated by an on-board bridge. On the FPGA, a dedicated module decodes these commands to dynamically update the accelerator’s configuration registers with the desired values.

4.1. PC Custom Communication Interface

From the computer side, a custom software framework has been developed, providing an abstraction layer that facilitates the configuration of the accelerator system through the functions presented in Table 3.
The first function, dec_to_bin, converts numbers from decimal to the desired fixed-point representation, thus offering a more user-friendly configuration of the sine generator. The bin_to_dec function converts data from fixed point into decimal in order to display the values read from the function generator more easily. The input parameters of both functions are the value, the number of fractional bits, the number of integer bits, and the signed/unsigned number format.
The sine_config function takes as input arguments the desired amplitude, offset, frequency, and phase of the analog wave to be generated by the accelerator system. The input values of this function are given in decimal, with the function calling dec_to_bin to perform the conversion. Underneath this function, a set of optimization problems is solved to generate the values of the configuration registers. These optimization problems involve finding the correct system clock division factor and internal counter step to achieve the desired output frequency.
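A hedged sketch of such a frequency-planning step is shown below; the counter width, divider range, and the exact search strategy are assumptions for illustration, not the paper’s actual firmware.

```python
# Hypothetical frequency planner: given a target output frequency, pick a
# power-of-two clock divider and an integer phase-counter step minimizing
# the frequency error. Constants are assumptions, not the paper's values.
SYS_CLK_HZ = 50e6          # DAC-limited system clock (Section 4.3)
COUNTER_BITS = 16          # assumed width of the internal phase counter

def plan_frequency(target_hz: float) -> tuple[int, int]:
    best = None
    for div_log2 in range(16):                 # divider is a power of two
        sample_rate = SYS_CLK_HZ / (1 << div_log2)
        step = round(target_hz * (1 << COUNTER_BITS) / sample_rate)
        if not 1 <= step < (1 << COUNTER_BITS):
            continue
        actual = step * sample_rate / (1 << COUNTER_BITS)
        err = abs(actual - target_hz)
        if best is None or err < best[0]:
            best = (err, div_log2, step)
    return best[1], best[2]

div, step = plan_frequency(1_000.0)   # e.g. a 1 kHz sine
print(div, step)
```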
The control_signal function is used to enable/disable the generator, to change the polarity of the output signal, and to configure the trigger for the shadow buffers of the control registers inside the generator. The trigger can be set to ‘none’ or configured to either a period match event or an external trigger signal coming from an input pin of the FPGA.
The read_register and write_register functions allow direct access to the accelerator system’s control registers.
The read_sine_output function can be used to read the accelerator output back to the computer for further processing or plotting.

4.2. FPGA IPs

The FPGA board receives the commands sent from the computer, then decodes and executes them. As presented in Figure 7, in addition to the function generation engine, multiple IPs were added, such as the UART module, System Registers and Output Interface.
The control FSM determines which configuration register needs to be updated and asserts the address bus accordingly. The next step is transferring the received value to the target configuration register on the data bus. The control and configuration finite state machine can either be IDLE, WRITING, READING or DONE, as presented in Figure 8. Upon a reset, the FSM will enter the IDLE state.
When the UART flags that a new byte has been successfully received, the FSM decodes the received byte and executes the appropriate command. If the command implies writing a new register value, the address is extracted from the command byte. The following bytes contain the data that needs to be written to the designated register. For reading operations, the computer expects to receive the register values, starting with the most significant byte, from the address stored in the command byte.
The configuration registers hold values that are relevant to the behavior of the accelerator system: regular/half-wave rectified/full-wave rectified sine mode, system clock frequency division factor, angular frequency counter step, amplitude and phase of the output wave. The registers along with the additional circuitry required for the validation of the accelerator system are presented in Figure 9.
The clock division register holds the clock division factor value with respect to the system clock frequency. The clock divider is based on a binary counter and only powers of two are available for the clock division factor.
The angular frequency register holds the magnitude with which the internal counter increases at each positive edge of the clock signal. The phase register holds the start value of the internal counter and determines which of the four quadrants the counter starts from.
The counter is limited between the values corresponding to the desired signed fixed-point representation of 0 and 2π. When the counter transitions between quadrants, its output is processed so that the accelerator’s input is kept between 0 and π/2, exploiting the symmetry of (co)sine waves outlined by Equation (3). The sign of the output wave is determined based on the current quadrant and the selected rectification mode, and is applied at the accelerator’s output.
The amplitude register holds an unsigned fixed point value between 0 and 1 that will be multiplied with the output of the accelerator, in order to match the input range of the external DAC chip placed on the PCB. The offset register holds the signed fixed point value that will be added to the output of the multiplication between the amplitude register’s contents and the accelerator’s output.
The output interface block is responsible for formatting the accelerator’s output so that it matches the digital input coding format of the DAC.

4.3. The Custom High-Speed DAC Board

The accelerator system hardware implementation consists of two primary components: the previously described FPGA board and a custom-designed PCB for high-fidelity digital-to-analog conversion. The PCB was specifically developed to handle the accelerator’s output, integrating a high-speed, differential current-steering DAC with a parallel digital interface to maximize data throughput. To convert the DAC’s current output into a usable voltage signal, the board also includes an operational amplifier configured as a current-to-voltage (I/V) converter stage. Figure 10 provides a high-level schematic of this interconnected system, illustrating the primary components and signal paths while omitting auxiliary circuitry such as power regulators and passive components for clarity.
Figure 11 provides further details of the schematic implemented on the PCB. The DAC is the LTC1666 from Linear Technology (Analog Devices) [19]. It has a 12-bit resolution, a parallel input interface, a differential current output, and a sample rate of 50 MSa/s, which constrains the system’s operating frequency to 50 MHz. The I/V conversion is performed by the resistors at the DAC output, which produce a ±1 V signal. The LDO, LT3032 [20], converts the ±15 V from the power source to the ±5 V required to supply the devices. The amplifier, LT1809 [21], is used to generate a single-ended output and to amplify the difference between the V_A and V_B voltages from ±1 V to ±5 V, using the LT5400 integrated matched resistors [22].

5. Results

This section evaluates the proposed sine wave synthesizer in terms of area, speed, and precision. We also report on its experimental validation within an FPGA-based function generator. To facilitate a comprehensive comparison with prior works, we introduce a figure of merit (FoM) derived from these performance metrics.

5.1. Implementation Results

The proposed design is compared with other implementations analyzed in [10], utilizing the exact same FPGA family, i.e., the Xilinx (AMD) XC7A100T-3CSG324, to provide the fairest grounds for comparison. Evaluating architectures across different FPGA families introduces unfair biases due to inherent differences in routing fabric, logic elements, and DSP blocks; therefore, accurate benchmarking requires implementations to be judged on the same target FPGA. While the synthesis settings were not provided in the cited comparative study, the magnitude of the performance gaps observed should not be significantly altered by different synthesis strategies. For this work, the synthesis strategy focused on minimal latency, at the price of higher area utilization. A limitation of this benchmarking methodology is the reliance on a single FPGA family, meaning absolute metrics may scale differently on other architectures.
Table 4 presents a comparison of LUT utilization for the proposed design and other literature implementations across three common data bus widths: 16, 24, and 32 bits. The results clearly demonstrate the superior area efficiency of our proposed single-core implementation. For a 16-bit data path, our design requires only 36 LUTs, which is approximately 5× smaller than the next most efficient method (parabolic synthesis at 179 LUTs) and over 40× smaller than the Bhaskara-based approach.
This trend of significant area reduction holds as precision increases. At 32 bits, our single-core design consumes only 142 LUTs, representing a reduction of 81% compared to parabolic synthesis and 97% compared to the Taylor series implementation.
Furthermore, the tables include a four-core version of our design to illustrate its scalability. To clarify the relationship between the two configurations: the four-core architecture simply instantiates four independent single-core modules operating in parallel. Because the single-core module requires four clock cycles to compute one sample, interleaving four such cores allows the overall system to output one valid sample every single clock cycle, thereby quadrupling the throughput at the expense of increased logic area.
While the resource usage increases, the four-core implementation remains highly competitive. For instance, at a 32-bit width, it requires 776 LUTs, which is comparable to the parabolic synthesis method (779 LUTs) but offers the potential for a four-fold increase in throughput. This highlights a key trade-off, allowing for a balance between minimal area for a single-core implementation and high performance for a multi-core architecture, all while maintaining a smaller footprint than conventional CORDIC, Taylor series, or Bhaskara-based solutions at higher bit-widths.
The speed performance of the proposed architecture is evaluated by analyzing the maximum combinational path delay, with results presented in Table 5.
The results indicate that our proposed architecture achieves a shorter critical path delay compared to all referenced designs across all data bus widths. For a 32-bit implementation, our single-core design has a delay of only 12.656 ns, which is 2.7× faster than the most competitive alternative (parabolic synthesis at 34.462 ns) and over 12× faster than the Bhaskara-based circuit. Hence, this short delay allows our design to be integrated into high-speed systems.
However, a four-cycle latency is required for our single-core design. To address this, the four-core version is presented. This parallel architecture is designed to produce one sample per clock cycle, thereby quadrupling the throughput. Notably, the four-core implementation exhibits a slightly lower combinational delay (e.g., 11.936 ns for 32 bits) than the single-core version due to synthesis tool optimizations across the smaller, more independent parallel paths. This demonstrates that our architecture can be configured for either maximum clock speed with moderate throughput (single-core) or for maximum throughput at a similarly high clock speed (four-core), offering a flexible trade-off between area (as shown in Table 4) and sample generation rate.
To quantify the numerical accuracy, the maximum relative error of the generated sine wave was measured for each implementation. The results, shown in Table 6, were obtained using the evaluation framework described in [10].
The analysis reveals that our proposed architecture achieves a level of precision that is superior to the state-of-the-art. At a 16-bit data width, our design’s relative error of 0.0048% is significantly lower than that of the Bhaskara, CORDIC, and parabolic synthesis methods, and is surpassed only slightly by the Taylor series implementation. As the data bus width increases to 24 bits, the precision of our design becomes virtually identical to that of the Taylor series method, with both exhibiting a relative error of approximately 0.0003%. At the highest tested precision of 32 bits, our work demonstrates a clear advantage, achieving a relative error of 0.00019%, which is about 37% lower than the 0.0003% error of the Taylor series approach. This shows that the accuracy of our algorithm scales more effectively with increasing bit width than other high-precision methods.
As expected, the precision is identical for both the single-core and four-core implementations since they execute the same algorithm and only differ in their parallel structure. This result, combined with the area and speed metrics from Table 4 and Table 5, confirms that our design provides top-tier precision without the significant hardware overhead or high latency characteristic of methods like the Taylor series or CORDIC, respectively.
To provide a holistic comparison that balances area, speed, and precision, we define a figure of merit (FoM) as depicted in Equation (5), to compare the proposed implementation with those in [10] on the same FPGA fabric.
FoM = Area [LUTs] · Delay [ns] · Error [%]
The proposed FoM leverages an Error-Area-Delay Product (eADP), a well-established benchmark for digital systems, to emphasize the critical balance between hardware efficiency and wave synthesis quality. While alternative FoMs could be defined, such as an Area-Delay Product (ADP) for strictly throughput-centric applications, or metrics granting a larger weight to area for highly resource-constrained environments, the eADP provides the most comprehensive evaluation. Importantly, the overall conclusions exhibit low sensitivity to the specific metric chosen. Because the proposed architecture demonstrates simultaneous reductions in area and delay alongside equivalent or superior numerical accuracy, it maintains a distinct competitive advantage even if the error metric is omitted or if alternative weightings are applied. A lower FoM indicates a more efficient design, as it represents a better trade-off between resource consumption (area), processing time (delay), and numerical accuracy (error). The FoM was calculated for each implementation across 16-, 24-, and 32-bit data widths, with the results summarized in Table 7 and visualized in Figure 12.
For this analysis, several methodological considerations are important:
  • Single-Core Latency: To ensure a fair comparison of throughput, the delay component for our single-core synthesizer was multiplied by four, reflecting its four-cycle latency to produce a new sample.
  • CORDIC Reference: The speed for the CORDIC algorithm was not benchmarked in [10] due to its variable latency. Therefore, we conservatively used its minimum (best-case) delay value for this FoM calculation as a reference point.
  • Visual Scaling: In Figure 12, the FoM for the Bhaskara implementation was scaled down by a factor of 1000, and that of CORDIC by a factor of 2. This was necessary for visual clarity, to prevent their significantly larger values from obscuring the comparison between the other, more efficient designs.
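With these conventions, the 32-bit entries of Table 7 can be reproduced directly from Tables 4–6. The following Python sketch is only a bookkeeping aid (the dictionary layout is ours; the numbers are the published table values):

```python
# Area [LUTs], delay [ns], and relative error [%] at 32 bits (Tables 4-6).
# The single-core delay is multiplied by 4 to account for its latency.
designs = {
    "This Work (1 core)":  (142, 12.656 * 4, 0.00019),
    "This Work (4 cores)": (776, 11.936, 0.00019),
    "Taylor series":       (4043, 62.469, 0.0003),
    "Parabolic synthesis": (779, 34.462, 0.003),
}

def fom(area, delay_ns, error_pct):
    # Error-Area-Delay Product per Equation (5); lower is better.
    return area * delay_ns * error_pct

foms = {name: fom(*params) for name, params in designs.items()}
```

Rounding to three decimals recovers the 32-bit column of Table 7 for these four designs.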
The results demonstrate the superiority of our proposed architectures. As shown in Table 7, our single-core implementation achieves the lowest FoM at 16, 24 and 32 bits, outperforming the Taylor series method.
Furthermore, our four-core implementation, which delivers one sample per cycle, establishes a new state-of-the-art in overall efficiency. When compared against the best-performing reference design (Taylor series), our four-core architecture achieves a superior FoM by 48.2% at 16 bits, 96.6% at 24 bits, and 97.7% at 32 bits. In conclusion, the proposed design offers an excellent trade-off between performance, area, and precision.
While the proposed architecture demonstrates highly favorable resource and delay metrics, practical constraints must be considered during system-level integration. The minimal logic footprint of the single-core design inherently minimizes dynamic power consumption and routing congestion. However, scaling to the four-core interleaved architecture increases resource utilization, which may slightly impact routing complexity in highly dense FPGA fabrics. Furthermore, the scalability to higher clock rates is strictly bounded by the maximum combinational path delays detailed in Table 5. In our experimental validation, the system's operating frequency was constrained to 50 MHz by the 50 MSps sample rate of the external DAC utilized. Nevertheless, deploying a higher-speed DAC would allow the digital core to operate at higher frequencies, up to the theoretical limit of 1/t_delay dictated by the target silicon fabric. From an ASIC implementation standpoint, the design can be readily ported to integrated circuits, provided that the multiply–accumulate block employs a custom single-cycle multiplier such as the one from [23].
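The theoretical frequency ceiling mentioned above follows directly from the four-core delays in Table 5; a quick check (Python, unit conversion only):

```python
# Worst-case combinational delays of the four-core design [ns] (Table 5)
delays_ns = {16: 8.247, 24: 9.952, 32: 11.936}

# Theoretical maximum clock frequency f_max = 1 / t_delay, converted to MHz
fmax_mhz = {width: 1e3 / delay for width, delay in delays_ns.items()}
```

The 32-bit core could thus run at roughly 84 MHz on this fabric, comfortably above the 50 MHz DAC-imposed limit of the experimental setup.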

5.2. Experimental Validation

For experimental validation, the proposed sine/cosine synthesizer was integrated into an FPGA-based function generator controlled by a PC. The hardware setup, depicted in Figure 13, consists of the FPGA development board hosting the synthesizer and a custom PCB for digital-to-analog conversion. The FPGA board is powered by the host computer, while the analog PCB requires an external ± 15 V supply.
The sine wave synthesizer is configured through commands sent from the PC, and the output of the analog platform is observed and measured using an oscilloscope. All configurations were tested, including regular sine waves, half-wave rectified, and full-wave rectified modes. In addition, different signal parameters such as frequency, amplitude, and offset were adjusted to evaluate the flexibility and performance of the system.
The system’s waveform generation capabilities were validated through a series of tests under diverse parameter configurations. Initially, a standard 1.5 MHz sine wave with a 3 V amplitude and zero offset was generated to establish a baseline (Figure 14). The firmware’s dynamic frequency adaptation was then tested by generating a 1 kHz sine wave with a 2 V amplitude and a +2 V DC offset; this configuration required the firmware to select an optimal clock division factor and angular frequency step (Figure 15). Finally, the system’s non-linear output modes were demonstrated by generating a 3 V half-wave rectified signal (Figure 16) and a 2.5 V inverted full-wave rectified signal (Figure 17).
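The firmware's dynamic frequency adaptation can be sketched as follows. This is an illustrative Python model, not the actual firmware: `F_CLK` matches the 50 MSps DAC limit from the paper, while `MIN_SAMPLES`, the power-of-two divider search, and the function name are our assumptions about one plausible policy.

```python
import math

F_CLK = 50e6       # sample clock, limited by the 50 MSps DAC
MIN_SAMPLES = 64   # hypothetical minimum samples per output period

def configure(f_target):
    # Pick a clock divider so low-frequency outputs do not waste an
    # excessive number of samples per period, then derive the angular
    # frequency step (radians advanced per generated sample).
    div = 1
    while F_CLK / div / f_target > 2 * MIN_SAMPLES and div < 1 << 16:
        div *= 2   # slow the effective sample clock for low frequencies
    f_sample = F_CLK / div
    step = 2 * math.pi * f_target / f_sample
    return div, step

div, step = configure(1e3)   # the 1 kHz case from the experiments
```

For the 1 kHz test case this policy slows the sample clock by 512×, keeping roughly 98 samples per period; for the 1.5 MHz case no division is needed.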
The waveform generator consistently delivered accurate and stable outputs across all configurations, including sinusoidal, half-wave, and full-wave rectified signals. Frequency tuning was smooth and effective, demonstrating the system’s flexibility and precise control. These results highlight the reliability and adaptability of the signal generation setup for a wide range of test and measurement scenarios.

6. Conclusions

A new architecture for ROM-less sinusoidal waveform synthesis was presented and implemented. The results of the FPGA implementation were compared with other solutions from the literature, and a figure of merit was defined for an overall performance evaluation. The comparison shows that this work achieves 80.80% lower resource utilization, 80.89% reduced propagation delay, and 36.66% higher accuracy than the best existing solution, the Taylor-series approach, with both operating as 32-bit systems at a throughput of one sample per clock cycle. The proposed cosine synthesizer was integrated along with control and communication IPs into an FPGA-based function generator. Custom firmware was developed to configure the function generator, and a custom PCB was created for the digital-to-analog conversion stage. The complete system was validated in real-world use cases with oscilloscope measurements.

Author Contributions

Conceptualization, F.-G.S.; Methodology, F.-G.S. and A.C.; Software, F.-G.S. and A.C.; Validation, F.-G.S. and A.C.; Writing—original draft, M.E., F.-G.S. and A.C.; Writing—review & editing, M.E., F.-G.S. and A.C.; Visualization, A.C.; Supervision, M.E.; Project administration, M.E. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded from the project “National Platform for Semiconductor Technologies”, Contract no. G 2024-85828/390008/27.11.2024, SMIS Code 351364, funded by the European Regional Development Fund under the Operational Program for Smart Growth, Digitization and Financial Instruments (POCIDIF), Priority 4—Development of Strategic Technologies for Europe—STEP.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ASIC: Application-Specific Integrated Circuit
DAC: Digital-to-Analog Converter
DDS: Direct Digital Synthesis
DMA: Direct Memory Access
EMI: Electromagnetic Interference
FoM: Figure of Merit
FPGA: Field-Programmable Gate Array
FSM: Finite State Machine
IP: Intellectual Property
LUT: Lookup Table
MAC: Multiply Accumulate
PCB: Printed Circuit Board
PWM: Pulse-Width Modulation
ROM: Read-Only Memory
SoC: System on Chip
UART: Universal Asynchronous Receiver/Transmitter
USB: Universal Serial Bus

References

  1. Kim, J.; Mohamed, M.G.A.; Kim, H. Design of a Frequency Division Concurrent sine wave generator for an efficient touch screen controller SoC. In Proceedings of the 2015 International Symposium on Consumer Electronics (ISCE), Madrid, Spain, 24–26 June 2015; pp. 1–2. [Google Scholar]
  2. Bendicks, A.; Peters, A.; Frei, S. FPGA-Based Active Cancellation of the EMI of a Boost Power Factor Correction (PFC) by Injecting Modulated Sine Waves. In IEEE Letters on Electromagnetic Compatibility Practice and Applications; IEEE: New York, NY, USA, 2021; Volume 3, pp. 11–14. [Google Scholar]
  3. Xie, B.; Chen, T. Sine wave algorithm based on 2nd offset and its implementation in FPGA. In Proceedings of the IEEE 2011 10th International Conference on Electronic Measurement & Instruments, Chengdu, China, 16–19 August 2011; pp. 173–176. [Google Scholar]
  4. Adiono, T.; Timothy, V.; Ahmadi, N.; Candra, A.; Mufadli, K. CORDIC and Taylor based FPGA music synthesizer. In Proceedings of the TENCON 2015—2015 IEEE Region 10 Conference, Macao, China, 1–4 November 2015; pp. 1–6. [Google Scholar]
  5. Lutter, K.; Backer, A.; Drese, K.S. Guided Acoustic Waves in Polymer Rods with Varying Immersion Depth in Liquid. Sensors 2023, 23, 9892. [Google Scholar] [CrossRef] [PubMed]
  6. NXP Semiconductor. AN4869: Sinusoidal Control of BLDCM with Hall Sensors—Application Note, Rev 0.03; NXP Semiconductor: Eindhoven, The Netherlands, 2014. [Google Scholar]
  7. Miller, A. AN3312: Arbitrary Waveform Generator Using DAC and DMA; Microchip Technology Inc.: Chandler, AZ, USA, 2019. [Google Scholar]
  8. Revanna, N.; Viswanathan, T.R. Low frequency CMOS sinusoidal oscillator for impedance spectroscopy. In Proceedings of the 2014 IEEE Dallas Circuits and Systems Conference (DCAS), Richardson, TX, USA, 12–13 October 2014; pp. 1–4. [Google Scholar]
  9. Strelnikov, I.V.; Ryabov, I.V.; Klyuzhev, E.S. Direct Digital Synthesizer of Phase-Manipulated Signals, Based on the Direct Digital Synthesis Method. In Proceedings of the 2020 Systems of Signal Synchronization, Generating and Processing in Telecommunications (SYNCHROINFO), Svetlogorsk, Russia, 1–3 July 2020; pp. 1–3. [Google Scholar]
  10. Roy, S.; Kumar, D.; Dandapat, A.; Saha, P. Discretized Sinusoidal Waveform Generators for Signal Processing Applications. In Proceedings of the 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 11–12 May 2018; pp. 1350–1353. [Google Scholar]
  11. Verma, S.K.; Pullakandam, M.; Yanamala, R.M.R. Pipelined CORDIC Architecture Based DDFS Design and Implementation. In Proceedings of the 2023 IEEE 20th India Council International Conference (INDICON), Hyderabad, India, 14–17 December 2023; pp. 1440–1445. [Google Scholar]
  12. Chinnathambi, M.; Bharanidharan, N.; Rajaram, S. FPGA implementation of fast and area efficient CORDIC algorithm. In Proceedings of the 2014 International Conference on Communication and Network Technologies, Sivakasi, India, 18–19 December 2014; pp. 228–232. [Google Scholar]
  13. Brunelli, C.; Berg, H.; Guevorkian, D. Approximating sine functions using variable-precision Taylor polynomials. In Proceedings of the 2009 IEEE Workshop on Signal Processing Systems, Tampere, Finland, 7–9 October 2009; pp. 57–62. [Google Scholar]
  14. Nekounamm, M.; Eshghi, M. An efficient ROM-less direct digital synthesizer based on Bhaskara I’s sine approximation formula. In Proceedings of the 2012 IEEE International Frequency Control Symposium Proceedings, Baltimore, MD, USA, 21–24 May 2012; pp. 1–6. [Google Scholar]
  15. Li, X.; Lai, L.; Lei, A.; Lai, Z. A direct digital frequency synthesizer based on two segment fourth-order parabolic approximation. IEEE Trans. Consum. Electron. 2009, 55, 322–326. [Google Scholar]
  16. Stoica, F.-G.; Calinescu, A.; Enachescu, M. A High-Speed, Area-Optimized, ROM-Less (Co)Sine Wave Synthesis Accelerator. In Proceedings of the 2024 International Symposium on Electronics and Telecommunications (ISETC), Timisoara, Romania, 7–8 November 2024; pp. 1–4. [Google Scholar]
  17. Nanfak, A.; de Dieu Nkapkop, J.; Zourmba, K.; Ngono, J.M.; Moreno-López, M.F.; Tlelo-Cuautle, E.; Borda, M.; Effa, J.Y. Dynamic analysis and FPGA implementation of a 2D fractional sine-cosine map for image encryption using bit-level permutation and genetic algorithm. Math. Comput. Simul. 2025, 240, 105–136. [Google Scholar] [CrossRef]
  18. Schlör, L. Fast MiniMax Polynomial Approximations of Sine and Cosine. Available online: https://gist.github.com/publik-void/067f7f2fef32dbe5c27d6e215f824c91 (accessed on 15 February 2026).
  19. Linear Technology. LTC1666/LTC1667/LTC1668 Datasheet. Available online: https://www.analog.com/media/en/technical-documentation/data-sheets/166678f.pdf (accessed on 15 February 2026).
  20. Linear Technology. LT3032 Series Datasheet. Available online: https://www.alldatasheet.com/datasheet-pdf/pdf/332597/LINER/LT3032.html (accessed on 15 February 2026).
  21. Linear Technology. LT1809/LT1810s Datasheet. Available online: https://www.alldatasheet.com/datasheet-pdf/pdf/259954/LINER/LT1810.html (accessed on 15 February 2026).
  22. Linear Technology. LT5400 Datasheet. Available online: https://www.alldatasheet.com/datasheet-pdf/pdf/1299597/AD/LT5400.html (accessed on 15 February 2026).
  23. Yogeshwaran, K.; Yogesh, M.; Srinivass, B.N.; Veerendiran, S. Cutting Edge Design of High Performance Dadda Multipiler for Fast Arithmetic Operations. In Proceedings of the 2025 3rd International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), Coimbatore, India, 4–5 April 2025; pp. 1–5. [Google Scholar]
Figure 1. Sine polynomial approximation error comparison on a logarithmic scale.
Figure 2. Average error of fixed-point implementation vs. number of fractional bits (top); precision saturation after 16 bits (bottom).
Figure 3. Internal structure of the iterative cosine synthesizer.
Figure 4. Achieving one sample per clock cycle by using 4 interleaved cosine synthesizers.
Figure 5. Timing of outputs of a system built with 4 cosine synthesizers with interleaved activity. Each clock cycle generates a new cosine function output.
Figure 6. Validation framework.
Figure 7. System block diagram implemented on the FPGA.
Figure 8. States of the control FSM.
Figure 9. Accelerator system control registers.
Figure 10. FPGA and D/A conversion PCB high-level schematic.
Figure 11. Detailed high-speed DAC PCB schematic.
Figure 12. Figure of merit (FoM) comparison across 16-, 24-, and 32-bit data widths.
Figure 13. Validation boards.
Figure 14. Regular sine: F = 1.5 MHz, V_PP = 6 V, V_OS = 0 V.
Figure 15. Regular sine: F = 1 kHz, V_PP = 2 V, V_OS = 2 V.
Figure 16. Half-wave rectified: F = 1 kHz, V_P = 3 V, V_OS = 0 V.
Figure 17. Full-wave rectified: F = 1 kHz, V_P = 2.5 V, V_OS = 0 V, inverted.
Table 1. Constant coefficients used in the proposed polynomial approximation of the cosine function.
Term | Value | Qs5.10 | Qs5.18 | Qs5.26
a_0 | 0.999970210689953068626323587055728078 | 0x03FE | 0x03FFF7 | 0x03FFF82F
a_1 | −0.499782706704688809140466617726333455 | 0xFE01 | 0xFE003A | 0xFE0038F7
a_2 | 0.0413661149638482252569383872576459943 | 0x002A | 0x002A5B | 0x002A5BE0
a_3 | −0.0012412397582398600702129604944720102 | 0xFFFF | 0xFFFEBB | 0xFFFEBA9E
Table 2. Error comparison for different sine approximation methods.
Error | This Work | Spline | Taylor | Bhaskara
RMS Error [×10⁻⁵] | 1.48 | 37.07 | 3.81 | 97.36
Max Absolute Error [×10⁻⁵] | 2.97 | 89.01 | 16.52 | 163.17
Average Absolute Error [×10⁻⁵] | 1.20 | 25.18 | 1.65 | 83.64
Table 3. Firmware functions used for system control.
Functions
dec_to_bin(value, sign, no_of_bits_for_int, no_of_bits_for_frac)
bin_to_dec(value, sign, no_of_bits_for_int, no_of_bits_for_frac)
sine_config(sin_type, amp, off, freq, phase)
control_signal(enable, inverted, load_trig)
write_reg(reg_name, write_value)
read_reg(reg_name)
read_sine_output(x_times)
Table 4. Utilization report.
LUTs required for different data bus widths:
Implementation | 16 bits | 24 bits | 32 bits
Bhaskara | 1443 | 3363 | 6065
CORDIC | 800 | 912 | 1024
Parabolic synthesis | 179 | 406 | 779
Taylor series | 583 | 2359 | 4043
This Work (1 core) | 36 | 56 | 142
This Work (4 cores) | 342 | 424 | 776
Table 5. Timing report.
Highest delays for different data bus widths [ns]:
Implementation | 16 bits | 24 bits | 32 bits
Bhaskara | 75.992 | 119.387 | 159.401
Parabolic synthesis | 18.848 | 29.082 | 34.462
Taylor series | 44.858 | 56.88 | 62.469
This Work (1 core) | 8.402 | 10.352 | 12.656
This Work (4 cores) | 8.247 | 9.952 | 11.936
Table 6. Relative error comparison.
Relative error for different data bus widths [%]:
Implementation | 16 bits | 24 bits | 32 bits
Bhaskara | 0.153 | 0.167 | 0.16
CORDIC | 0.025 | 0.021 | 0.021
Parabolic synthesis | 0.017 | 0.003 | 0.003
Taylor series | 0.001 | 0.0003 | 0.0003
This Work (1 core) | 0.0048 | 0.00032 | 0.00019
This Work (4 cores) | 0.0048 | 0.00032 | 0.00019
Table 7. Figure of merit comparison.
Width | This Work (1 core) | This Work (4 cores) | Taylor Series | Parabolic Synthesis | CORDIC | Bhaskara
16 bits | 5.807 | 13.538 | 26.152 | 57.354 | 164.940 | 16,777.438
24 bits | 0.742 | 1.350 | 40.254 | 35.422 | 190.601 | 67,050.246
32 bits | 1.366 | 1.760 | 75.769 | 80.538 | 256.672 | 154,682.730
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
