Short-Time Fourier Transform Based on Metaprogramming and the Stockham Optimization Method

The extension for high-performance STFT (Short-Time Fourier Transform) algorithm written entirely in Java language for non-parallel computations is presented in the current paper. This solution could compete with the best available and most common algorithms supplied by libraries such as FFTW or JTransform. The main idea was to move complex computations and expensive functions to the program generation phase. Thus, only core and essential operations were executed during the runtime phase. Furthermore, new approach allows to eliminate the necessity for a rearrangement operation that uses the bit-reversal permutation technique. This article presents a brief description of the STFT solution that was worked out as an extension for the original application, in order to increase its efficiency. The solution remains a Stockham algorithm adapted using metaprogramming techniques and entails an additional reduction its execution time. Performance tests and experiments were conducted using a Java Platform and JMH library, which allowed for accurate execution time measurements. Major aspects of the Java VM like warm-up effects were also taken into consideration. Solution was applied into Electrical Capacitance Tomography measurement system in order to measure the material changes during the silo discharging industrial process.


Introduction
During silos discharging process, particularly with grain or other loose material, dynamic effects frequently occur. The dynamic effects and other related to material conveying processes phenomena have a major impact to the surrounding environment leading to material depreciation or even industrial disasters. The silo vibrations are commonly caused or induced by material discharge and a resonance between the eigen frequency of the silo structure, the frequencies of self-excited material particles and the frequencies of the operating machines [1].
One of the process tomography technique is Electrical Capacitance Tomography (ECT). This allows to measure parameters, such as distribution of the material concentration in the cross section of the silos, the change of material concentration in time during filling and emptying the silo and the vibrations of the transported material. STFT algorithm presented in this article was used to measure such frequency phenomena [2].
The methods for signal analysis like Frequency transform are still an essential tool in digital signal processing (DSP). They are used by many algorithms, e.g., linear filtration, correlation and frequency spectrum analysis. Particularly Discrete Fourier Transform (DFT) and its optimized version Fast Fourier Transform (FFT) are used. Both give good results for signals from stationary systems, but are not sufficient to describe signals generated by systems whose properties change over time (not stationary). Such signals require special treatment using dedicated analysis techniques. One such technique is the Short-Time Fourier Transform (STFT), a very popular tool used in analysis of frequency changes characteristic in time domain that provides information of harmonics magnitude in consecutive niques [28]. It can be explained as programming at a higher level of abstraction, using another program (metaprogram) to obtain the target application [29]. Metaprogramming may thus be described as an automatic approach to program creation [30].
AOP refers to code injections in an implicit manner [1]. The external part of the script is applied to the target application in places that have been earlier specified. Adding aspect code to the program can take place before its launch as well as during operation. Aspect programming is often used for code optimization and generalization.
Dynamic metaprogramming covers techniques such as introspection, reflection, dynamic polymorphism, and interpreters.
Static metaprogramming allows for precise code preparation before runtime, during which expensive computations are transferred outside. Static metaprogramming covers multiple solutions, such as preprocessors, compilators, linkers, macro, code templates, inline functions, and other code generators. The optimization techniques involve: unwinding the loop (loop body code multiplication that leads to reduction of expensive 'if' or 'jump' instruction usage); vectorization [31] (additional SIMD operations utilizations and lowlevel instruction execution during runtime); parametrization (the ability to generate final code based on its domain parameters); code templates (as part of static polymorphism); loop tiling or lookup table preparation [8].
Although metaprogramming techniques have been known for years, there has been little attempt to apply them to digital signal processing algorithms so far. Possibilities of using metaprogramming in DSP are discussed in [32]. An example implementation of Feedback Delay Network (FDN) artificial reverberation algorithm is also presented there. In [33] a Python non-uniform fast Fourier transform (PyNUFFT) package has been presented. Metaprogramming libraries have been employed for its acceleration. GPU implementation of FFT and DFT using meta programming optimization techniques is presented in [34].
In particular, a group of static metaprogramming techniques is particularly interesting for the purpose of optimizing the STFT. STFT algorithm has a large number of intermediate products obtained during code execution that can be successfully moved into the code generation phase. Obviously, this is extremely costly in terms of memory, but at the same time it derives a significant profit in terms of execution time.
The research presented below in the article was aimed at reducing the computational complexity of the STFT calculation algorithm with the use of various optimization techniques, with particular emphasis on metaprogramming implementation techniques.

Method
In order to increase STFT algorithm performance many optimization methods are used: The Danielson-Lanczos lemma enables the Fourier Transform to be computed in a faster way than using a naive DFT algorithm [6,35]. In this approach, the input sequence is divided into two subsequences, a consecutive series of odd and even indices. Such decomposition is called decimation, in this case decimation in time (DIT). There is also decimation in frequency (DIF), and both methods require N/2*log 2 (N) complex additions, subtractions and multiplications. When the decimation base is equal to 2 it is called RADIX-2 decimation. This has an additional impact on input data. The input series should have the size of the power of two, to divide the sequence into two subseries of equal length in recursive stages. The division ends when there are only one-element arrays.
In commonly-used applications, there are several techniques for time series division. For instance, RADIX-4 or the more efficient Split-RADIX may be used. When the input array has N elements, and N is not a prime number, factorization is possible, whereby, for example, from a 15-element array we obtain 3 arrays consisting of 5 elements or 5 arrays with 3 elements.
Both computations, DIT and DIF can be performed in-situ what means that it doesn't require additional memory. The disadvantage is that the output is not in order. It is possible to obtain output in order, but the computation cannot be done in-situ, and at least one additional array must be provided [6].
The FFT computation process is presented in Figure 1. On the left-hand side there is RADIX-2 decimation. This is only a logic decomposition to present the idea of the second step of real computations. Then two-element sequences are combined (DFT). For real numbers, such operations are trivial and quite fast. There are no complex number multiplications. The next stages require complex number arithmetic, which uses so-called twiddle factors whereby values depend on the subsequence length and the position of the current computation pair. Twiddle factors are often obtained from a look-up table, as one memorization technique. In metaprogramming, these values also may be explicitly placed in execution code, significantly increasing its performance. Each stage returns a partial output that is used in the following stage. The number of stages depends on the length of the input series. When RADIX-2 decimation is used, there are log 2 (n) stages, so for example in Figure 1 the given array is 8 in size and there are 3 stages. Finally, after all processing stages there is a Fourier Transform output with N complex numbers but disordered ( Figure 1).
To obtain an output in the appropriate order, the indexes should be rearranged. Here, several solutions are allowed. One of them is a bit-reversal permutation [6], which is applied with time-execution complexity O(n). In the example presented below (Equation (1)), which describes a situation where the possible values are in the range of [0, 7] giving k = 3 digits in binary system, it can be seen that the values 0 10 and (2 k −1) 10 are unchanged. K = 0 10 = 000 b and k = 3 => bit_rev_permutation(K) = 000 2 = 0 10 ← unchanged K = 6 10 = 110 b and k = 3 => bit_rev_permutation(K) = 011 2 = 3 10 K = 7 10 = 111 b and k = 3 => bit_rev_permutation(K) = 111 2 = 7 10 ← unchanged The number of array operations A op is lower than n (A op < n). Nonetheless, it is possible to omit this expensive overhead, as Swarztrauber observed [36]. Basically, each of the algorithms presented in Swarztrauber's paper computes the same result. They differ only in the ways that the intermediate computations are stored. These solutions present the possibility of adapting FFT mainly to vector computers. Cooley-Tukey and Pease algorithms compute the FFT in a permuted order, thus often a separate order phase is used to rearrange the output elements. This operation is time-consuming not only on vector computers.
In a paper that is fundamental to the early FFT literature [6] W.T. Cochran introduced an autosort FFT algorithm and attributed it to Stockham. The Stockham autosort algorithm has received much attention, because it computes the Fourier Transform in the proper order without an explicit order phase. Despite the fact that solution requires additional memory, bit-reversed permutations has very expensive arithmetic and read/write overhead operation, thus it is reasonable to use this method. Listing 1 shows Stockham solution written in FORTRAN [36]. To omit the sorting phase, an additional table is used that allows the data to be kept in order. Listing 1. The code of Stockham solution for FFT computation. Red color presents auxiliary array adjustment for reordering purposes and green color indicates two procedures that were used (code from [36]). follows, STFT modifications are presented that keep output data in order without requiring additional arithmetic and input/output operations. Depending on the processing stage, the initial or auxiliary array is used, as shown in Figure 2. While the consecutive stages (up to phase n-2) are executed, the solution remains in-situ. For phase n-1, the result is saved into an auxiliary array with a length of N-complex. The calculations themselves remain unchanged. In the final phase, n computations are performed with data from an aux array and the output is stored in the initial array in the correct order. To obtain this behavior, the addressing mechanism is changed and bit-reversal permutation coefficients are used only in the metaprogram. There are no reordering operations in run-time, so its execution time is reduced. Figure 3 shows a Short-Time Fourier Transform based on the above solution. In this example, the first two time windows are presented: m = 0, m = 1. The previous stages have been calculated as for regular FFT. The output of stage n-1 for the first FFT (m = 0) is written into an additional array that was highlighted with red color on Figure 3. The output from the last stage (n) is moved back to the initial array in the appropriate order. It is important to note that a feed-forward technique was applied to the metaprogramming solution. This reduces the number of computations, due to the fact that duplicated butterfly operations exist in the process [37]. This can have a major impact on the application. If we introduce an additional array, the data moved from the previous time window (m−1) should be copied to both the initial and auxiliary arrays, depending on the indices and the STFT overlapping factor. Figure 4 presents a rectangular time window and an overlapping factor equal to 1, which means that the time window is moved over the input data one by one sample. Thanks to this approach, there is no need to perform bit-reversal permutation rearrangement. Implementation was performed using the Java Platform with the JavaPoet library (JavaPoet; https://github.com/square/javapoet; accessed on 10 January 2021; author: Jake Wharton, version: 1.10.0). This enabled a large part of the algorithm to be moved into the metaprogram, such as indices management in loops, arithmetic and logical operations that do not depend on input data but only on algorithm structure and other additional arithmetic operations.
The metaprogram begins and generates a sequence of butterfly operations before the target code is executed. Previously, it was composed of seven consecutive phases: creating a model of butterfly operations for time window m = 1; creating a model of butterfly operations for time window m+oc (feedforward technique), where oc is the overlap coefficient; preparation of RADIX-2 decimation indices; identification of duplicated operations; bit-reversal permutation indices preparation and finally target code generation. The basis of the solution is to generate a model of butterfly operations for time window m = 1 and the same for time window m > 1 moved over the input array INPUT with the specified overlapping coefficient (oc). Thanks to these models, RADIX-2 decimation is performed and duplicated butterfly operations are identified (widely described in [8]). Additionally, this step provides information on how large the auxiliary array should be for the feedforward technique. During runtime, while the FFT for time window m =1 is processed, the output from duplicated operations is moved to the auxiliary array. Then, for FFT and time window m = 2, data from the auxiliary array is copied into selected parts of the input array that allows to omit its recalculation [8]. A block diagram of the solution is presented in Figure 4.
The new metaprogram algorithm has been changed to adapt the Stockham solution. Bit-reversal permutation indices must still be prepared, but only to preserve information about the place where the final stage output will be written. There is no need to generate code for an additional phase of indices rearrangement in the target program. All procedures connected to this phase belong only to the metaprogram. Listings 2 and 3 and present the changes that were made in the metaprogram.
Listing 2. Metaprogram that presents a part of butterfly operations generation (BEFORE STOCKHAM ALG. ADJUSTMENT). The generated code consists of variables like: tempR, tempI, FR, FI, FR2, FI2 (real and imaginary part) and tables: current and transitiontable for feedforward solution (presented with green color). Last loop stands for bit-reversal permutation. The array operations_rest stands for the sequence of butterfly operation parameters. Each operation has 5 elements (shown with red color) . . . As it can be seen above (Listing 2, Listing 3) STFT code generator was changed. Listings show that the main loop that is used to prepare target program goes through all butterfly operations. Its configuration is stored in operations_rest array, and iterator increase by 5 for each operation. After change it is increased with 6 because additional parameter was introduced. This parameter handle information about the processed stage. Last stage is indicated by value 0, thus for last stage −1 this parameter will be set to 1. If the last stage is detected, source table name for STFT operation is set to stockhamtable, otherwise it is set to current. Target table name depends on one before last stage. Thanks to this approach metaprogram generates the sequence of target application operations, where data is copied between current and auxiliary array as it was proposed by Stockham.
The result of running the metaprogram is the optimized code of the target application. For the STFT algorithm, the code fragment is presented in Figure 5. Places in the resulting program were indicated that best represent the above-mentioned profit. This code was generated for 8 element FFT series. Main advantage of use metaprogram is to reduce trigonometric operations, omit indices computations and possibility to remove duplicated butterfly operations. The solution is not based on conditional operations, but completely eliminates redundant code in the target application. Test procedure consisted of four steps: code preparation, test preparation, test execution, and results analysis. Code preparation means STFT code generation. This phase is not measured, because once the code has been generated the result may be saved in the repository. Next, test parameters are prepared. These consist of the window size (N = [8…2048]) and the number of windows for STFT (m = 200,000). Finally, an input array of real values is generated as common data for all the tested algorithms.
The measurement phase starts with a so-called warm-up. This approach is commonly used to eliminate the influence of Java virtual machine behavior connected with the justin-time compilation facility. It was assumed and confirmed that 5 iterations are sufficient to omit this effect. The tested solutions were jTransform (DoubleFFT_1D.complexForward method) [38], metaprogramming STFT, and metaprogramming STFT with the Stockham extension. The results of execution time measurements are presented in Figures 6 and 7.

Results
In order to examine the efficiency of the prepared solution, a personal computer equipped with an Intel Core i7-4710HQ processor and 12GB RAM was used. The underlying operating system was Windows 8. The metaprogram and target STFT code was implemented on a Java platform [Java(TM) SE Runtime Environment (build 1.8.0_231-b11)]. The whole verification process was supported by a JMH library created specifically for the purpose of performance analysis. Data that is processed is compliant with IEEE 754 double precision floating-point arithmetic.
Test procedure consisted of four steps: code preparation, test preparation, test execution, and results analysis. Code preparation means STFT code generation. This phase is not measured, because once the code has been generated the result may be saved in the repository. Next, test parameters are prepared. These consist of the window size (N = [8 . . . 2048]) and the number of windows for STFT (m = 200,000). Finally, an input array of real values is generated as common data for all the tested algorithms.
The measurement phase starts with a so-called warm-up. This approach is commonly used to eliminate the influence of Java virtual machine behavior connected with the justin-time compilation facility. It was assumed and confirmed that 5 iterations are sufficient to omit this effect. The tested solutions were jTransform (DoubleFFT_1D.complexForward method) [38], metaprogramming STFT, and metaprogramming STFT with the Stockham extension. The results of execution time measurements are presented in Figures 6 and 7. For each data input length (N) tests were performed 30 times. Final results presented below stands for the average values.  The execution times for the three solutions are very similar. Major changes are recorded above N ≥ 1024, where the jTrans algorithm is clearly better due to its possibility of concurrent computing, but when lower N are examined the metaprogramming solutions show higher performance (zoom). In order to better data presentation, the percentage changes were computed (Equation (2)) and presented in Figure 7. As the reference algorithm (REF) the metaprog-stock was used.
Authors performed additional tests on other operating systems to exclude an influence of execution environment. Linux (Zorin distro) was used. Computer was equipped with 32 GB RAM and Intel ® Core™ i7 10gen 10750H processor. Results were different but ratio between three solutions is similar. Results were presented in Table 1. The test stand was prepared at the Laboratory of T. Dyakowski at the Institute of Applied Computer Science at the Lodz University of Technology. The elements of the structure and measurement system consisted of: silo model with a diameter of 200 mm and a height of 2000 mm (wall thickness 4 mm), 8-electrode capacitive sensor with a tomograph (30 frames/s). During the experiment, time courses of material concentration vibrations were obtained, which were processed using STFT. Further processing allowed to extract dominant harmonic which was changing during silo discharging. The experiments were carried out for non-cohesive and cohesive loose materials. Medium-grain quartz sand (medium grain diameter) was used as a non-cohesive material d 50 = 0.8 × 10 −3 m. A mixture of medium-grained sand and water was used as the cohesive material [1]. Result was presented in Figure 8. Figure presents material concentration changes in silo cross-section during discharging process. The rectangular window was used.

Analysis and Discussion
The solution was implemented with JavaPoet library. Using this tool, both the metaprogram and the final application could be prepared in the same programming language, which has a significant impact on its maintenance. Programmer is not forced to switch between different programming languages. JavaPoet is a sufficient tool to keep target application syntax valid. Even the output code is very extensive. Each line meets Java syntax requirements and will not cause compilation errors.
The improvement achieves 60% acceleration in the range mainly for N = 64-256, compared to other common solutions such as the JTransform library. The improvement was also recorded comparing current solution to the previous one. Difference is around 2.3% to 28% compared to the former solution without the Stockham extension.
Very interesting is fact that for small N (8, 16) and large (1024 and more) the presented solution performance is slower than the reference (jTransform). This observation I would divide into two separated problems. First refers to N = 8, 16. As it can be seen in Figure 7 jTransform is better than metaprogram solution but worse than metaprogram with stockham extension. Metaprogram solution had bit reverse permutation for N = 8 and when it was removed then the code sped up its performance. JTransform for N = 4 and N = 8 doesn't have reordering phase. For N > 32 there is a function 'bitrv2conj' which seems to be very expensive. Second issue: This results from the fact that jTransform solution uses strategy pattern and adjust implementation based on input array size N. When N is larger than allowed array size (hardcoded: N*2 > 1073741824) then so called large array strategy is performed. If N is not a power of two then Bluestein plan or mixed_radix plan is triggered otherwise split_radix is used. In initialization mode specific preprocessing is done. For an example the lookup table with trigonometric values is also generated. It was possible to retrieve the source code from repository, but despite this, due to the complexity, the algorithm is difficult to read. For sure algorithm is based on the parallel computing. It uses ConcurrencyUtils.submit() method to trigger parallel computations that extremely speed up solution for higher N.
Despite of execution time analysis, the size of generated application was also studied. As can be seen in Figure 8, the presented solution has some disadvantages. Code size grows rapidly with increasing window size and finally (for N = 2048) gives almost the same value as the whole jTransform library. For an instance, metaprog has 245 KB and metaprog-stock 229 KB for N = 256, and increasing to 554 KB and 522 KB for N = 512, respectively. Obviously, for modern computers these numbers are not crucial.
Memory complexity was not studied as it change only due to utilizing auxiliary table that is size of N (length of input array).
Generating the code of the Short Time Fourier Transform algorithm as presented in this article has many advantages: • introduction to the solution the domain analysis of the problem; the principle of solving the problem remains unchanged, but it is possible to improve program efficiency, • possibility to control the code that is generated, • removal of loops (reduction of computation time), • removal of calculations resulting from known parameters, • control of table indexes during application generation, • generator parameterization (parameter-time window length, parameter-FFT frame overlap coefficient), • reduction of arithmetic operations on the basis of mathematical properties.
The results clearly show that it is beneficial to use the new tool for STFT when the window size is in the range of N = . Despite its restrictions, the solution is sufficient for short time windows data processing, especially when the non-stationary signal changes rapidly and the time window should be limited. An example of its possible application may be the time-frequency analysis of the Electrical Capacitance Tomography data (ECT). ECT is used to measure the distribution of material concentration in a silo cross-section during the discharging process. The measurement frequency is low and reaches 30-150 Hz [1].

Conclusions
It is extremely difficult to improve such a common signal processing algorithm as STFT. However, this paper has presented a consecutive solution that allows to further reduce its execution time.
The method was elaborated as an extension of a previously developed application, in order to improve its efficiency. Adaptation of the Stockham concept was crucial to omit bit-reversal output data modification and obtain STFT values in order.
This was possible at the expense of memory needed to transfer intermediate results between the final two stages. The increased memory requirements are negligible due to the current capabilities of modern computers. Thanks to use of the metaprogramming technique, it was possible to move time-consuming operations into the code generation phase. The improvement achieves 60% acceleration.
It is worth to emphasize that the same as jTransform the solution is characterized by high portability as it is written in Java language. Future work will explore the possibility of applying parallel computation approaches to metaprogramming, which can dramatically increase the performance of STFT computations, particularly for larger input data.