High-Speed Wavelet Image Processing Using the Winograd Method with Downsampling

: Wavelets are actively used to solve a wide range of image processing problems in various ﬁelds of science and technology. Modern image processing systems cannot keep up with the rapid growth in digital visual information. Various approaches are used to reduce the computational complexity and increase computational speeds. The Winograd method (WM) is one of the most promising. However, this method is used to obtain sequential values. Its use for wavelet image processing requires expanding the calculation methodology to cases of downsampling. This paper proposes a new approach to reduce the computational complexity of wavelet image processing based on the WM with decimation. Calculations have been carried out and formulas have been derived that implement digital ﬁltering using the WM with downsampling. The derived formulas can be used for 1D ﬁltering with an arbitrary downsampling stride. Hardware modeling of wavelet image ﬁltering on an FPGA showed that the WM reduces the computational time by up to 66%, with increases in the hardware costs and power consumption of 95% and 344%, respectively, compared to the direct method. A promising direction for further research is the implementation of the developed approach on ASIC and the use of modular computing for more efﬁcient parallelization of calculations and an even greater increase in the device speed.


Introduction
Wavelets are actively used to solve a wide range of image processing problems in various fields of science and technology, e.g., image denoising [1], reconstruction [2], analysis [3], and video analysis and processing [4].Wavelet processing methods are based on the discrete wavelet transform using 1D digital filtering.Digital filtering is performed through repeated additions and multiplications and has a high computational complexity.Modern image processing systems cannot keep up with the rapid growth in the volume of digital visual information that needs to be processed, stored, and transmitted.Improving microelectronic devices is one of the state-of-the-art approaches to increase computational efficiency [5].Many approaches reduce the computational complexity of wavelet image processing, including various hardware architectures [6].The authors of [7] developed a multidimensional wavelet construction method that constructs multidimensional inseparable wavelet filter banks from two 1D low-pass filters, one of which is an interpolation filter, to improve image processing speed.In paper [8], the authors developed a new 2D transform called the asymmetric 2D Haar transform and extended it to wavelet packets with an exponentially larger number of bases.The authors of [9] developed an algorithm for the 2D discrete wavelet transform of high-resolution images on Internet of Things nodes.All of these methods are based on pixel-by-pixel image processing.The Winograd method (WM) is based on matrix multiplication and is used as a modern alternative to the classic direct method (DM).The WM reduces the computational complexity of image processing due to the simultaneous calculation of several output values, in contrast to the above methods.The processed image is assembled not from a collection of individual pixels, but from fragments of a certain size.This approach reduces the number of computationally complex multiplications by increasing the number of additions.The WM is used to increase the speed of neural network image processing algorithms [10].In [11], digital filtering algorithms based on the WM were proposed for convolutional layers of neural networks (NNs), which are superior to the fast Fourier transform in terms of the performance of deep NNs when processing large arrays of visual data.Based on this work, architectures [12] and hardware accelerators [13] have been developed to implement WM-based NN image processing algorithms.In [14], a digital filter architecture based on the WM in a residue number system was developed, which accelerated image processing while increasing hardware costs.However, the WM is designed to obtain groups of adjacent values, while wavelet filtering reduces the sampling rate of the digital signal.In this regard, there is a need to generalize the WM to cases of signal downsampling with a stride of two to implement wavelet filtering of images.
The purpose of this paper is to generalize the Winograd method to the case of convolution with downsampling and to increase the computational speed of wavelet image processing using the WM with decimation.
The rest of the paper is organized as follows.Section 2 describes approaches to wavelet image processing based on the DM and the WM.Section 3 presents a high-speed implementation of the discrete wavelet transform using the proposed approach.Section 4 contains discussion of the received results.Section 5 contains the conclusions.

A. Wavelet filtering using the direct method
Wavelet filtering with decimation of a 2D image along DM lines can be represented as: where Z and N are the processed and original 2D images, respectively, and u and y are the row and column numbers of pixels processed by the wavelet filter R of order r.Wavelet image processing using the DM is performed through two computational channels corresponding to low-frequency and high-frequency wavelet filters.The scheme of 1D wavelet filtering of a DM image fragment is shown in Figure 1, where R L and R H are lowpass and high-pass filters and Z L (u, y) and Z H (u, y) are the processed images containing low-and high-frequency information about the original image, respectively.
Internet of Things nodes.All of these methods are based on pixel-by-pixel image processing.The Winograd method (WM) is based on matrix multiplication and is used as a modern alternative to the classic direct method (DM).The WM reduces the computational complexity of image processing due to the simultaneous calculation of several output values, in contrast to the above methods.The processed image is assembled not from a collection of individual pixels, but from fragments of a certain size.This approach reduces the number of computationally complex multiplications by increasing the number of additions.The WM is used to increase the speed of neural network image processing algorithms [10].In [11], digital filtering algorithms based on the WM were proposed for convolutional layers of neural networks (NNs), which are superior to the fast Fourier transform in terms of the performance of deep NNs when processing large arrays of visual data.Based on this work, architectures [12] and hardware accelerators [13] have been developed to implement WM-based NN image processing algorithms.In [14], a digital filter architecture based on the WM in a residue number system was developed, which accelerated image processing while increasing hardware costs.However, the WM is designed to obtain groups of adjacent values, while wavelet filtering reduces the sampling rate of the digital signal.In this regard, there is a need to generalize the WM to cases of signal downsampling with a stride of two to implement wavelet filtering of images.The purpose of this paper is to generalize the Winograd method to the case of convolution with downsampling and to increase the computational speed of wavelet image processing using the WM with decimation.
The rest of the paper is organized as follows.Section 2 describes approaches to wavelet image processing based on the DM and the WM.Section 3 presents a high-speed implementation of the discrete wavelet transform using the proposed approach.Section 4 contains discussion of the received results.Section 5 contains the conclusions.

A. Wavelet filtering using the direct method
Wavelet filtering with decimation of a 2D image along DM lines can be represented as: where  and  are the processed and original 2D images, respectively, and  and  are the row and column numbers of pixels processed by the wavelet filter  of order .Wavelet image processing using the DM is performed through two computational channels corresponding to low-frequency and high-frequency wavelet filters.The scheme of 1D wavelet filtering of a DM image fragment is shown in Figure 1, where  and  are low-pass and high-pass filters and  (, ) and  (, ) are the processed images containing low-and high-frequency information about the original image, respectively.All pixels are processed by a pair of wavelet filters of order r and require 2r multiplications and 2(r − 1) additions in the DM of wavelet processing.Multiplications have a higher computational complexity than additions and require significant resource costs when digital filtering is implemented on modern microelectronic devices.The WM is one of the main alternatives to the classic DM and is discussed below.

B. Digital filtering using the Winograd method
The WM reduces the computational complexity of image processing by simultaneously computing multiple output values through matrix calculations.The WM formula for 1D image filtering has the following general form [15]: where Z is the fragment of the processed image of size z × 1; R is the wavelet filter mask of size r × 1; N is the fragment of the original image of size n × 1, where n = z + r − 1; A T , G, B T are the transformation matrices of sizes z × n, n × r, n × n, respectively; and is the operator of element-wise matrix multiplication.Algorithms for compiling transformation matrices are described in detail in paper [16].The notation WM F(n, r) contains processed image fragments of size n and the order r of the wavelet used.The sizes of the transformation matrices and the original image fragments depend on them.For example, WM F(4, 2) uses matrices However, the WM in its classic representation is a method based on processing groups of consecutive pixels, while image filtering through two computational channels during wavelet processing is performed with decimation to reduce the signal sampling frequency and reduce computational redundancy.Thus, there is a need to generalize and expand the WM to a more general case, in which a fragment of the processed image may consist of non-consecutive pixels.

C. Filtration using the Winograd method with downsampling
We expand the WM to the case of filtering with decimation for use in wavelet image processing.Let x 1 , x 2 , . . ., x n , . . ., x n+r−1 be the pixel brightness values of a certain fragment of a line of the original image N, and r 1 , r 2 , . . ., r r be the coefficients of the wavelet filter R.Then, the calculation of the values z 1 , z 2 , . . ., z n of the fragment of the processed image Z can be represented in matrix form: Expand of processing a fragment with 10 pixels (n + r − 1 = 10) with a fifth order filter (r = 5) to obtain six values (n = 6) of a fragment of the processed image Z at the output: When wavelet filtering with decimation, the values z 1 , z 3 , z 5 are calculated: The resulting calculations can be implemented by a combination of the WM F(3, 3) using pixel brightness values x 1 , x 3 , x 5 , x 7 , x 9 and filter coefficients r 1 , r 3 , r 5 , and F(3, 2) using pixel brightness values x 2 , x 4 , x 6 , x 8 and filter coefficients r 2 , r 4 .Thus, in this case, instead of the F(6, 5) method for wavelet filtering with decimation, you can use the F(3, 5, 2) = F(3, 3) + F(3, 2) method, where the third number in F(3, 5, 2) denotes the degree of sampling rate reduction.In the general case of filtering the WM F(n, r, d) and reducing the signal sampling frequency d times, calculations are organized using a combination of methods according to the formula: where s 1 and s 2 are the partial quotient and the remainder of dividing r by d, respectively.The notation of the WM F(n, r, d) contains the size n of the processed image fragments, the order r of the wavelet used, and the decimation stride d.
The processing of a fragment with 14 (n + r − 1 = 14) pixels with a six-tap order filter (r = 6) is expanded to obtain nine values (n = 9) of a fragment of the processed image Z at the output.For d = 3, the values of z 1 , z 4 , z 7 are calculated: The resulting calculations can be implemented by the F(3,2) WM using pixel brightness values of x 1 , x 4 , x 7 , x 10 and filter coefficients of r 1 , r 4 , for values x 2 , x 5 , x 8 , x 11 and coefficients r 2 , r 5 , as well as values of x 3 , x 6 , x 9 , x 12 and coefficients of r 3 , r 6 .Thus, in this case, instead of using the F(9, 6) WM when filtering with a stride of 3, you can use the F(3, 6, 3) = 3F(3, 2) method.In the special case when d divides r, Formula (2) will take the form: The original signal is divided into two groups of samples (even and odd) during wavelet processing of the WM F(n, r, d) images.In this case, the calculations are divided into two computational channels corresponding to even and odd signal samples.Multiplications of matrices GR L and GR H by the WM are performed a priori, once for each filter used, and do not require additional computational costs.The B T N multiplication is calculated before the calculations are divided into two channels, so it is the same for both wavelet filters used.Thus, the number of operations when processing wavelet images can be reduced by one calculation of B T N with the subsequent division of calculations into two channels.The elements of the transformation matrices A T and B T are also known in advance and consist of zeros, powers of twos, and numbers represented in binary notation as a set of ones.Multiplications by these elements can be represented as scalings and additions.For example, multiplying a number by 3 = 11 2 can be executed by moving the point of this number one bit to the right and adding it to the original number.Thus, all calculations according to Formula (1) are implemented using additions, with the exception of element-wise multiplication .Element-wise multiplication is performed once for two n × 1 matrices.The scheme for processing wavelet images with decimation for the WM is presented in Figure 2, where Z L and Z H are fragments of the image processed by the WM using wavelet filters R L and R H , respectively.
Mathematics 2023, 11, x FOR PEER REVIEW 5 of 10 be reduced by one calculation of   with the subsequent division of calculations into two channels.The elements of the transformation matrices  and  are also known in advance and consist of zeros, powers of twos, and numbers represented in binary notation as a set of ones.Multiplications by these elements can be represented as scalings and additions.For example, multiplying a number by 3 = 11 can be executed by moving the point of this number one bit to the right and adding it to the original number.Thus, all calculations according to Formula (1) are implemented using additions, with the exception of element-wise multiplication ⊙.Element-wise multiplication is performed once for two  × 1 matrices.The scheme for processing wavelet images with decimation for the WM is presented in Figure 2, where  and  are fragments of the image processed by the WM using wavelet filters  and  , respectively.

D. Computationally Efficient Data Representation
All data in digital devices are stored with limited accuracy.Thus, it is necessary to quantize the coefficients of wavelet filters.For experiments, we use Daubechies wavelets db2 and db3.During wavelet processing of 8-bit images, the coefficients of the filters used are represented with an accuracy selected in accordance with the formula [17]: where  is the capacity of the quantized wavelet filter coefficients without taking into account the sign bit and ⌊•⌋ is the rounding down operator.The original filter coefficients  are scaled by  bits and rounded up: where  * is a quantized filter and ⌈•⌉ is the rounding up operator.For example, the initial coefficients of the high-frequency wavelet filter db2 are: According to Formula (3), for the considered wavelet filter,  = 11.The coefficients of the wavelet filter are quantized using Formula (4): The resulting values are scaled and rounded down after WM wavelet filtering to compensate for the rounding error.Thus, the limited accuracy of data representation in the device memory will not have a significant impact on the quality of wavelet image processing.

D. Computationally Efficient Data Representation
All data in digital devices are stored with limited accuracy.Thus, it is necessary to quantize the coefficients of wavelet filters.For experiments, we use Daubechies wavelets db2 and db3.During wavelet processing of 8-bit images, the coefficients of the filters used are represented with an accuracy selected in accordance with the formula [17]: where f is the capacity of the quantized wavelet filter coefficients without taking into account the sign bit and • is the rounding down operator.The original filter coefficients F are scaled by f bits and rounded up: where F * is a quantized filter and • is the rounding up operator.For example, the initial coefficients of the high-frequency wavelet filter db2 are: According to Formula (3), for the considered wavelet filter, f = 11.The coefficients of the wavelet filter are quantized using Formula (4): HD = −699 1212 −324 −187 .
The resulting values are scaled and rounded down after WM wavelet filtering to compensate for the rounding error.Thus, the limited accuracy of data representation in the device memory will not have a significant impact on the quality of wavelet image processing.
Below, we present the results of implementing both considered approaches to wavelet image processing on FPGAs and an analysis of the results obtained.

High-Speed Implementation of the Discrete Wavelet Transform Using the Direct Method and the Winograd Method
Hardware modeling of the discrete wavelet transform using the DM and the WM with decimation was carried out on an FPGA in the Xilinx Vivado 2018.2 environment in the Verilog on FPGA Family Virtex 7 on the board "xc7vx485tffg1157-1" with the synthesis parameters "Vivado Synthesis Defaults" and implementation parameters "Vivado Implementation Defaults", without using DSP blocks.Wallace tree structures were used as tools for implementing addition and multiplication, with further addition performed using carry-saving adders [18] and Kogge-Stone adders [19].These adders perform calculations in the least amount of time [20].The 8-bit pixels of the original image were used as input.Daubechies wavelets of four-tap and six-tap order were selected, the filter coefficients of which are quantized by 11 bits.The output data are 8-bit pixels of the processed image, since the WM receives several values of processed pixels in one iteration.The methods were evaluated based on the average resource costs for processing each pixel for a correct comparison.The simulation results are presented in Table 1.The graphs in Figures 3-6 show the hardware, time, and power consumption costs separately, as well as the area-delay product (ADP).         .Area-delay product with wavelet image processing using the direct method (one pixel) and the Winograd method (two-five pixels) with averaged values for each pixel.

Discussion
The following conclusions can be drawn according to the results presented in Table 1 and Figures 3-6: 1.The hardware costs for wavelet processing of WM images compared to the DM increase from 12% (|692.

Discussion
The following conclusions can be drawn according to the results presented in The best efficiency in terms of the ADP is observed when using the F(3,4,2) and F(5,6,2) WMs using four-tap and six-tap wavelets, respectively.5.
In general, the WM significantly speeds up calculations due to a moderate increase in hardware costs and power consumption.6.
The increase in the speed of wavelet image processing will be insignificant compared to the increases in hardware costs and power consumption and the further increase in the size of the processed image fragments using the WM.This will also significantly increase the calculation error and degrade the quality of image processing.
Table 2 presents a comparison of wavelet transform methods.In [21], a standard wavelet transform method was presented, using which hardware simulations were carried out and compared with the proposed method.The proposed method showed the best results in terms of the delay and overall device performance.Paper describes a wavelet transform method called the lifting scheme.The peculiarity of this method is the sequence of the action steps: split, predict, and update.The lifting scheme allows a reduction in the hardware costs of the device, but at the same time increases the computation time.The method is applicable only to a limited set of wavelets.In [23], a fast continuous wavelet transform was implemented, and it requires large time and hardware costs to implement.The method developed in this paper is free of these disadvantages and is a promising digital signal processing tool for practical use and can be used in fusion problems of noise removal, compression, and combination of signals.

Conclusions
In this paper, calculations were carried out and Formulas (2) and (3) were derived, which implement wavelet filtering using the WM with decimation.The scope of application of the WM has been expanded to the case of downsampling a signal with an arbitrary stride during processing.The main result from hardware simulations of the developed approach on an FPGA is a reduction in the computational delay by up to 66%, with increases in hardware costs and power consumption of 95% and 344%, respectively, compared to the DM depending on the selected method parameters.The best efficiency in terms of the ADP was found for the F(3, 4, 2) and F(5, 6, 2) WMs using four-tap and six-tap wavelets, respectively.The developed approach can be used in image processing systems to improve the performance of modern microelectronic devices for image denoising and compression, as well as pattern recognition.A promising direction for further research is the implemen-

Figure 1 .
Figure 1.Wavelet filtering scheme with decimation of a fragment of the original image using the direct method.

Figure 1 .
Figure 1.Wavelet filtering scheme with decimation of a fragment of the original image using the direct method.

Figure 2 .
Figure 2. Wavelet filtering scheme with decimation of a fragment of the original image using the Winograd method.

Figure 2 .
Figure 2. Wavelet filtering scheme with decimation of a fragment of the original image using the Winograd method.

Figure 3 .Figure 3 .
Figure 3.Hardware costs for wavelet image processing using the direct method (one pixel) and the Winograd method (two-five pixels) with averaged values for each pixel.

Figure 3 .
Hardware costs for wavelet image processing using the direct method (one pixel) and the Winograd method (two-five pixels) with averaged values for each pixel.

Figure 4 .
Figure 4. Delay results for wavelet image processing using the direct method (one pixel) and the Winograd method (two-five pixels) with averaged values for each pixel.

Figure 5 .Figure 4 .
Figure 5. Results of power consumption during wavelet image processing using the direct method (one pixel) and the Winograd method (two-five pixels) with averaged values for each pixel.

Figure 3 .
Figure 3.Hardware costs for wavelet image processing using the direct method (one pixel) and the Winograd method (two-five pixels) with averaged values for each pixel.

Figure 4 .
Figure 4. Delay results for wavelet image processing using the direct method (one pixel) and the Winograd method (two-five pixels) with averaged values for each pixel.

Figure 5 .Figure 5 .
Figure 5. Results of power consumption during wavelet image processing using the direct method (one pixel) and the Winograd method (two-five pixels) with averaged values for each pixel.

Figure 6
Figure6.Area-delay product with wavelet image processing using the direct method (one pixel) and the Winograd method (two-five pixels) with averaged values for each pixel.

Table 1 .
Results of modeling a wavelet image processing device using the direct method and the Winograd method with averaged values for each pixel.

Table 2 .
Comparison of wavelet transform methods.