Convolution Accelerator Designs Using Fast Algorithms

Abstract: Convolutional neural networks (CNNs) have achieved great success in image processing. However, the heavy computational burden they impose makes them difficult to use in embedded applications with limited power and performance budgets. Although many fast convolution algorithms can reduce the computational complexity, they increase the difficulty of practical implementation. To overcome these difficulties, this paper proposes several convolution accelerator designs using fast algorithms. The designs are based on the field-programmable gate array (FPGA) and strike a better balance between digital signal processor (DSP) and logic resources, while also requiring lower power consumption. The implementation results show that the power consumption of the accelerator design based on the Strassen-Winograd algorithm is 21.3% less than that of conventional accelerators.


Introduction
In recent years, convolutional neural networks (CNNs) have been widely used in computer analysis of visual imagery, autonomous driving and other fields because of their high accuracy in image processing [1][2][3][4]. Many embedded applications have limited performance and power budgets [5,6], while CNNs are computationally intensive and consume large amounts of processing time, which makes them difficult to apply in such applications.
At present, there are two approaches to accelerating CNNs. One is to reduce the computational complexity of the neural network; many such methods have been proposed that maintain accuracy, including quantization, pruning, sparsity and fast convolution [7][8][9][10][11][12][13]. The other is to use a high-performance, low-power hardware accelerator. A graphics processing unit (GPU) is an outstanding processor, but its power consumption is very high [14]. In contrast, the field-programmable gate array (FPGA) and the application-specific integrated circuit (ASIC) are energy-efficient integrated circuits. The FPGA has been extensively explored because of its flexibility [15][16][17][18][19].
In general, the convolution operation occupies most of the computation time of a CNN. Several fast algorithms can reduce the computational complexity of convolutions without losing accuracy [19,20]. Reference [21] evaluated the Winograd algorithm on FPGAs and demonstrated its efficiency. In our previously published paper [22], we proposed a faster algorithm based on the Strassen and Winograd algorithms. Here, we propose several convolution accelerator designs using fast algorithms. The implementation results on an FPGA show that the design based on the Strassen-Winograd algorithm can reduce the power consumption by 21.3% compared to the traditional design.
The rest of this paper is organized as follows: Section 2 provides a detailed description of the CNN and fast algorithms, including the Strassen, Winograd and Strassen-Winograd algorithms; Section 3 describes the architecture of the accelerator designs based on fast algorithms; Section 4 describes the implementation results; and Section 5 presents the conclusions of our study.

Convolutional Neural Network (CNN)
A CNN generally consists of several layers, each composed of input feature maps, kernels and output feature maps. The convolutional layers carry most of the computation requirements in the network. Figure 1 shows the main process of the convolution, in which the convolutional layer extracts different features from the input feature maps via different kernels.
Algorithms 2019, 12, x FOR PEER REVIEW

The entire convolution can be expressed as Equation (1), where Y denotes the output feature maps, W denotes the kernels of size Mw × Nw and X denotes the input feature maps of size Mx × Nx. The number of input feature maps is Q, and the number of output feature maps is R. The subscripts x and y indicate the position of the pixel in the feature map, while the subscripts u and v indicate the position of the parameter in the kernel.

$$Y_{r,x,y} = \sum_{q=1}^{Q} \sum_{u=1}^{M_w} \sum_{v=1}^{N_w} W_{r,q,u,v} \, X_{q,x+u,y+v} \qquad (1)$$

Equation (1) can be rewritten simply as Equation (2):

$$Y_r = \sum_{q=1}^{Q} W_{r,q} * X_q \qquad (2)$$

We can use the special matrices Y, W and X to indicate all of the output feature maps, kernels and input feature maps, respectively. As Equation (4) shows, each element in the matrix denotes a feature map or a kernel. The computation of Y can be expressed as the special matrix convolution in Equation (3). In this case, we can use the fast algorithms to accelerate this special matrix convolution.

$$Y = W * X \qquad (3)$$
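As a point of reference for the fast algorithms that follow, the direct form of the layer convolution described by Equation (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the array layouts, the function name conv_layer, unit stride, no padding and zero-based index offsets are assumptions, not details taken from the paper.

```python
import numpy as np

def conv_layer(X, W):
    """Direct layer convolution in the spirit of Equation (1).

    X: input feature maps, shape (Q, Mx, Nx)   (layout is an assumption)
    W: kernels, shape (R, Q, Mw, Nw)           (layout is an assumption)
    Returns Y with shape (R, Mx - Mw + 1, Nx - Nw + 1).
    """
    R, Q, Mw, Nw = W.shape
    _, Mx, Nx = X.shape
    My, Ny = Mx - Mw + 1, Nx - Nw + 1  # assumed: stride 1, no padding
    Y = np.zeros((R, My, Ny))
    for r in range(R):                  # each output feature map
        for q in range(Q):              # accumulate over input feature maps
            for u in range(Mw):         # kernel row
                for v in range(Nw):     # kernel column
                    Y[r] += W[r, q, u, v] * X[q, u:u + My, v:v + Ny]
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 6, 6))      # Q = 2 input maps of size 6 x 6
W = rng.standard_normal((3, 2, 3, 3))   # R = 3 output maps, 3 x 3 kernels
Y = conv_layer(X, W)
```

Each output pixel costs Q × Mw × Nw multiplications here, which is the cost that the Strassen and Winograd algorithms aim to reduce.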

Strassen Algorithm
The Strassen algorithm is a fast way to perform matrix multiplication (a detailed description of the Strassen algorithm is given in Appendix A). For the product of 2 × 2 matrices, the Strassen algorithm requires 7 multiplications and 18 additions, in contrast to the conventional algorithm, which requires 8 multiplications and 4 additions. Winograd's variant of the Strassen algorithm needs only 15 additions [23], and it achieves relatively good results on GPUs [24].
Moreover, the Strassen algorithm is applicable to the special matrix convolution in Equation (3) [19] and is able to reduce the number of convolutions from eight to seven. The implementation for a 2 × 2 matrix is shown in Algorithm 1. The function Conv() completes the convolution between one feature map and a kernel.
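The idea can be sketched as follows. This is not a reproduction of Algorithm 1; it uses the textbook Strassen formulas and assumes that both W and X are 2 × 2 matrices whose elements are kernels and feature maps (which is what the count of eight conventional convolutions implies). The function names conv, conventional and strassen are hypothetical.

```python
import numpy as np

def conv(x, w):
    """Stand-in for Conv(): valid sliding-window convolution of one map with one kernel."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow))
    for u in range(kh):
        for v in range(kw):
            y += w[u, v] * x[u:u + oh, v:v + ow]
    return y

def conventional(W, X):
    """Eight convolutions: Y[i][j] = Conv(X[0][j], W[i][0]) + Conv(X[1][j], W[i][1])."""
    return [[conv(X[0][j], W[i][0]) + conv(X[1][j], W[i][1]) for j in range(2)]
            for i in range(2)]

def strassen(W, X):
    """Seven convolutions, using the textbook Strassen formulas (assumed form)."""
    m1 = conv(X[0][0] + X[1][1], W[0][0] + W[1][1])
    m2 = conv(X[0][0], W[1][0] + W[1][1])
    m3 = conv(X[0][1] - X[1][1], W[0][0])
    m4 = conv(X[1][0] - X[0][0], W[1][1])
    m5 = conv(X[1][1], W[0][0] + W[0][1])
    m6 = conv(X[0][0] + X[0][1], W[1][0] - W[0][0])
    m7 = conv(X[1][0] + X[1][1], W[0][1] - W[1][1])
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

rng = np.random.default_rng(1)
W = [[rng.standard_normal((3, 3)) for _ in range(2)] for _ in range(2)]
X = [[rng.standard_normal((8, 8)) for _ in range(2)] for _ in range(2)]
Y_ref = conventional(W, X)
Y_fast = strassen(W, X)
```

The substitution works because Conv() is linear in both of its arguments, so the additions and subtractions of the Strassen scheme can be applied to whole feature maps and kernels.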

Winograd Algorithm
The Winograd minimal filtering algorithm is a fast approach for convolutions. For a convolution between a 4 × 4 map and a 3 × 3 kernel, this algorithm reduces the number of multiplications from 36 to 16 (a detailed description of the Winograd algorithm is given in Appendix B). To apply this algorithm to the convolution in Equation (3), we need to divide each input feature map into smaller 4 × 4 feature maps.

Strassen-Winograd Algorithm
Our previous work [22] showed that the Strassen and Winograd algorithms can be used together in matrix convolutions. Algorithm 1 provides the implementation of the Strassen algorithm, where the function Conv() is the main computation in the entire process. We can apply the Winograd algorithm to the function Conv() and denote the result as the Strassen-Winograd algorithm. A brief description is given in Algorithm 2. The function Winograd() is the convolution using the Winograd algorithm.
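A minimal sketch of this combination is given here. It is not a reproduction of Algorithm 2: the transform matrices are the commonly published F(2 × 2, 3 × 3) matrices, named K, L and O after Appendix B, and the assumption that W and X are both 2 × 2 matrices of kernels and feature maps, the tile stride of 2 and all function names are illustrative choices.

```python
import numpy as np

# Commonly published F(2x2, 3x3) transforms (assumed; the paper's own matrices are not reproduced).
K = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)                      # product transform
L = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])        # kernel transform
O = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)  # data transform

def winograd(x, w):
    """Stand-in for Winograd(): 3x3 valid convolution computed on overlapping 4x4 tiles."""
    oh, ow = x.shape[0] - 2, x.shape[1] - 2
    assert oh % 2 == 0 and ow % 2 == 0, "sketch assumes an even output size"
    wt = L @ w @ L.T                       # kernel transformed once, reused for every tile
    y = np.zeros((oh, ow))
    for i in range(0, oh, 2):              # tiles advance by 2 pixels and overlap by 2
        for j in range(0, ow, 2):
            xt = O @ x[i:i + 4, j:j + 4] @ O.T
            y[i:i + 2, j:j + 2] = K @ (wt * xt) @ K.T   # 16 multiplications per tile
    return y

def direct(x, w):
    """Reference 3x3 valid convolution used only for checking."""
    oh, ow = x.shape[0] - 2, x.shape[1] - 2
    return sum(w[u, v] * x[u:u + oh, v:v + ow] for u in range(3) for v in range(3))

def strassen_winograd(W, X):
    """Strassen scheme with Winograd() in place of Conv(): seven tiled convolutions."""
    m1 = winograd(X[0][0] + X[1][1], W[0][0] + W[1][1])
    m2 = winograd(X[0][0], W[1][0] + W[1][1])
    m3 = winograd(X[0][1] - X[1][1], W[0][0])
    m4 = winograd(X[1][0] - X[0][0], W[1][1])
    m5 = winograd(X[1][1], W[0][0] + W[0][1])
    m6 = winograd(X[0][0] + X[0][1], W[1][0] - W[0][0])
    m7 = winograd(X[1][0] + X[1][1], W[0][1] - W[1][1])
    return [[m1 + m4 - m5 + m7, m3 + m5], [m2 + m4, m1 - m2 + m3 + m6]]

rng = np.random.default_rng(2)
W = [[rng.standard_normal((3, 3)) for _ in range(2)] for _ in range(2)]
X = [[rng.standard_normal((8, 8)) for _ in range(2)] for _ in range(2)]
Y_sw = strassen_winograd(W, X)
Y_ref = [[direct(X[0][j], W[i][0]) + direct(X[1][j], W[i][1]) for j in range(2)]
         for i in range(2)]
```

The two savings are independent: Strassen removes one of the eight map-level convolutions, and Winograd reduces the multiplications inside each remaining convolution.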

Architecture Design
In this paper, we propose four convolution accelerator designs based on the conventional, Strassen, Winograd and Strassen-Winograd algorithms. All our designs target the same matrix convolution with a 3 × 3 kernel, a 224 × 224 input feature map and a 2 × 2 matrix. A detailed description of the architecture is given below.

Conventional Design
There are eight convolutions between the input feature maps and the kernels for the convolution of a 2 × 2 matrix. Figure 2 shows the architecture of our design with eight convolution modules (ConvNxN) instantiated in parallel. We divide the conventional algorithm into two stages. In Stage 1, input feature maps and kernels are sent to the module ConvNxN, which is designed for the convolution between the feature map and the kernel. In Stage 2, the output results from the convolutions are summed to obtain the output feature maps.
Figure 3 shows the architecture of the module ConvNxN. Usually, a convolution requires high memory bandwidth. To reduce this bandwidth demand, we designed a data buffer to maximize data reuse and stored all kernels in on-chip memory. Every clock cycle, four pixels of the feature map are sent into the buffer module, which is a FIFO consisting of a group of shift registers. The width of the FIFO is 4 pixels, and the depth of the FIFO is 112 + 2 pixels. The data buffering process can be divided into three stages as follows:
Stage b: as Figure 4b shows, four pixels are written into the FIFO, and four pixels are read out simultaneously (the FIFO is full in this stage). The eight earliest written pixels and the eight latest written pixels constitute a group of 4 × 4 pixels, shown as the dotted box in Figure 4b. This group of pixels is sent to the Conv module for the convolution operation.
Stage c: as Figure 4c shows, the eight earliest written pixels and the eight latest written pixels cannot constitute a group of 4 × 4 pixels. No valid pixels are sent to the Conv module during this clock cycle.

Strassen Design
The Strassen algorithm can reduce the number of convolutions from eight to seven. The architecture of the matrix convolution based on the Strassen algorithm is shown schematically in Figure 6, which shows that the Strassen algorithm can be divided into three stages. In Stage 1, the input data are transformed using the parameter matrix in Equation (A16) before being sent to the ConvNxN module. In Stage 2, only seven ConvNxN modules are instantiated in parallel (the ConvNxN module is the same as in Figure 2). In Stage 3, to obtain the output feature maps, the output results from the ConvNxN modules are transformed again according to Equation (A16) in Appendix A.


Winograd Design
The architecture of the matrix convolution based on the Winograd algorithm is shown schematically in Figure 7. In the figure, the design based on the Winograd algorithm is divided into two stages. In Stage 1, eight convolution modules are needed to complete the convolution between the input feature maps and the kernels. In Stage 2, the output results from the convolutions are summed to obtain the output feature maps.
The FilterNxN module utilizes the Winograd algorithm to complete the convolution. Figure 8 shows the architecture of the FilterNxN module, which adopts the same buffering strategy and module as the ConvNxN module. All the kernels are stored in registers.
Figure 9 shows the dataflow of the FilterNxN module. We can see from the figure that a 4 × 4 block of data and a 3 × 3 kernel are sent each clock cycle and are first transformed according to Equation (A26). Element-wise multiplication is performed immediately following the transformation. Finally, the product is transformed using the matrix K and the matrix K^T. We can see from Equation (A26) that 16 multipliers are needed in this module. That is, the design in Figure 7 requires a total of 8 × 16 = 128 multipliers.

Strassen-Winograd Design
As shown in Section 2.4, the Strassen and Winograd algorithms can be used together in the matrix convolution. The architecture of the convolution based on the Strassen-Winograd algorithm is shown in Figure 10.

Implementation Results
All the designs were evaluated on a Xilinx Kintex-7 325T FPGA. The designs were simulated and implemented in Vivado 2014.3 using Verilog; all data used 16-bit fixed-point precision, and the FPGA implementation ran at 100 MHz.
We simulated our designs using the Vivado Simulator with the same input feature maps and kernels and recorded the time consumption to compare their processing performance, as shown in Table 1. The time consumption is counted from the first to the last output pixel. We can see from Table 1 that all the designs require almost the same time to complete the same matrix convolution. That is, these designs have nearly the same processing performance.
Table 2 provides a detailed description of the resource utilization. The table shows that the design based on the Strassen algorithm uses fewer resources than the conventional design. The design based on the Winograd algorithm requires less than half of the digital signal processors (DSPs) required by the conventional design, but more of the other resources, such as registers and look-up tables (LUTs). Based on the utilization rate of the resources, the Winograd algorithm is observed to improve the overall resource utilization. Though more DSPs have been added in recent FPGAs, the DSP is still a limiting resource in most cases, except for the Winograd algorithm. Compared to the Winograd design, the design based on the Strassen-Winograd algorithm requires even fewer resources.
Xilinx provides Vivado Power Analysis for power estimation. It provides accurate estimates because it can read the exact logic and routing resources from the implemented design, and it presents the power report from different views. The power consumption of the FPGA consists of the static and dynamic power, the latter accounting for most of the total power consumption. Dynamic power consists of the clock power, signal power, logic power and DSP power. The detailed power consumption is recorded in Table 3 and Figure 11.
Table 3 and Figure 11 give detailed descriptions of the power distribution. We can see from the table that the DSP power occupies a large portion of the dynamic power consumption of the FPGA. The total dynamic power of the design based on the Strassen algorithm is 4% less than that of the conventional design. However, the signal and logic powers of the design based on the Strassen algorithm are higher than those of the conventional design. This is because the Strassen algorithm increases the signal rate. The total dynamic power of the design based on the Winograd algorithm is 20.5% less than that of the conventional design. The total dynamic power of the design based on the Strassen-Winograd algorithm is 21.3% less than that of the conventional design. Similarly, since the Strassen algorithm increases the signal rate, the signal and logic powers of the design based on the Strassen-Winograd algorithm are higher than those of the Winograd algorithm-based design.
We can see from Table 2 that the DSP is still a limiting resource for most designs. The designs in this paper are used for matrix convolution with a 2 × 2 matrix, and they require multiple parallel instances to achieve high performance. From Table 2, we can calculate the maximum number of instances that fit on this FPGA: 2 for the conventional design, 3 for the Strassen design, 6 for the Winograd design and 7 for the Strassen-Winograd design. Thus, because each instance achieves nearly the same processing performance, the best Strassen-Winograd configuration on this FPGA delivers 3.5 times the performance of the best conventional configuration.
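The instance counts above can be cross-checked with a rough DSP-only estimate. The sketch below is an approximation under stated assumptions: it uses the multiplier counts given in the design descriptions, assumes one DSP slice per 16-bit multiplier, and takes 840 as the number of DSP slices on the Kintex-7 325T from the vendor's product table (a figure not stated in this paper). It ignores LUT and register limits, which Table 2 may show to be the binding constraint for some designs.

```python
# Assumption: DSP slice count of the Kintex-7 325T, taken from the vendor product table.
DSP_SLICES = 840

# (number of convolution modules, multipliers per module) for each design.
designs = {
    "conventional": (8, 36),       # eight ConvNxN modules
    "Strassen": (7, 36),           # seven ConvNxN modules
    "Winograd": (8, 16),           # eight FilterNxN modules
    "Strassen-Winograd": (7, 16),  # seven FilterNxN modules
}

# Multipliers per design instance, assuming one DSP slice per multiplier.
multipliers = {name: n * m for name, (n, m) in designs.items()}

# DSP-bound upper limit on the number of parallel instances.
instances = {name: DSP_SLICES // m for name, m in multipliers.items()}

# Relative throughput, assuming equal per-instance performance.
speedup = instances["Strassen-Winograd"] / instances["conventional"]

for name in designs:
    print(f"{name}: {multipliers[name]} multipliers, up to {instances[name]} instances")
```

Under these assumptions the estimate agrees with the instance counts quoted above, but it should be read as a sanity check rather than a substitute for the measured utilization in Table 2.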

Conclusions
CNNs have achieved high accuracy in many image-processing fields, but one major problem is the heavy computational burden, especially for matrix convolutions. Several fast algorithms can reduce the computational complexity, but they are difficult to implement in hardware. Therefore, several matrix convolution accelerator designs based on the Winograd and Strassen-Winograd algorithms are proposed. The implementation results confirm that the proposed designs improve overall resource utilization and reduce power consumption. Moreover, they increase the performance of the best designs that fit on the same FPGA.

Appendix A
The Strassen algorithm computes the product of two matrices through seven temporary variables, m1, m2, m3, m4, m5, m6 and m7. The algorithm is effective as long as the matrix has an even number of rows and columns [19].
If we expand matrices A, B and C into vector form, the entire process of this algorithm can be expressed as Equation (A14).
We denote the first, second and third parameter matrices in Equation (A14) as matrices E, G and D, respectively. Equation (A14) can be rewritten as Equation (A16), where • indicates element-wise multiplication. We can regard matrices G, D and E as transform parameter matrices. The Strassen algorithm can then be expressed as follows. First, matrices G and D are used to transform vectors A and B. Then, element-wise multiplication is performed. Finally, matrix E is used to transform the product.
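The transform view can be illustrated numerically. The matrices below are derived from the textbook Strassen formulas with row-major vectorization of the 2 × 2 matrices; they are an assumption for illustration and may differ in row order or sign convention from the parameter matrices in Equation (A14).

```python
import numpy as np

# Rows of G and D give the left and right factors of m1..m7 (textbook Strassen, assumed form).
G = np.array([[ 1, 0, 0,  1],   # a11 + a22
              [ 0, 0, 1,  1],   # a21 + a22
              [ 1, 0, 0,  0],   # a11
              [ 0, 0, 0,  1],   # a22
              [ 1, 1, 0,  0],   # a11 + a12
              [-1, 0, 1,  0],   # a21 - a11
              [ 0, 1, 0, -1]])  # a12 - a22
D = np.array([[ 1, 0, 0,  1],   # b11 + b22
              [ 1, 0, 0,  0],   # b11
              [ 0, 1, 0, -1],   # b12 - b22
              [-1, 0, 1,  0],   # b21 - b11
              [ 0, 0, 0,  1],   # b22
              [ 1, 1, 0,  0],   # b11 + b12
              [ 0, 0, 1,  1]])  # b21 + b22
# Rows of E recombine m1..m7 into c11, c12, c21, c22.
E = np.array([[1,  0, 0, 1, -1, 0, 1],
              [0,  0, 1, 0,  1, 0, 0],
              [0,  1, 0, 1,  0, 0, 0],
              [1, -1, 1, 0,  0, 1, 0]])

def strassen_2x2(A, B):
    """C = E((G a) • (D b)) with a, b the row-major vectors of A and B."""
    m = (G @ A.reshape(4)) * (D @ B.reshape(4))  # seven element-wise products
    return (E @ m).reshape(2, 2)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
C = strassen_2x2(A, B)
```

The element-wise product has seven entries, one per multiplication, while the nonzero pattern of G, D and E accounts for the 18 additions.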

Appendix B
This section explains the mathematical theory behind Winograd's minimal filtering algorithm. First, we introduce the Winograd algorithm with a one-dimensional vector convolution. We denote a three-tap finite impulse response (FIR) filter with two outputs as F(2,3). The input data are x1, x2, x3 and x4, and the parameters of the filter are w1, w2 and w3. The conventional algorithm for F(2,3) is given by Equation (A17).
The process of the minimal filtering algorithm is given by Equations (A18)-(A22) [26]. We see from the process that only four multiplications are needed. The entire process can be written in matrix form as Equation (A23).
We denote the first, second and third parameter matrices in Equation (A23) as K, L and O, respectively. Equation (A23) can be rewritten as Equation (A24).
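A small numerical check of F(2,3) in this matrix form is sketched below. The entries of K, L and O are the commonly published F(2,3) transforms and are assumed here; the paper's own matrices in Equation (A23) may place the scaling factors differently.

```python
import numpy as np

# Commonly published F(2,3) transforms (assumed values), named as in the text.
K = np.array([[1, 1, 1, 0],
              [0, 1, -1, -1]], dtype=float)       # transforms the product
L = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]])                          # transforms the filter
O = np.array([[1, 0, -1, 0],
              [0, 1, 1, 0],
              [0, -1, 1, 0],
              [0, 1, 0, -1]], dtype=float)         # transforms the data

def f23(x, w):
    """Two outputs of a three-tap FIR filter using four multiplications."""
    return K @ ((L @ w) * (O @ x))

x = np.array([1.0, 2.0, 3.0, 4.0])   # x1..x4
w = np.array([1.0, 0.0, -1.0])       # w1..w3
y_fast = f23(x, w)
y_direct = np.array([x[0:3] @ w, x[1:4] @ w])  # conventional form: six multiplications
```

Because the filter transform L @ w does not depend on the data, it can be computed once per kernel, which is why only the four element-wise products count as run-time multiplications.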
The convolution of a two-dimensional image with a filter can be generalized as Equation (A26) [20]:

$$Y = K \left[ \left( L W L^{T} \right) \bullet \left( O X O^{T} \right) \right] K^{T} \qquad (A26)$$

$$W = \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \end{bmatrix}, \qquad X = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} \\ x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} \\ x_{3,1} & x_{3,2} & x_{3,3} & x_{3,4} \\ x_{4,1} & x_{4,2} & x_{4,3} & x_{4,4} \end{bmatrix}$$

where the superscript T indicates the transpose operator. We see from Equation (A26) that calculating the convolution between 4 × 4 data and a 3 × 3 kernel requires matrices L and L^T to transform the kernel and matrices O and O^T to transform the data before element-wise multiplication. Finally, we use matrices K and K^T to transform the product. Compared with the conventional algorithm, this algorithm reduces the number of multiplications from 36 to 16.
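The two-dimensional case can be checked on a single 4 × 4 tile in the same way. As before, the numeric entries of K, L and O are the commonly published F(2 × 2, 3 × 3) transforms and are an assumption; only the structure (kernel transform, data transform, element-wise product, output transform) is taken from the text.

```python
import numpy as np

# Commonly published transforms (assumed values), named as in the text.
K = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)
L = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]])
O = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)

def winograd_tile(X, W):
    """2 x 2 output for one 4 x 4 data tile X and one 3 x 3 kernel W."""
    Wt = L @ W @ L.T          # 4 x 4 transformed kernel
    Xt = O @ X @ O.T          # 4 x 4 transformed data
    P = Wt * Xt               # 16 element-wise multiplications
    return K @ P @ K.T, P.size

def direct_tile(X, W):
    """Conventional form: 4 outputs x 9 products = 36 multiplications."""
    Y = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            Y[i, j] = np.sum(W * X[i:i + 3, j:j + 3])
    return Y

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 4))
W = rng.standard_normal((3, 3))
Y_fast, n_mult = winograd_tile(X, W)
Y_ref = direct_tile(X, W)
```

In a hardware setting the kernel transform would typically be applied once and stored, so the 16 element-wise products are the multiplications that map onto DSP resources.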

Figure 1. Description of a convolution in CNNs.

Algorithm 1. Implementation of the Strassen algorithm.

Figure 2. Schematic representation of the architecture for the conventional design.

Figure 4. Schematic representation of the input sequence for the ConvNxN module.

The Conv module is designed to compute the convolution; Figure 5 shows its data flow, which completes the multiplications and additions in the convolution. The Conv module is pipelined. Four convolutions between the 4 × 4 input feature map and the 3 × 3 kernel are executed in parallel in the Conv module. Thus, the process requires 4 × 3 × 3 = 36 multipliers. There are eight ConvNxN modules in Figure 2, which need a total of 8 × 36 = 288 multipliers.

Figure 5. Data flow of the conventional convolution algorithm.


Figure 6. Architecture of the matrix convolution based on the Strassen algorithm.

Figure 7. Architecture of the matrix convolution based on the Winograd algorithm.

Figure 8. Schematic representation for the architecture of the FilterNxN module.

Figure 9. Dataflow of the convolution based on the Winograd algorithm.


Figure 10. Architecture of the matrix convolution based on the Strassen-Winograd algorithm.

As shown in Figure 10, the design based on the Strassen-Winograd algorithm is divided into three stages. In Stage 1, the input data are transformed before being sent to the FilterNxN module. In Stage 2, seven FilterNxN convolution modules are instantiated in parallel. The FilterNxN module is the same as those in Figure 7. The output data are then transformed in Stage 3 before being sent out. The FilterNxN module needs 16 multipliers; thus, the design based on the Strassen-Winograd algorithm requires a total of 7 × 16 = 112 multipliers.

Figure 11. Power consumption of designs based on different algorithms.

Table 1. Time consumption records for the different designs.

Table 3. Power consumption for different designs.