# OpenCNN: A Winograd Minimal Filtering Algorithm Implementation in CUDA


## Abstract


## 1. Introduction

- Some libraries such as cuDNN contain important sections of machine assembly code;
- Nvidia engineers hand-tune their libraries at Shader ASSembly (SASS) level to optimize the performance on their devices at a very low level;
- These implementations are not usually open-source, so it is not possible to tune the performance of these implementations for special use-cases.

- A CUDA-only implementation of the single-precision Winograd algorithm for $3\times 3$ kernels is introduced. The code has been released as open-source software under the name openCNN and it can be found at https://github.com/UDC-GAC/openCNN, accessed on 28 July 2021.
- Its performance has been evaluated on an NVIDIA Turing RTX 2080Ti and an NVIDIA Ampere RTX 3090 and compared to the state-of-the-art Winograd implementation in cuDNN 8.2.0, achieving speedups of up to $1.76\times $ on the RTX 2080Ti and $1.85\times $ on the RTX 3090 in ResNet [13].
- Moreover, the performance of our openCNN implementation has been compared with the different convolution algorithms available in cuDNN. The average speedups achieved are $1.81\times $ (w.r.t. FFT), $3.77\times $ (FFT Tiling), $3.35\times $ (GEMM), $2.50\times $ (Implicit GEMM), $1.57\times $ (Precomputed implicit GEMM) and $0.93\times $ (Winograd Non-Fused version) on the Turing NVIDIA RTX 2080Ti. Likewise, on the Ampere NVIDIA RTX 3090 the speedups obtained are $1.80\times $ (w.r.t. FFT), $3.26\times $ (FFT Tiling), $2.82\times $ (GEMM), $2.42\times $ (Implicit GEMM), $1.18\times $ (Precomputed implicit GEMM) and $0.97\times $ (Winograd Non-Fused version).

- Our implementation is fully written in CUDA and can be easily tuned and ported to different NVIDIA hardware platforms. However, the implementation described in [10] contains important fragments of architecture-specific assembly code which is restricted to a particular GPU architecture and ISA encoding.
- The implementation of the reference paper uses assembly code to improve the performance of key hotspots of the method, some of them containing costly memory movements. In our paper, these hotspots are encoded in CUDA, using microbenchmarking to find the most efficient way to implement them. The paper also discusses different alternative methods to implement those important sections of the code. As a consequence, our implementation achieves speedups of up to $2.15\times $ on the NVIDIA Turing RTX 2080Ti and up to $2.07\times $ on the NVIDIA Ampere RTX 3090 over the CUDA implementation of [10]. A direct comparison with [10] is not possible as its full code is not publicly available.
- Unlike the study in [10], the full code of our paper is released as open-source software.

## 2. The Winograd Convolution Algorithm

- h: input image height;
- w: input image width;
- r: filter height;
- s: filter width;
- n: number of input images;
- c: number of channels;
- k: number of filters.
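To illustrate how these dimensions interact, the following sketch computes the output size and the number of Winograd output tiles per (image, filter) pair. It assumes a stride-1, padding-1 convolution (which preserves the spatial size for $3\times 3$ filters, as in the ResNet layers evaluated later) and an output tile size $m=2$; the helper function is illustrative only and is not part of the openCNN API.

```python
import math

def winograd_tile_count(h, w, r=3, s=3, m=2, pad=1, stride=1):
    """Output size and number of m x m Winograd output tiles.

    Assumes a stride-1 convolution; with pad=1 and a 3x3 filter the
    output keeps the input's spatial size ("same" padding).
    Illustrative helper -- not part of openCNN's API.
    """
    out_h = (h + 2 * pad - r) // stride + 1
    out_w = (w + 2 * pad - s) // stride + 1
    # Each application of F(2x2, 3x3) yields an m x m patch of the output.
    tiles = math.ceil(out_h / m) * math.ceil(out_w / m)
    return out_h, out_w, tiles

# Example: a 56x56 input (ResNet Conv2) keeps its size and needs 28*28 tiles.
out_h, out_w, tiles = winograd_tile_count(56, 56)
```

These counts are per channel pair; the full workload multiplies them by n images, c input channels and k filters.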

## 3. Implementation of the Winograd Algorithm

- Filter transformation: the original filter g is operated with matrix G using the following expression ${U}_{c,k}=G{g}_{c,k}{G}^{T}$;
- Input transformation: the input tile ${d}_{c,h,w,n}$ is operated with matrix B using the following expression ${V}_{c,\tilde{h},\tilde{w},n}={B}^{T}{d}_{c,h,w,n}B$;
- Multiplication: the outputs of the two previous steps are multiplied, ${M}_{k,\tilde{h},\tilde{w},n}={\sum}_{c=1}^{C}{U}_{c,k}\odot {V}_{c,\tilde{h},\tilde{w},n}$, using a Hadamard product;
- Output transformation: the result of the previous step is operated using the following expression ${Y}_{k,\tilde{h},\tilde{w},n}={A}^{T}{M}_{k,\tilde{h},\tilde{w},n}A$.
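The four steps above can be sketched for a single channel and a single tile with the $F(2\times 2, 3\times 3)$ transform matrices of Lavin and Gray [16]; with $C=1$ the channel sum reduces to one Hadamard product. This numpy version is a minimal reference sketch, not the CUDA kernel of openCNN.

```python
import numpy as np

# Transform matrices for F(2x2, 3x3), as given by Lavin and Gray.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

def winograd_2x2_3x3(d, g):
    """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = G @ g @ G.T          # filter transformation (4x4)
    V = B_T @ d @ B_T.T      # input transformation  (4x4)
    M = U * V                # element-wise (Hadamard) product
    return A_T @ M @ A_T.T   # output transformation (2x2)

# Cross-check against a direct sliding-window convolution of the tile.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4)).astype(np.float32)
g = rng.standard_normal((3, 3)).astype(np.float32)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)], dtype=np.float32)
assert np.allclose(winograd_2x2_3x3(d, g), direct, atol=1e-4)
```

The Hadamard product uses 16 multiplications per tile instead of the 36 a direct computation of the four outputs would need, which is the source of Winograd's arithmetic savings.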

Algorithm 1: Pseudocode of the implementation of Winograd's algorithm for $\alpha =4$. SMEM stands for Shared MEMory, while GMEM stands for Global MEMory.

## 4. Optimization of the Method

#### 4.1. Alternative Output Buffer Layouts

#### 4.1.1. (1) Output Data Transposition

#### 4.1.2. (2) Output Data Transformation

- The red/yellow vectors will not be consecutive to the green/blue ones. Remember that in Figure 6, two buffers that were not contiguous in memory according to the initial layout of Figure 4 were appended side by side. That enabled full warp utilization in the transposition step, but this initial organization must be taken into account at this point to ensure the correctness of the results.
- The float4 data types were divided into two float2 elements. For this reason, the final order in GMEM within each group of elements (yellow/red or blue/green) is the one Figure 10 depicts. After the first column, the flow has been simplified for clarity, but it would proceed as depicted for the first four elements.
- Each thread computes two output tiles, so the widest instruction that can be used to store the results to GMEM is STG.64.

#### 4.1.3. (3) Output Data Storage

## 5. Experimental Results

## 6. Related Work

## 7. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Hestness, J.; Narang, S.; Ardalani, N.; Diamos, G.F.; Jun, H.; Kianinejad, H.; Patwary, M.M.A.; Yang, Y.; Zhou, Y. Deep Learning Scaling is Predictable, Empirically. arXiv 2017, arXiv:1712.00409.
- Chetlur, S.; Woolley, C.; Vandermersch, P.; Cohen, J.; Tran, J.; Catanzaro, B.; Shelhamer, E. cuDNN: Efficient Primitives for Deep Learning. arXiv 2014, arXiv:1410.0759.
- ARM. Compute Library. Available online: https://github.com/ARM-software/ComputeLibrary (accessed on 28 June 2021).
- Intel. oneDNN. Available online: https://github.com/oneapi-src/oneDNN (accessed on 7 July 2021).
- Khan, J.; Fultz, P.; Tamazov, A.; Lowell, D.; Liu, C.; Melesse, M.; Nandhimandalam, M.; Nasyrov, K.; Perminov, I.; Shah, T.; et al. MIOpen: An open source library for deep learning primitives. arXiv 2019, arXiv:1910.00078.
- Yan, D.; Wang, W.; Chu, X. Optimizing Batched Winograd Convolution on GPUs. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, USA, 22–26 February 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 32–44.
- Zhang, X.; Tan, G.; Xue, S.; Li, J.; Zhou, K.; Chen, M. Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Austin, TX, USA, 4–8 February 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 31–43.
- Nervana Systems. Neon. Available online: https://github.com/NervanaSystems/neon (accessed on 25 June 2021).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
- Winograd, S. Arithmetic Complexity of Computations; SIAM: Philadelphia, PA, USA, 1980; Volume 33.
- Lavin, A.; Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021.
- Horn, R.A. The Hadamard product. Proc. Symp. Appl. Math. 1990, 40, 87–169.
- Barabasz, B.; Gregg, D. Winograd Convolution for DNNs: Beyond Linear Polynomials. In AI*IA 2019–Advances in Artificial Intelligence; Alviano, M., Greco, G., Scarcello, F., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 307–320.
- Jia, Z.; Zlateski, A.; Durand, F.; Li, K. Optimizing N-dimensional, Winograd-based convolution for manycore CPUs. ACM SIGPLAN Not. 2018, 53, 109–123.
- NVIDIA. NVIDIA Turing GPU Architecture. 2018. Available online: https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf (accessed on 18 June 2021).
- NVIDIA. CUDA C++ Programming Guide. Available online: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-7-x (accessed on 24 June 2021).
- Jia, Z.; Maggioni, M.; Smith, J.; Scarpazza, D.P. Dissecting the NVidia Turing T4 GPU via Microbenchmarking. arXiv 2019, arXiv:1903.07486.
- Yan, D.; Wang, W.; Chu, X. Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply. In Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, USA, 18–22 May 2020; pp. 634–643.
- MLPerf Org. Available online: https://mlcommons.org/en/ (accessed on 13 August 2021).
- NVIDIA. How to Implement Performance Metrics in CUDA C/C++. 2019. Available online: https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/ (accessed on 21 June 2021).
- Huang, Y.; Shen, J.; Wang, Z.; Wen, M.; Zhang, C. A high-efficiency FPGA-based accelerator for convolutional neural networks using Winograd algorithm. J. Phys. Conf. Ser. 2018, 1026, 012019.
- Gray, S. Maxas. Available online: https://github.com/NervanaSystems/maxas (accessed on 29 June 2021).
- Wu, R.; Zhang, F.; Zheng, Z.; Du, X.; Shen, X. Exploring deep reuse in Winograd CNN inference. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Korea, 27 February–3 March 2021; pp. 483–484.
- Zhang, X.; Xiao, J.; Tan, G. I/O lower bounds for auto-tuning of convolutions in CNNs. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Korea, 27 February–3 March 2021; pp. 247–261.
- Mathieu, M.; Henaff, M.; LeCun, Y. Fast training of convolutional networks through FFTs. In Proceedings of the 2nd International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014.
- NVIDIA. Cutlass. Available online: https://github.com/NVIDIA/cutlass (accessed on 14 July 2021).
- Georganas, E.; Avancha, S.; Banerjee, K.; Kalamkar, D.; Henry, G.; Pabst, H.; Heinecke, A. Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, USA, 11–16 November 2018; IEEE Press: New York, NY, USA, 2018.

**Figure 5.** Lane ID arrangement for the output transform buffer (**right side**, where p denotes the padding elements) and the pre-transform output data transposed in the first round (**left side**).

(a) CPI

| Type \ Width (bits) | 32 | 64 | 128 |
|---|---|---|---|
| LDS | 2.11 | 4.00 | 8.00 |
| STS | 4.06 | 6.01 | 10.00 |

(b) Throughput (Bytes/Cycle)

| Type \ Width (bits) | 32 | 64 | 128 |
|---|---|---|---|
| LDS | 60.55 | 64.00 | 64.00 |
| STS | 31.50 | 42.61 | 51.21 |
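The CPI and throughput figures above are mutually consistent if one assumes (our assumption, not a statement from the text) that every instruction is issued by a full 32-lane warp, so that throughput ≈ 32 × (width/8) / CPI bytes per cycle. A quick check:

```python
# Sanity check: with a full 32-lane warp per shared-memory instruction,
# bytes/cycle = 32 * (width_bits / 8) / CPI. This warp-level model is an
# assumption used to relate the two tables, not a claim from the paper.
cpi = {("LDS", 32): 2.11, ("LDS", 64): 4.00, ("LDS", 128): 8.00,
       ("STS", 32): 4.06, ("STS", 64): 6.01, ("STS", 128): 10.00}
measured = {("LDS", 32): 60.55, ("LDS", 64): 64.00, ("LDS", 128): 64.00,
            ("STS", 32): 31.50, ("STS", 64): 42.61, ("STS", 128): 51.21}

for (op, width), c in cpi.items():
    predicted = 32 * (width / 8) / c
    # Every prediction lands within ~1% of the measured throughput.
    assert abs(predicted - measured[(op, width)]) / measured[(op, width)] < 0.01
```

Note that LDS saturates at 64 bytes/cycle for the 64- and 128-bit widths, while wider STS instructions keep improving throughput, which motivates using the widest available stores.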

| Layer | Output (H × W) | Filter (C, R × S, K) |
|---|---|---|
| Conv2 | 56 × 56 | [64, 3 × 3, 64] |
| Conv3 | 28 × 28 | [128, 3 × 3, 128] |
| Conv4 | 14 × 14 | [256, 3 × 3, 256] |
| Conv5 | 7 × 7 | [512, 3 × 3, 512] |
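A back-of-the-envelope comparison of multiply counts for these layers shows why $F(2\times 2, 3\times 3)$ Winograd pays off, and why the gain shrinks on small feature maps whose odd sizes force partially wasted tiles. This sketch ignores the cost of the transforms themselves and is not a claim about openCNN's measured performance.

```python
import math

# Per-layer multiply counts for ResNet's 3x3 convolutions: a direct
# implementation (9 multiplies per output element) vs. F(2x2, 3x3)
# Winograd (16 multiplies per 2x2 output tile). Transform costs ignored.
layers = {"Conv2": (56, 64, 64), "Conv3": (28, 128, 128),
          "Conv4": (14, 256, 256), "Conv5": (7, 512, 512)}

for name, (hw, c, k) in layers.items():
    direct = hw * hw * 9 * c * k
    tiles = math.ceil(hw / 2) ** 2       # 2x2 output tiles (padded if odd)
    wino = tiles * 16 * c * k
    print(f"{name}: direct/winograd multiply ratio = {direct / wino:.2f}")
```

For the even-sized layers the ratio is the ideal 36/16 = 2.25, while Conv5's 7 × 7 output needs a 4 × 4 grid of tiles and drops to about 1.72.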

**Table 3.** Speedup of openCNN: 1.34× on average over cuDNN and 1.58× over the [10] CUDA C++ implementation on the Turing RTX 2080Ti.

| Code | N | Conv2 | Conv3 | Conv4 | Conv5 |
|---|---|---|---|---|---|
| cuDNN | 32 | 1.18× | 1.25× | 1.20× | 1.69× |
| | 64 | 1.19× | 1.26× | 1.21× | 1.76× |
| | 96 | 1.19× | 1.25× | 1.18× | 1.73× |
| | 128 | 1.20× | 1.23× | 1.19× | 1.74× |
| Yan et al. | 32 | 2.15× | 1.75× | 1.53× | 1.31× |
| | 64 | 2.02× | 1.58× | 1.44× | 1.29× |
| | 96 | 2.00× | 1.50× | 1.38× | 1.26× |
| | 128 | 2.01× | 1.46× | 1.36× | 1.27× |

**Table 4.** Speedup of openCNN: 1.34× on average over cuDNN and 1.49× over the [10] CUDA C++ implementation on the Ampere RTX 3090.

| Code | N | Conv2 | Conv3 | Conv4 | Conv5 |
|---|---|---|---|---|---|
| cuDNN | 32 | 1.16× | 1.26× | 1.12× | 1.64× |
| | 64 | 1.19× | 1.23× | 1.26× | 1.60× |
| | 96 | 1.23× | 1.28× | 1.18× | 1.85× |
| | 128 | 1.24× | 1.26× | 1.25× | 1.78× |
| Yan et al. | 32 | 2.07× | 1.70× | 1.37× | 1.15× |
| | 64 | 1.98× | 1.53× | 1.35× | 1.27× |
| | 96 | 1.97× | 1.43× | 1.25× | 1.14× |
| | 128 | 1.94× | 1.39× | 1.26× | 1.11× |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Castro, R.L.; Andrade, D.; Fraguela, B.B.
OpenCNN: A Winograd Minimal Filtering Algorithm Implementation in CUDA. *Mathematics* **2021**, *9*, 2033.
https://doi.org/10.3390/math9172033
