# Taxonomy of Vectorization Patterns of Programming for FIR Image Filters Using Kernel Subsampling and New One

## Abstract

## 1. Introduction

## 2. 2D FIR Image Filtering and Its Acceleration

#### 2.1. Definition of 2D FIR Image Filtering

#### 2.2. General Acceleration of FIR Image Filtering

## 3. Design Patterns of Vectorized Programming for FIR Image Filtering

#### 3.1. Data Loading and Storing in Vectorized Programming

#### 3.2. Image Data Structure

#### 3.3. Vectorization of FIR Filtering

#### 3.4. Color Loop Unrolling

#### 3.5. Kernel Loop Unrolling

#### 3.6. Pixel Loop Unrolling

## 4. Proposed Design Pattern of Vectorization

## 5. Material and Methods

#### 5.1. Gaussian Range Filter

#### 5.2. Bilateral Filter

#### 5.3. Adaptive Gaussian Filter

#### 5.4. Randomly-Kernel-Subsampled Filter

## 6. Experimental Results

## 7. Conclusions

- Both types of the proposed pattern, kernel loop vectorization and pixel loop vectorization, are effective for filters with adaptive kernel shapes, that is, the randomized filters and the AGF.
- A trade-off remains, however, between weight loading and data loading when the spatial LUT changes at each filtered pixel. Kernel loop unrolling is better suited to weight loading, and loop vectorization is better suited to data loading. Kernel loop vectorization is effective for both weight and data loading; it is therefore suitable for the AGF, RKS-AGF, and RKS-BF.
- Under large-radius conditions, both types of the proposed pattern are moderately effective for the filters not covered by the cases above, that is, the GRF and BF.

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A

**Figure A1.** Processing time for and accuracy of image subsampling and kernel subsampling with respect to the kernel radius of FIR filtering. (**a**) Processing time; (**b**) PSNR; (**c**) PSNR vs. processing time. Image size is 512 × 512.


**Figure 1.** Example of kernel subsampling. Only samples of current (red) and reference (yellow) pixels are computed.
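To make the idea in Figure 1 concrete, the following is a minimal sketch of kernel subsampling, not the authors' implementation. It assumes a random choice of reference offsets, uniform weights, and clamped borders; the names `subsample_offsets` and `filter_subsampled` are hypothetical and chosen only for illustration.

```python
import random


def subsample_offsets(radius, ratio, seed=0):
    """Pick a random subset of kernel offsets, always keeping the center (0, 0).

    Assumption: random selection with a fixed seed; other sampling schemes
    are equally possible.
    """
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if (dy, dx) != (0, 0)]
    rng = random.Random(seed)
    keep = round(len(offsets) * ratio)
    return [(0, 0)] + rng.sample(offsets, keep)


def filter_subsampled(img, offsets):
    """Average only the sampled reference pixels (uniform weights, clamped borders)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy, dx in offsets:
                yy = min(max(y + dy, 0), h - 1)  # clamp row index
                xx = min(max(x + dx, 0), w - 1)  # clamp column index
                acc += img[yy][xx]
            out[y][x] = acc / len(offsets)
    return out
```

With `radius=2` and `ratio=0.25`, only 7 of the 25 kernel positions are visited per pixel, which is where the reduction in computation comes from.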

**Figure 4.** Code of vectorization patterns. (**a**) Brute-force implementation; (**b**) Color loop unrolling; (**c**) Kernel loop unrolling; (**d**) Pixel loop unrolling. The size of the SIMD register is 4. Usually, the data structure $I[y][x][c]$ represents RGB interleaving, where $x$ and $y$ are the horizontal and vertical positions, respectively, and $c$ is the color channel. Splitting and merging the data by channel are defined as $I[y][x][c] \iff I[c][y][x]$. In both data structures, the data indexed by the final operator $[\cdot]$ can be accessed sequentially.
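The split and merge relation $I[y][x][c] \iff I[c][y][x]$ in the caption can be sketched with nested lists as follows. This is an illustrative sketch only; `split_channels` and `merge_channels` are hypothetical names, and a real implementation would operate on contiguous buffers rather than Python lists.

```python
def split_channels(interleaved):
    """Convert I[y][x][c] (interleaved) to I[c][y][x] (planar)."""
    h = len(interleaved)
    w = len(interleaved[0])
    ch = len(interleaved[0][0])
    return [[[interleaved[y][x][c] for x in range(w)]
             for y in range(h)]
            for c in range(ch)]


def merge_channels(planar):
    """Convert I[c][y][x] (planar) back to I[y][x][c] (interleaved)."""
    ch = len(planar)
    h = len(planar[0])
    w = len(planar[0][0])
    return [[[planar[c][y][x] for c in range(ch)]
             for x in range(w)]
            for y in range(h)]
```

In the planar layout the innermost index is `x`, so the pixels of one channel along a row are adjacent; in the interleaved layout the innermost index is `c`, so the channels of one pixel are adjacent.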

**Figure 5.** Vectorization patterns of vectorized programming. (**a**) Color loop unrolling; (**b**) Kernel loop unrolling; (**c**) Arbitrary kernel loop unrolling; (**d**) Pixel loop unrolling using the same weight for the reference pixels; (**e**) Pixel loop unrolling using different weights for the reference pixels; (**f**) Arbitrary pixel loop unrolling.

**Figure 6.** Kernel vectorization. (**a**) Rearrange approach; (**b**) Data structure of a pixel in gray image; (**c**) Data structure of a pixel in color image. The size of the SIMD register is 4.

**Figure 7.** Code of loop vectorization. (**a**) Loop vectorization for kernel loop; (**b**) Loop vectorization for pixel loop. The size of the SIMD register is 4. $\mathrm{LV}$ denotes the data structure transformed by loop vectorization. In this data structure, the data indexed by the final operator $[\cdot]$ are always accessed sequentially.
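One plausible reading of the $\mathrm{LV}$ structure for the kernel loop, similar in spirit to an im2col rearrangement, is sketched below. It is an assumption about the layout, not the code shown in the figure: the reference pixels of one target pixel are gathered into a contiguous list, so the weighted sum reads both operands sequentially. The names `vectorize_kernel_loop` and `filter_pixel` are hypothetical.

```python
def vectorize_kernel_loop(img, y, x, offsets):
    """Gather the reference pixels of (y, x) into one contiguous list.

    Assumption: borders are clamped; offsets is a list of (dy, dx) pairs.
    """
    h, w = len(img), len(img[0])
    return [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
            for dy, dx in offsets]


def filter_pixel(lv, weights):
    """Normalized weighted sum over the rearranged (sequential) data."""
    num = sum(wgt * val for wgt, val in zip(weights, lv))
    den = sum(weights)
    return num / den
```

Because the gather step copies every reference pixel for every target pixel, this kind of rearrangement costs extra preprocessing time and memory, which matches the limitations listed for loop vectorization in Table 1.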

**Figure 8.** Processing time for Gaussian range filtering (GRF) with respect to the kernel radius of FIR filtering. (**a**) Full sample in weight computation; (**b**) 1/4 subsample in weight computation; (**c**) 1/16 subsample in weight computation; (**d**) Full sample in LUT; (**e**) 1/4 subsample in LUT; (**f**) 1/16 subsample in LUT. Note that arbitrary kernel loop unrolling is used instead of kernel loop unrolling under kernel-subsampling conditions.

**Figure 9.** The speedup ratio for Gaussian range filtering (GRF) with respect to the kernel radius of FIR filtering. (**a**) Full sample in weight computation; (**b**) 1/4 subsample in weight computation; (**c**) 1/16 subsample in weight computation; (**d**) Full sample in LUT; (**e**) 1/4 subsample in LUT; (**f**) 1/16 subsample in LUT. If the ratio exceeds 1, the given pattern is faster than the kernel loop vectorization.

**Figure 10.** Processing time for bilateral filtering (BF) with respect to the kernel radius of FIR filtering. (**a**) Full sample in weight computation; (**b**) 1/4 subsample in weight computation; (**c**) 1/16 subsample in weight computation; (**d**) Full sample in LUT; (**e**) 1/4 subsample in LUT; (**f**) 1/16 subsample in LUT. Note that arbitrary kernel loop unrolling is used instead of kernel loop unrolling under kernel-subsampling conditions.

**Figure 11.** The speedup ratio for bilateral filtering (BF) with respect to the kernel radius of FIR filtering. (**a**) Full sample in weight computation; (**b**) 1/4 subsample in weight computation; (**c**) 1/16 subsample in weight computation; (**d**) Full sample in LUT; (**e**) 1/4 subsample in LUT; (**f**) 1/16 subsample in LUT. If the ratio exceeds 1, the given pattern is faster than the kernel loop vectorization.

**Figure 12.** Processing time for adaptive Gaussian filtering (AGF) with respect to the kernel radius of FIR filtering. (**a**) Full sample in weight computation; (**b**) 1/4 subsample in weight computation; (**c**) 1/16 subsample in weight computation; (**d**) Full sample in LUT; (**e**) 1/4 subsample in LUT; (**f**) 1/16 subsample in LUT. Note that arbitrary kernel loop unrolling is used instead of kernel loop unrolling under kernel-subsampling conditions.

**Figure 13.** The speedup ratio for adaptive Gaussian filtering (AGF) with respect to the kernel radius of FIR filtering. (**a**) Full sample in weight computation; (**b**) 1/4 subsample in weight computation; (**c**) 1/16 subsample in weight computation; (**d**) Full sample in LUT; (**e**) 1/4 subsample in LUT; (**f**) 1/16 subsample in LUT. If the ratio exceeds 1, the given pattern is faster than the kernel loop vectorization.

**Figure 14.** Processing time for randomly-kernel-subsampled Gaussian range filtering (RKS-GRF) with respect to the kernel radius of FIR filtering. (**a**) 1/4 subsample in weight computation; (**b**) 1/16 subsample in weight computation; (**c**) 1/4 subsample in LUT; (**d**) 1/16 subsample in LUT. Arbitrary pixel loop unrolling and arbitrary kernel loop unrolling are used in place of pixel loop unrolling and kernel loop unrolling, respectively.

**Figure 15.** The speedup ratio for randomly-kernel-subsampled Gaussian range filtering (RKS-GRF) with respect to the kernel radius of FIR filtering. (**a**) 1/4 subsample in weight computation; (**b**) 1/16 subsample in weight computation; (**c**) 1/4 subsample in LUT; (**d**) 1/16 subsample in LUT. If the ratio exceeds 1, the given pattern is faster than the kernel loop vectorization.

**Figure 16.** Processing time for randomly-kernel-subsampled bilateral filtering (RKS-BF) with respect to the kernel radius of FIR filtering. (**a**) 1/4 subsample in weight computation; (**b**) 1/16 subsample in weight computation; (**c**) 1/4 subsample in LUT; (**d**) 1/16 subsample in LUT. Arbitrary pixel loop unrolling and arbitrary kernel loop unrolling are used in place of pixel loop unrolling and kernel loop unrolling, respectively.

**Figure 17.** The speedup ratio for randomly-kernel-subsampled bilateral filtering (RKS-BF) with respect to the kernel radius of FIR filtering. (**a**) 1/4 subsample in weight computation; (**b**) 1/16 subsample in weight computation; (**c**) 1/4 subsample in LUT; (**d**) 1/16 subsample in LUT. If the ratio exceeds 1, the given pattern is faster than the kernel loop vectorization.

**Figure 18.** Processing time for randomly-kernel-subsampled adaptive Gaussian filtering (RKS-AGF) with respect to the kernel radius of FIR filtering. (**a**) 1/4 subsample in weight computation; (**b**) 1/16 subsample in weight computation; (**c**) 1/4 subsample in LUT; (**d**) 1/16 subsample in LUT. Arbitrary pixel loop unrolling and arbitrary kernel loop unrolling are used in place of pixel loop unrolling and kernel loop unrolling, respectively.

**Figure 19.** The speedup ratio for randomly-kernel-subsampled adaptive Gaussian filtering (RKS-AGF) with respect to the kernel radius of FIR filtering. (**a**) 1/4 subsample in weight computation; (**b**) 1/16 subsample in weight computation; (**c**) 1/4 subsample in LUT; (**d**) 1/16 subsample in LUT. If the ratio exceeds 1, the given pattern is faster than the kernel loop vectorization.

**Figure 20.** PSNR with respect to the kernel radius of FIR filtering. (**a**) Gaussian range filter; (**b**) Bilateral filter; (**c**) Adaptive Gaussian filter; (**d**) Randomly-kernel-subsampled Gaussian range filter; (**e**) Randomly-kernel-subsampled bilateral filter; (**f**) Randomly-kernel-subsampled adaptive Gaussian filter. Image size is 512 × 512.

**Figure 21.** Processing time for loop vectorization with respect to the kernel radius of FIR filtering. There are 2 × 4 lines, and their combinations represent image resolution (512 × 512 and 900 × 750) and kernel subsampling ratio (full, 1/4, and 1/16).

**Table 1.** Characteristics of the vectorization patterns in finite impulse response (FIR) image filtering.

Vectorization Pattern | Arbitrary Parameter/Non-Limitation | Restricted Parameter/Limitation |
---|---|---|
loop vectorization | image width, kernel width, kernel shape, aligned load | long preprocessing time, huge memory usage |
color loop unrolling | image width, kernel width, kernel shape, aligned load | requiring color image with padding |
kernel loop unrolling | image width | kernel width, kernel shape, non-aligned load |
arbitrary kernel loop unrolling | image width, kernel shape | kernel width, inefficient load, non-aligned load |
pixel loop unrolling | kernel width | image width, kernel shape, non-aligned load |
arbitrary pixel loop unrolling | kernel width, kernel shape | image width, inefficient load, non-aligned load |

**Table 2.** Characteristics of the Gaussian range filter (GRF), the bilateral filter (BF), the adaptive Gaussian filter (AGF), the randomly-kernel-subsampled Gaussian range filter (RKS-GRF), the randomly-kernel-subsampled bilateral filter (RKS-BF), and the randomly-kernel-subsampled adaptive Gaussian filter (RKS-AGF).

Filter | Weight Depends On | LUT | Kernel Shape |
---|---|---|---|
GRF | pixel value | range | invariant |
BF | pixel value, pixel position | space, range | invariant |
AGF | parameter map, pixel position | space | variant |
RKS-GRF | pixel value | range | variant |
RKS-BF | pixel value, pixel position | space, range | variant |
RKS-AGF | parameter map, pixel position | space | variant |
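The weight dependencies in Table 2 can be illustrated with the standard Gaussian forms of the range and spatial weights. The sketch below is an assumption-laden illustration rather than the authors' code: the function names, the 8-bit intensity range used for the LUT, and the parameters `sigma_r` and `sigma_s` are all chosen here for the example.

```python
import math


def range_weight(p, q, sigma_r):
    """Gaussian weight on the intensity difference (GRF; range part of BF)."""
    return math.exp(-((p - q) ** 2) / (2.0 * sigma_r ** 2))


def spatial_weight(dy, dx, sigma_s):
    """Gaussian weight on the pixel offset (space part of BF; AGF)."""
    return math.exp(-(dy * dy + dx * dx) / (2.0 * sigma_s ** 2))


def bilateral_weight(p, q, dy, dx, sigma_s, sigma_r):
    """BF weight: product of the spatial and range weights."""
    return spatial_weight(dy, dx, sigma_s) * range_weight(p, q, sigma_r)


def build_range_lut(sigma_r, levels=256):
    """Range LUT indexed by the absolute intensity difference.

    Assumption: integer intensities in [0, levels - 1].
    """
    return [range_weight(d, 0, sigma_r) for d in range(levels)]
```

For the AGF, `sigma_s` would be read from a per-pixel parameter map, so a spatial LUT would have to change from pixel to pixel, whereas a range LUT such as `build_range_lut(sigma_r)` can be shared across the whole image.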

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Maeda, Y.; Fukushima, N.; Matsuo, H.
Taxonomy of Vectorization Patterns of Programming for FIR Image Filters Using Kernel Subsampling and New One. *Appl. Sci.* **2018**, *8*, 1235.
https://doi.org/10.3390/app8081235
