# DBGC: Dimension-Based Generic Convolution Block for Object Recognition


## Abstract


## 1. Introduction

#### 1.1. ShuffleNetv2

- When the numbers of input and output channels are in equal proportion (1:1), memory access cost is minimized.
- Excessive group convolution raises memory access cost: the number of groups should therefore not be too large.
- Network fragmentation reduces the degree of parallelism: a fragmented network is less efficient at performing parallel computations.
- Element-wise operations are not negligible: although element-wise operations have few FLOPs, they can significantly increase memory access time.
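The first two guidelines can be checked with a short back-of-the-envelope calculation (a sketch using the ShuffleNetV2 paper's cost model; the feature-map size and channel values below are illustrative, not from this article). For a 1 × 1 convolution with c1 input and c2 output channels on an h × w map, memory access cost (MAC) is hw(c1 + c2) + c1·c2, and for a fixed FLOP budget (c1·c2 constant) it is smallest at the balanced 1:1 ratio:

```python
# Memory access cost (MAC) of a 1x1 convolution with c1 input and c2 output
# channels on an h x w feature map, per the ShuffleNetV2 analysis:
# MAC = hw(c1 + c2) + c1*c2, while FLOPs = hw * c1 * c2.
def mac_1x1(h, w, c1, c2):
    return h * w * (c1 + c2) + c1 * c2

def flops_1x1(h, w, c1, c2):
    return h * w * c1 * c2

# Fix the FLOP budget (c1*c2 constant) and vary the channel ratio:
h = w = 56
budget = 128 * 128
for c1, c2 in [(32, 512), (64, 256), (128, 128), (256, 64)]:
    assert c1 * c2 == budget          # identical FLOPs for every ratio
    print(c1, c2, mac_1x1(h, w, c1, c2))
# MAC is smallest at the balanced 1:1 ratio (c1 == c2 == 128).
```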

#### 1.2. ESPNetv2

#### 1.3. DiCENet

#### 1.4. MobileNetv2

Architecture | Year | Parameters | Top1 | Top5
---|---|---|---|---
ShuffleNetv2 [22] | 2018 | 2.3 M | 69.4 | 88.9
MobileNetv2 [25] | 2018 | 3.47 M | 71.8 | 91.0
ESPNetv2 [23] | 2019 | 3.49 M | 72.06 | 90.39
DiCENet [24] | 2020 | 2.65 M | 69.05 | 88.8

## 2. Materials and Methods

#### 2.1. Introduction to Separable Convolution

#### 2.1.1. Spatial Separable Convolution

#### 2.1.2. Depth-Wise Separable Convolution

#### Depth-Wise Convolution

#### Point-Wise Convolution
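The two stages can be sketched in a few lines of NumPy (an illustrative implementation of the standard technique, not code from this article; the shapes echo Figures 6 and 7, where a 12 × 12 input with 5 × 5 kernels yields an 8 × 8 map):

```python
import numpy as np

# Depth-wise separable convolution = a per-channel spatial convolution
# (depth-wise) followed by a 1x1 point-wise convolution that mixes channels.
def depthwise(x, kernels):
    """x: (C, H, W); kernels: (C, k, k) -- one spatial kernel per channel."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    out = np.zeros((c, h - k + 1, w - k + 1))
    for ci in range(c):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[ci, i, j] = np.sum(x[ci, i:i+k, j:j+k] * kernels[ci])
    return out

def pointwise(x, weights):
    """x: (C_in, H, W); weights: (C_out, C_in) -- 1x1 conv across channels."""
    return np.tensordot(weights, x, axes=([1], [0]))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 12, 12))                 # 3-channel 12x12 input
dw = depthwise(x, rng.standard_normal((3, 5, 5)))    # -> (3, 8, 8)
y = pointwise(dw, rng.standard_normal((1, 3)))       # -> (1, 8, 8)
print(dw.shape, y.shape)
```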

#### 2.2. Introduction to Convolution Kernels

#### 2.2.1. 1D Convolution

#### 2.2.2. 2D Convolution

#### 2.2.3. 3D Convolution
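The difference between the three kernel types is simply the number of axes the kernel slides over (my example values below, not the article's): 1D convolution slides along one axis, 2D over an image plane, and 3D additionally along depth.

```python
import numpy as np

# 1D convolution: a single sliding axis.
signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])
print(np.convolve(signal, kernel, mode="valid"))  # -> [2. 2.]

# 2D and 3D convolutions slide a kernel over two or three axes: without
# padding, a kernel of shape (kd, kh, kw) applied to a (D, H, W) volume
# produces an output of shape (D - kd + 1, H - kh + 1, W - kw + 1).
D, H, W, k = 8, 12, 12, 3
print((D - k + 1, H - k + 1, W - k + 1))  # -> (6, 10, 10)
```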

## 3. DBGC—Dimension-Based Generic Convolution Unit

#### 3.1. Convolution Based on Dimension (ConvDim)

Let the input be a tensor I ∈ R^(H × D × W), where H, D and W correspond to the height, depth and width of I, respectively. As shown in Figure 12, ConvDim has three kernels, one for each dimension. It applies dimension-wise convolutional kernels: a height-wise kernel K_H ∈ R^(1 × n × n) along the height; a depth-wise kernel K_D ∈ R^(n × 1 × n) along the depth; and a width-wise kernel K_W ∈ R^(n × n × 1) along the width. These kernels produce outputs Y_H, Y_D and Y_W ∈ R^(H × D × W), which encode the information provided in the input tensor. The outputs of these independent branches are concatenated in the dimension selector block, such that the first spatial planes of Y_D, Y_W and Y_H are put together, and so on, to produce the output Y_Dim.
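The three branches can be sketched as follows (my own minimal implementation for illustration, not the authors' code; the tensor sizes are arbitrary). Each dimension-wise kernel convolves the whole input with "same" padding, so every branch preserves the input shape H × D × W:

```python
import numpy as np

def conv_same(x, k):
    """Naive 'same'-padded 3D correlation of x with kernel k."""
    pads = [((s - 1) // 2, s // 2) for s in k.shape]
    xp = np.pad(x, pads)
    out = np.zeros_like(x)
    k0, k1, k2 = k.shape
    for a in range(x.shape[0]):
        for b in range(x.shape[1]):
            for c in range(x.shape[2]):
                out[a, b, c] = np.sum(xp[a:a+k0, b:b+k1, c:c+k2] * k)
    return out

n = 3
rng = np.random.default_rng(0)
I = rng.standard_normal((6, 4, 5))      # input tensor, axes (H, D, W)
K_H = rng.standard_normal((1, n, n))    # height-wise kernel
K_D = rng.standard_normal((n, 1, n))    # depth-wise kernel
K_W = rng.standard_normal((n, n, 1))    # width-wise kernel
Y_H, Y_D, Y_W = (conv_same(I, k) for k in (K_H, K_D, K_W))
Y_Dim = np.stack([Y_D, Y_W, Y_H])       # branches stacked for the selector
print(Y_Dim.shape)  # -> (3, 6, 4, 5)
```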

#### 3.2. Dimension Selector (Ds)

Here Ds denotes the dimension selector and K_D, K_H and K_W are the dimension-based kernels (depth-, height- and width-wise, respectively). All three kernels may be selected, Ds = K_D ∪ K_W ∪ K_H; any combination of two kernels may be selected, Ds ∈ {K_D ∪ K_W, K_W ∪ K_H, K_D ∪ K_H}; or a single kernel may be selected, Ds ∈ {K_D, K_W, K_H}. The dimension selector therefore offers a total of seven possibilities (only height; only width; only depth; height and width; width and depth; height and depth; and height, width and depth). Based on the selection of kernels, the appropriate dimensions are provided to Y_Dim. The possible selections are shown in Figure 13.
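The seven selector possibilities are exactly the non-empty subsets of the three kernels, which can be enumerated directly (a tiny sketch, not the authors' implementation):

```python
from itertools import combinations

# Every non-empty subset of {K_H, K_W, K_D}: 3 singles + 3 pairs + 1 triple.
kernels = ["K_H", "K_W", "K_D"]
choices = [set(c) for r in (1, 2, 3) for c in combinations(kernels, r)]
print(len(choices))  # -> 7
for c in choices:
    print(sorted(c))
```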

#### 3.3. Dimension-Wise Blends (DimBlend)

To combine the dimension-wise representations Y_Dim ∈ R^(3D × H × W) and create an output Y ∈ R^(D × H × W), a point-wise convolutional layer applies point-wise kernels K_p ∈ R^(3D × 1 × 1) and executes 3D²HW operations. This takes a lot of time to compute. However, the dimension-wise blend module allows one to mix the representations of Y_Dim effectively, given DimConv's capacity to encode spatial and channel-wise information (though separately). DimBlend factorizes the point-wise convolution into two phases, as shown in Figure 12: (1) local fusion and (2) global fusion [26].

Y_Dim concatenates the output from the dimension selector module; it concatenates the spatial planes of each dimension. DimBlend uses a group point-wise convolution layer to combine the dimension-wise information received from Y_Dim. As shown in Figure 12, K_g operates independently on each dimension group D. This process is denoted as local fusion.
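A quick operation count shows why the factorization pays off (my arithmetic based on the 3D²HW count quoted above; the assumption that local fusion uses D groups of three channels each follows the grouping suggested by Figure 12):

```python
# Dense point-wise convolution over Y_Dim: K_p in R^{3D x 1 x 1} with D
# output channels -> 3 * D^2 * H * W multiply-accumulate operations.
def dense_pointwise_ops(D, H, W):
    return 3 * D * D * H * W

# Group point-wise convolution (local fusion): D groups, each mixing only
# its own 3 dimension-wise channels -> 3 * D * H * W operations.
def grouped_pointwise_ops(D, H, W):
    return 3 * D * H * W

D, H, W = 64, 56, 56
print(dense_pointwise_ops(D, H, W) // grouped_pointwise_ops(D, H, W))  # -> 64
```

The dense layer is a factor of D more expensive than the grouped local fusion, which is what makes the two-phase DimBlend design attractive.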

## 4. Results and Analysis

#### 4.1. Implementation of DBGC

#### 4.1.1. Experimental Setup

#### 4.1.2. Dataset Details

#### 4.2. Results Analysis

#### 4.2.1. Unoptimized Kernel Dimensions

#### 4.2.2. Semi-Optimized Kernel Dimensions

#### 4.2.3. Optimized Kernel Dimension

## 5. Conclusions

## 6. Future Work

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Bhatt, D.; Patel, C.; Talsania, H.; Patel, J.; Vaghela, R.; Pandya, S.; Modi, K.; Ghayvat, H. CNN Variants for Computer Vision: History, Architecture, Application, Challenges and Future Scope. Electronics **2021**, 10, 2470.
- Bhatt, D.; Bhensadadiya, N.P. Survey On Various Intelligent Traffic Management Schemes For Emergency Vehicles. Int. J. Recent Innov. **2013**, 1, 11–16.
- Garg, S.; Patel, C.; Tank, H.; Ukani, V. Efficient Vehicle Detection and Classification for Traffic Surveillance System. In Proceedings of the International Conference on Advances in Computing and Data Sciences, Ghaziabad, India, 11–12 November 2016; pp. 495–503.
- Patel, R.; Patel, P.; Patel, C. Object Detection and Segmentation using Local and Global Property. Int. J. Emerg. Technol. Sci. Eng. **2012**, 2, 2–10.
- Garg, S.; Patel, C.I. Comparative analysis of traditional methods for moving object detection in video sequence. Int. J. Comput. Sci. Commun. **2015**, 6, 309–315.
- Garg, S.; Zaveri, T.; Banerjee, A.; Patel, C.I. Top-down and bottom-up cues based moving object detection for varied background video sequences. Adv. Multimed. **2014**, 2014, 13.
- Bosamiya, D.; Fuletra, J.D. A survey on drivers drowsiness detection techniques. Int. J. Recent Innov. Trends Comput. Commun. **2013**, 1, 816–819.
- Bosamiya, D.; Kamariya, N.; Miyatra, A. A Survey on Disease and Nutrient Deficiency Detection in Cotton Plant. Int. J. Recent Innov. Trends Comput. Commun. **2013**, 1, 812–815.
- Parikh, M.; Bhatt, D.; Patel, M. Animal detection using template matching algorithm. Int. J. Res. Mod. Eng. Emerg. Technol. **2013**, 1, 26–32.
- Bhatt, D.; Patel, C.; Sharma, P. Intelligent Farm Surveillance System for Bird Detection. Glob. J. Eng. Des. Technol. **2012**, 1, 14–18.
- Bhatt, D.; Patel, M. Review of Different Techniques for Ripe Fruit Detection. Int. J. Eng. Dev. Res. **2016**, 4, 247–250.
- Ghayvat, H.; Pandya, S.N.; Bhattacharya, P.; Zuhair, M.; Rashid, M.; Hakak, S.; Dev, K. CP-BDHCA: Blockchain-based Confidentiality-Privacy preserving Big Data scheme for healthcare clouds and applications. IEEE J. Biomed. Health Inform. **2021**, 9, 168455–168484.
- Patel, C.; Thakkar, S. Iris Recognition Supported Best Gabor Filters and Deep Learning CNN Options. In Proceedings of the 2020 International Conference on Industry 4.0 Technology (I4Tech), Pune, India, 13–15 February 2020.
- Bhagchandani, A.; Bhatt, D.; Chopade, M. Various Big Data Techniques to Process and Analyse Neuroscience Data. In Proceedings of the 2018 5th International Conference on "Computing for Sustainable Global Development", New Delhi, India, 14–16 March 2018; pp. 397–402.
- Soni, R.; Bhatt, D.; Patel, R. Tumor Detection using Normalized Cross Co-Relation. Int. J. Res. Mod. Eng. Emerg. Technol. **2013**, 1, 21–25.
- Patel, R.; Patel, P.; Patel, C.I. Goal Detection from Unsupervised Video Surveillance. In Proceedings of the International Conference on Advances in Computing and Information Technology, Chennai, India, 15–17 July 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 76–88.
- Patel, R.; Patel, C.I. Robust face recognition using distance matrices. Int. J. Comput. Electric. Eng. **2013**, 5, 401–404.
- Bhatt, D.; Bhagchandani, A. Machine Learning Model for Predicting Social Media Influence on Sports. In Proceedings of the Ires International Conference, Online, 9–10 October 2020.
- Mehta, M.; Passi, K.; Chatterjee, I.; Patel, R. Knowledge Modelling and Big Data Analytics in Healthcare: Advances and Applications; Taylor and Francis: Abingdon, UK, 2021.
- Labana, D.; Pandya, S.; Modi, K.; Ghayvat, H.; Awais, M.; Patel, C.I. Histogram of Oriented Based Fusion of Features for Human Activity Recognition. Sensors **2020**, 20, 7299.
- Garg, S.; Zaveri, T.; Banerjee, J.; Patel, R.; Patel, C.I. Human action recognition using fusion of features for unconstrained video sequences. Comput. Electric. Eng. **2018**, 70, 284–301.
- Zhang, X.; Zheng, H.-T.; Sun, J.; Ma, N. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. Available online: https://arxiv.org/abs/1807.11164v1 (accessed on 20 January 2022).
- Rastegari, M.; Shapiro, L.; Hajishirzi, H.; Mehta, S. ESPNetv2: A Light-Weight, Power Efficient, and General Purpose Convolutional Neural Network. arXiv **2019**, arXiv:1811.11431.
- Mehta, S.; Hajishirzi, H.; Rastegari, M. DiCENet: Dimension-wise Convolutions for Efficient Networks. IEEE Trans. Pattern Anal. Mach. Intell. **2020**.
- Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C.; Sandler, M. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
- Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H.; Mehta, S. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. arXiv **2018**, arXiv:1803.06815.
- Zhang, X.; Ren, S.; Sun, J.; He, K. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv **2015**, arXiv:1502.01852v1.
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
- Feng, C.; Zhuo, S.; Zhang, X.; Shen, L.; Aleksic, M.; Sheng, T. A Quantization-Friendly Separable Convolution for MobileNets. In Proceedings of the 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), Williamsburg, VA, USA, 25 March 2018; pp. 14–18.
- Zhu, F.; Liu, J.; Liu, G.; Zhang, R. Depth-Wise Separable Convolutions and Multi-Level Pooling for an Efficient Spatial CNN-Based Steganalysis. IEEE Trans. Inf. Forensics Secur. **2020**, 15, 1138–1150.
- Yin, Z.; Wu, M.; Wu, Z.; Kamal, K.C. Depthwise separable convolution architectures for plant disease classification. Comput. Electron. Agric. **2019**, 165, 104948.
- Choi, Y.; Choi, H.; Yoo, B. Fast Depthwise Separable Convolution for Embedded Systems. In Proceedings of the International Conference on Neural Information Processing (ICONIP), Siem Reap, Cambodia, 13–16 December 2018.
- Kaiser, L.; Gomez, A.N.; Chollet, F. Depthwise Separable Convolutions for Neural Machine Translation. arXiv **2017**, arXiv:1706.03059.
- Tran, M.-K.; Yeung, S.-K.; Hua, B.-S. Pointwise Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 984–993.
- Bracewell, R. Two-Dimensional Convolution. In Fourier Analysis and Imaging; Springer: Boston, MA, USA, 2003.
- Wang, H.; Zhang, Q.; Yoon, S.W.; Won, D.; Lu, H. A 3D Convolutional Neural Network for Volumetric Image Semantic Segmentation. Procedia Manuf. **2019**, 39, 422–428.
- Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Howard, A.G. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv **2017**, arXiv:1704.04861.
- Shen, L.; Sun, G.; Hu, J. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- PASCAL VOC Dataset. Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ (accessed on 10 January 2022).
- Gosling, J.B. Floating Point Operation. In Design of Arithmetic Units for Digital Computers; Springer: New York, NY, USA, 1980.
- MS COCO Dataset. Available online: https://cocodataset.org/#download (accessed on 20 January 2022).

**Figure 6.** Depth-wise convolution uses three kernels to produce an 8 × 8 × 1 image from a 12 × 12 × 1 image.

**Figure 7.** Point-wise convolution transforms an image of three channels into an image of one channel.

**Figure 13.** Implementation of Convolution based on Dimension (ConvDim): in (**a**), each kernel is applied to a pixel (represented by a small dot) independently; in (**b**), any two kernels (height-width, height-depth, and width-depth) are applied to a pixel simultaneously, allowing information to be combined using tensors; finally, in (**c**), all kernels are applied to a pixel simultaneously, allowing information to be aggregated from the tensor efficiently. Convolutional kernels are highlighted in color (depth-wise, width-wise, and height-wise).

**Figure 26.** Box plot to show unoptimized, semi-optimized, and optimized kernel performances for ESPNetv2 versus ESPNetv2 (DBGC) and ShuffleNetv2 versus ShuffleNetv2 (DBGC).

**Figure 27.** Graph to visualize the performances of unoptimized, semi-optimized, and optimized kernels for ESPNetv2 versus ESPNetv2 (DBGC), and ShuffleNetv2 versus ShuffleNetv2 (DBGC), using the PASCAL and COCO datasets.

Sr. No | Layer | Equation to Calculate FLOPs
---|---|---
1 | Convolution layer | 2 × no. of kernels × kernel shape × output shape × repeat count (if applicable)
2 | Pooling layer (without stride) | height × width × depth of the input image
3 | Pooling layer (with stride) | (height/stride) × depth × (width/stride) of the input image
4 | Fully connected (FC) layer | 2 × input size × output size
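The formulas above transcribe directly into helper functions (the function and variable names are mine; the example layer sizes are illustrative):

```python
# FLOP counts per layer type, as given in the table above.
def conv_flops(n_kernels, kernel_shape, output_shape, repeat=1):
    return 2 * n_kernels * kernel_shape * output_shape * repeat

def pool_flops(height, width, depth):                  # pooling, no stride
    return height * width * depth

def strided_pool_flops(height, width, depth, stride):  # pooling with stride
    return (height // stride) * depth * (width // stride)

def fc_flops(input_size, output_size):                 # fully connected
    return 2 * input_size * output_size

# Example: 64 kernels of shape 3*3*3 = 27 producing a 112*112 output map.
print(conv_flops(64, 27, 112 * 112))
print(fc_flops(1024, 10))
```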

Model | Dataset | Image Size | FLOPs (Millions) | Top1 | Top5
---|---|---|---|---|---
ESPNetv2 | PASCAL | 224 × 224 | 86 | 66.1 | 70.02
ShuffleNetv2 | PASCAL | 224 × 224 | 71 | 63.9 | 62.30
ESPNetv2 (DBGC-K_W) | PASCAL | 224 × 224 | 24 | 35.64 | 43.86
ShuffleNetv2 (DBGC-K_W) | PASCAL | 224 × 224 | 21 | 34.5 | 39.54

Model | Dataset | Image Size | FLOPs (Millions) | Top1 | Top5
---|---|---|---|---|---
ESPNetv2 | PASCAL | 224 × 224 | 86 | 66.1 | 70.02
ShuffleNetv2 | PASCAL | 224 × 224 | 71 | 63.9 | 62.30
ESPNetv2 (DBGC-K_H) | PASCAL | 224 × 224 | 24 | 33.4 | 37.66
ShuffleNetv2 (DBGC-K_H) | PASCAL | 224 × 224 | 21 | 32.15 | 36.84

Model | Dataset | Image Size | FLOPs (Millions) | Top1 | Top5
---|---|---|---|---|---
ESPNetv2 | PASCAL | 224 × 224 | 86 | 66.1 | 70.02
ShuffleNetv2 | PASCAL | 224 × 224 | 71 | 63.9 | 62.30
ESPNetv2 (DBGC-K_D) | PASCAL | 224 × 224 | 24 | 33.34 | 36.62
ShuffleNetv2 (DBGC-K_D) | PASCAL | 224 × 224 | 21 | 31.95 | 35.74

Model | Dataset | Image Size | FLOPs (Millions) | Top1 | Top5
---|---|---|---|---|---
ESPNetv2 | PASCAL | 224 × 224 | 86 | 66.1 | 70.02
ShuffleNetv2 | PASCAL | 224 × 224 | 71 | 63.9 | 62.30
ESPNetv2 (DBGC-K_DW) | PASCAL | 224 × 224 | 48 | 66.31 | 71.58
ShuffleNetv2 (DBGC-K_DW) | PASCAL | 224 × 224 | 42 | 65.88 | 69.65

Model | Dataset | Image Size | FLOPs (Millions) | Top1 | Top5
---|---|---|---|---|---
ESPNetv2 | PASCAL | 224 × 224 | 86 | 66.1 | 70.02
ShuffleNetv2 | PASCAL | 224 × 224 | 71 | 63.9 | 62.30
ESPNetv2 (DBGC-K_DH) | PASCAL | 224 × 224 | 48 | 65.25 | 69.63
ShuffleNetv2 (DBGC-K_DH) | PASCAL | 224 × 224 | 42 | 65.95 | 68.53

Model | Dataset | Image Size | FLOPs (Millions) | Top1 | Top5
---|---|---|---|---|---
ESPNetv2 | PASCAL | 224 × 224 | 86 | 66.1 | 70.02
ShuffleNetv2 | PASCAL | 224 × 224 | 71 | 63.9 | 62.30
ESPNetv2 (DBGC-K_HW) | PASCAL | 224 × 224 | 48 | 65.85 | 71.25
ShuffleNetv2 (DBGC-K_HW) | PASCAL | 224 × 224 | 42 | 64.82 | 67.43

Model | Dataset | Image Size | FLOPs (Millions) | Top1 | Top5
---|---|---|---|---|---
ESPNetv2 | PASCAL | 224 × 224 | 86 | 66.1 | 70.02
ShuffleNetv2 | PASCAL | 224 × 224 | 71 | 63.9 | 62.30
ESPNetv2 (DBGC-K_HWD) | PASCAL | 224 × 224 | 72 | 70.83 | 74.56
ShuffleNetv2 (DBGC-K_HWD) | PASCAL | 224 × 224 | 63 | 69.92 | 74.53

**Table 10.** Unoptimized, semi-optimized, and optimized kernel performances for ESPNetv2 versus ESPNetv2 (DBGC) and ShuffleNetv2 versus ShuffleNetv2 (DBGC) for the PASCAL and COCO datasets (Ev2 = ESPNetv2, Sv2 = ShuffleNetv2; FLOPs in millions).

Model | PASCAL FLOPs | PASCAL Top1 | PASCAL Top5 | COCO FLOPs | COCO Top1 | COCO Top5
---|---|---|---|---|---|---
Ev2 | 86 | 66.1 | 70.2 | 92 | 64.34 | 69.23
Sv2 | 71 | 63.9 | 62.3 | 79 | 60.3 | 64.8
Ev2 (DBGC-K_W) | 24 | 35.64 | 43.86 | 31 | 33.41 | 40.53
Sv2 (DBGC-K_W) | 21 | 34.5 | 39.54 | 28 | 31.05 | 37.31
Ev2 (DBGC-K_H) | 24 | 33.4 | 37.66 | 31 | 33.54 | 40.83
Sv2 (DBGC-K_H) | 21 | 32.15 | 36.84 | 28 | 31.5 | 37.11
Ev2 (DBGC-K_D) | 24 | 33.34 | 36.62 | 31 | 33.42 | 41.53
Sv2 (DBGC-K_D) | 21 | 31.95 | 35.74 | 28 | 31.17 | 37.41
Ev2 (DBGC-K_DW) | 48 | 66.31 | 71.58 | 62 | 63.33 | 69.25
Sv2 (DBGC-K_DW) | 42 | 65.88 | 69.65 | 56 | 61.54 | 65.72
Ev2 (DBGC-K_DH) | 48 | 65.25 | 69.63 | 62 | 64.21 | 71.63
Sv2 (DBGC-K_DH) | 42 | 65.95 | 68.53 | 56 | 62.53 | 64.53
Ev2 (DBGC-K_HW) | 48 | 65.85 | 71.25 | 62 | 61.53 | 66.85
Sv2 (DBGC-K_HW) | 42 | 64.82 | 67.43 | 56 | 66.05 | 66.55
Ev2 (DBGC-K_DHW) | 72 | 70.83 | 74.56 | 78 | 68.34 | 75.46
Sv2 (DBGC-K_DHW) | 63 | 69.92 | 74.53 | 69 | 64.23 | 72.5


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Patel, C.; Bhatt, D.; Sharma, U.; Patel, R.; Pandya, S.; Modi, K.; Cholli, N.; Patel, A.; Bhatt, U.; Khan, M.A.;
et al. DBGC: Dimension-Based Generic Convolution Block for Object Recognition. *Sensors* **2022**, *22*, 1780.
https://doi.org/10.3390/s22051780
