# Less Is More: Adaptive Trainable Gradient Dropout for Deep Neural Networks


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Proposed Method

## 4. Experiments and Results

#### 4.1. Datasets

**CIFAR-10** [24]: The CIFAR-10 dataset contains 60,000 32 × 32 color images, divided into 10 classes of 6000 images each. The training set consists of 50,000 images, while the test set contains 10,000 images randomly selected from each class.

**USPS Handwritten Digits (USPS)** [25]: USPS is a handwritten-digit dataset with 7291 training and 2007 8 × 8 test examples, drawn from 10 classes.

**Fashion-MNIST** [26]: Fashion-MNIST is modeled on MNIST [27], a handwritten-digit dataset that is considered an almost solved problem, and is designed to be more challenging. It consists of 28 × 28 grayscale clothing images in 10 classes, split into a training set of 60,000 samples and a test set of 10,000 samples.

**SVHN** [28]: SVHN is a dataset of house-number images obtained from Google Street View. Its structure is similar to that of MNIST: each of the 10 classes consists of images of one digit. The dataset contains over 600,000 digit images, split into 73,257 digits for training, 26,032 digits for testing, and 531,131 additional training examples.

**STL-10** [29]: STL-10 is an image recognition dataset inspired by CIFAR-10 and shares its structure, with 10 classes, each containing 500 96 × 96 training images and 800 96 × 96 test images. The dataset also provides 100,000 unlabeled images for unsupervised training, with content extracted from categories similar to, but not the same as, the original classes, acquired from Imagenet [30]. Although STL-10 was designed for developing scalable unsupervised methods, in this study it was used as a standard supervised classification dataset.
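For quick reference, the figures above can be collected in a small lookup table. The per-class STL-10 counts are multiplied out here, and SVHN's 32 × 32 crop size is added from the dataset's standard description (it is not stated in the text):

```python
# Summary of the benchmark datasets described above; all figures except
# SVHN's image size are taken directly from the text.
DATASETS = {
    "CIFAR-10":      {"train": 50_000, "test": 10_000, "classes": 10, "image": (32, 32)},
    "USPS":          {"train": 7_291,  "test": 2_007,  "classes": 10, "image": (8, 8)},
    "Fashion-MNIST": {"train": 60_000, "test": 10_000, "classes": 10, "image": (28, 28)},
    "SVHN":          {"train": 73_257, "test": 26_032, "classes": 10, "image": (32, 32)},
    # STL-10: 10 classes x 500 training and 800 test images per class,
    # plus 100,000 unlabeled images (not used in the supervised setting here).
    "STL-10":        {"train": 5_000,  "test": 8_000,  "classes": 10, "image": (96, 96)},
}

# Every benchmark is a 10-class classification problem.
assert all(d["classes"] == 10 for d in DATASETS.values())
```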

#### 4.2. Implementation Details

#### 4.3. Results

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Allen, D.M. The Relationship between Variable Selection and Data Augmentation and a Method for Prediction. Technometrics 1974, 16, 125–127.
2. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence—Volume 2 (IJCAI'95); Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1995; pp. 1137–1143.
3. Freund, Y. Boosting a Weak Learning Algorithm by Majority. Inf. Comput. 1995, 121, 256–285.
4. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.
5. Perez, L.; Wang, J. The Effectiveness of Data Augmentation in Image Classification Using Deep Learning. arXiv 2017, arXiv:1712.04621.
6. Cubuk, E.D.; Zoph, B.; Mané, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning Augmentation Policies from Data. arXiv 2018, arXiv:1805.09501.
7. Ohashi, H.; Al-Naser, M.; Ahmed, S.; Akiyama, T.; Sato, T.; Nguyen, P.; Nakamura, K.; Dengel, A. Augmenting Wearable Sensor Data with Physical Constraint for DNN-Based Human-Action Recognition. In Proceedings of the ICML 2017 Time Series Workshop, PMLR, Sydney, Australia, 6–11 August 2017.
8. Prechelt, L. Early Stopping-But When? In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 1998; pp. 55–69.
9. Krogh, A.; Hertz, J.A. A Simple Weight Decay Can Improve Generalization. In Proceedings of the 4th International Conference on Neural Information Processing Systems (NIPS'91); Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1991; pp. 950–957.
10. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
11. Wan, L.; Zeiler, M.D.; Zhang, S.; LeCun, Y.; Fergus, R. Regularization of Neural Networks Using DropConnect. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013.
12. Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K. Deep Networks with Stochastic Depth. arXiv 2016, arXiv:1603.09382.
13. Ghiasi, G.; Lin, T.Y.; Le, Q.V. DropBlock: A Regularization Method for Convolutional Networks. arXiv 2018, arXiv:1810.12890.
14. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552.
15. Larsson, G.; Maire, M.; Shakhnarovich, G. FractalNet: Ultra-Deep Neural Networks without Residuals. arXiv 2017, arXiv:1605.07648.
16. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning Transferable Architectures for Scalable Image Recognition. arXiv 2018, arXiv:1707.07012.
17. Gastaldi, X. Shake-Shake Regularization. arXiv 2017, arXiv:1705.07485.
18. Yamada, Y.; Iwamura, M.; Akiba, T.; Kise, K. ShakeDrop Regularization for Deep Residual Learning. IEEE Access 2019, 7, 186126–186136.
19. Goodfellow, I.; Warde-Farley, D.; Mirza, M.; Courville, A.; Bengio, Y. Maxout Networks. In Proceedings of the 30th International Conference on Machine Learning; PMLR: Atlanta, GA, USA, 2013; Volume 28, pp. 1319–1327.
20. Tseng, H.Y.; Chen, Y.W.; Tsai, Y.H.; Liu, S.; Lin, Y.Y.; Yang, M.H. Regularizing Meta-Learning via Gradient Dropout. arXiv 2020, arXiv:2004.05859.
21. Ba, J.; Frey, B. Adaptive Dropout for Training Deep Neural Networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Lake Tahoe, NV, USA, 2013; Volume 26.
22. Gomez, A.N.; Zhang, I.; Kamalakara, S.R.; Madaan, D.; Swersky, K.; Gal, Y.; Hinton, G.E. Learning Sparse Networks Using Targeted Dropout. arXiv 2019, arXiv:1905.13678.
23. Lin, H.; Zeng, W.; Ding, X.; Huang, Y.; Huang, C.; Paisley, J. Learning Rate Dropout. arXiv 2019, arXiv:1912.00144.
24. Krizhevsky, A.; Nair, V.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
25. Seewald, A.K. Digits—A Dataset for Handwritten Digit Recognition; Institute for Artificial Intelligence: Vienna, Austria, 2005.
26. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747.
27. Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Process. Mag. 2012, 29, 141–142.
28. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 2011.
29. Coates, A.; Ng, A.; Lee, H. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In Proceedings of Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, 2011. Available online: https://cs.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf (accessed on 11 December 2022).
30. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
31. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2014, arXiv:1312.6114.
32. Lu, L. Dying ReLU and Initialization: Theory and Numerical Examples. Commun. Comput. Phys. 2020, 28, 1671–1706.

**Figure 1.** Example of the proposed method. The auxiliary network ensemble **A**, comprising networks ${A}^{1},{A}^{2},\dots ,{A}^{k}$, $k\in \{1,2,\dots ,{L}^{C}\}$, is responsible for providing each layer of network C with a binary mask ${\mathbf{M}}_{k}$, which controls which parts of layer ${l}_{j}^{C}$ will be trained. The applied masks' performance is evaluated on the next forward pass and, if the proposals are accepted, the training procedure continues with the modified version of C, denoted Z.
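The per-layer gradient masking that Figure 1 describes can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions: the binary mask is sampled at random here, whereas in the proposed method it is produced by the trainable auxiliary network ${A}^{k}$ and kept only if the next forward pass accepts the proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Update only the weights whose mask entry is 1; gradients for
    masked-out (0) entries are dropped, leaving those weights frozen."""
    return weights - lr * (grads * mask)

# Toy layer: a 4x4 weight matrix and its gradient.
w = rng.standard_normal((4, 4))
g = rng.standard_normal((4, 4))

# Binary mask M_k for this layer. Sampled at random here; in the paper it
# is proposed by the auxiliary network A^k for layer k.
mask = (rng.random((4, 4)) > 0.5).astype(w.dtype)

w_new = masked_sgd_step(w, g, mask)

# Weights with mask == 0 received no gradient and are unchanged.
assert np.allclose(w_new[mask == 0], w[mask == 0])
```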

**Figure 2.** Architecture of the C network consisting of the input image, four convolutional layers, and two fully connected layers. The proposed method is beneficial to the network even in such minimal setups.
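For intuition, the spatial bookkeeping of such a network can be traced with the standard convolution output-size formula. The 3 × 3 kernels, padding of 1, and the placement of the two pooling layers below are illustrative assumptions; Figure 2 specifies only the layer types:

```python
def conv2d_out(n, kernel=3, stride=1, padding=1):
    """Spatial output size of a square convolution: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def pool2d_out(n, kernel=2, stride=2):
    """Spatial output size of a square max-pooling layer."""
    return (n - kernel) // stride + 1

# Assumed pipeline for a 32 x 32 CIFAR-10 input: each of the four conv
# layers is 3x3 with padding 1 (size-preserving), with 2x2 pooling after
# the second and fourth convolutions.
size = 32
for i in range(4):
    size = conv2d_out(size)      # 3x3 conv, padding 1 -> size unchanged
    if i in (1, 3):
        size = pool2d_out(size)  # halves the spatial resolution

print(size)  # 8: the flattened 8x8 feature map feeds the first FC layer
```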

**Figure 3.** Threshold effect on the third convolution filter, for $p=0.01$, $p=0.05$, and $p=0.1$. Red squares in the first dense example indicate the same pixel neighborhoods, for easier comprehension.
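The threshold effect in Figure 3 can be illustrated with random scores. This sketch assumes that each weight of a filter receives a score in [0, 1) and that scores falling below the threshold $p$ mark gradients to be dropped, so a larger $p$ yields a sparser update; both the score source and the direction of the comparison are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Per-weight scores for a 16x16 convolution filter (a stand-in for the
# auxiliary network's output); values are uniform in [0, 1).
scores = rng.random((16, 16))

def binary_mask(scores, p):
    """Drop (mask to 0) every weight whose score falls below threshold p."""
    return (scores >= p).astype(int)

for p in (0.01, 0.05, 0.1):
    m = binary_mask(scores, p)
    dropped = m.size - m.sum()
    print(f"p={p}: {dropped} of {m.size} gradients dropped")
```

With uniform scores, the expected fraction of dropped gradients is roughly $p$, which matches the increasingly sparse filters shown in the figure.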

**Figure 4.** Performances of LIM and vanilla dropout methods on the CIFAR10 dataset, trained and tested for 100 epochs. Graphs were smoothed for better comprehension; original graphs can be seen in the background. Curves correspond to Table 1 scores; black: vanilla, red: 1, green: 2, blue: 3, orange: 4, purple: 5.

**Figure 5.** Performance of LIM and vanilla dropout methods on the USPS dataset, trained and tested for 100 epochs. Graphs were smoothed for better comprehension; original graphs can be seen in the background. Curves correspond to Table 2 scores; black: vanilla, purple: 1, red: 2, orange: 3, green: 4, blue: 5.

**Figure 6.** Performances of LIM and vanilla dropout methods on the Fashion-MNIST dataset, trained and tested for 50 epochs. Curves correspond to Table 3 scores; black: vanilla, red: 1, orange: 2, green: 3, blue: 4, purple: 5.

**Figure 7.** Performances of LIM and vanilla dropout methods on the STL-10 dataset, trained and tested for 100 epochs. Graphs were smoothed for better comprehension; original graphs can be seen in the background. Curves correspond to Table 4 scores; black: vanilla, red: 1, purple: 2, green: 3, orange: 4, blue: 5.

**Figure 8.** Performances of LIM and vanilla dropout methods on the SVHN dataset, trained and tested for 50 epochs. Graphs were smoothed for better comprehension; original graphs can be seen in the background. Curves correspond to Table 5 scores; black: vanilla, red: 1, green: 2, blue: 3, orange: 4, purple: 5.

**Table 1.** Accuracy scores, parameter tuning, and convergence times for different experiments on the CIFAR10 dataset. The first row holds the best accuracy score for the vanilla dropout version and the epoch at which it was attained.

| # | acc | epoch | conv1 | conv2 | conv3 | conv4 | fc1 | int | wMask |
|---|-----|-------|-------|-------|-------|-------|-----|-----|-------|
| v | 72.8 | 70 | | | | | | | |
| 1 | 73.26 | 93 | 0.01 | 0.01 | 0.05 | 0.08 | 0.5 | 10 | 10 |
| 2 | 72.97 | 49 | 0.001 | 0.002 | 0.003 | 0.05 | 0.4 | 1 | 1 |
| 3 | 72.74 | 78 | 0.001 | 0.002 | 0.004 | 0.01 | 0.7 | 10 | 10 |
| 4 | 72.25 | 51 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 1 | - |
| 5 | 71.9 | 6 | 0.01 | 0.02 | 0.05 | 0.1 | 0.4 | 1 | 10 |

**Table 2.** Accuracy scores, parameter tuning, and convergence times for different experiments on the USPS dataset. The first row holds the best accuracy score for the vanilla dropout version and the epoch at which it was attained.

| # | acc | epoch | conv1 | conv2 | conv3 | conv4 | fc1 | int | wMask |
|---|-----|-------|-------|-------|-------|-------|-----|-----|-------|
| v | 96.21 | 94 | | | | | | | |
| 1 | 96.71 | 49 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 1 | - |
| 2 | 96.71 | 85 | 0.01 | 0.01 | 0.02 | 0.03 | 0.3 | 1 | 4 |
| 3 | 96.46 | 28 | 0.002 | 0.002 | 0.003 | 0.05 | 0.2 | 1 | 1 |
| 4 | 96.41 | 92 | 0.001 | 0.001 | 0.01 | 0.2 | 0.5 | 5 | 1 |
| 5 | 96.41 | 53 | 0.001 | 0.001 | 0.002 | 0.01 | 0.1 | 1 | 20 |

**Table 3.** Accuracy scores, parameter tuning, and convergence times for different experiments on the Fashion-MNIST dataset. The first row holds the best accuracy score for the vanilla dropout version and the epoch at which it was attained.

| # | acc | epoch | conv1 | conv2 | conv3 | conv4 | fc1 | int | wMask |
|---|-----|-------|-------|-------|-------|-------|-----|-----|-------|
| v | 90.51 | 48 | | | | | | | |
| 1 | 92.55 | 28 | 0.0 | 0.002 | 0.007 | 0.01 | 0.5 | 1 | 1 |
| 2 | 92.26 | 35 | 0.008 | 0.008 | 0.008 | 0.008 | 0.8 | 1 | 1 |
| 3 | 91.84 | 23 | 0.005 | 0.01 | 0.05 | 0.1 | 0.5 | 1 | 1 |
| 4 | 91.44 | 34 | 0.005 | 0.005 | 0.01 | 0.02 | 0.5 | 1 | 1 |
| 5 | 91.36 | 43 | 0.01 | 0.01 | 0.01 | 0.2 | 0.5 | 1 | 1 |

**Table 4.** Accuracy scores, parameter tuning, and convergence times for different experiments on the STL-10 dataset. The first row holds the best accuracy score for the vanilla dropout version and the epoch at which it was attained.

| # | acc | epoch | conv1 | conv2 | conv3 | conv4 | fc1 | int | wMask |
|---|-----|-------|-------|-------|-------|-------|-----|-----|-------|
| v | 50.08 | 99 | | | | | | | |
| 1 | 52.26 | 22 | 0.001 | 0.001 | 0.002 | 0.05 | 0.2 | 5 | 1 |
| 2 | 51.76 | 100 | 0.001 | 0.002 | 0.01 | 0.01 | 0.1 | 2 | 1 |
| 3 | 50.86 | 28 | 0.1 | 0.1 | 0.15 | 0.2 | 0.4 | 5 | 1 |
| 4 | 50.79 | 12 | 0.02 | 0.02 | 0.1 | 0.2 | 0.4 | 5 | 1 |
| 5 | 50.56 | 48 | 0.001 | 0.001 | 0.002 | 0.01 | 0.3 | 5 | 5 |

**Table 5.** Accuracy scores, parameter tuning, and convergence times for different experiments on the SVHN dataset. The first row holds the best accuracy score for the vanilla dropout version and the epoch at which it was attained.

| # | acc | epoch | conv1 | conv2 | conv3 | conv4 | fc1 | int | wMask |
|---|-----|-------|-------|-------|-------|-------|-----|-----|-------|
| v | 90.19 | 17 | | | | | | | |
| 1 | 90.45 | 12 | 0.001 | 0.001 | 0.002 | 0.01 | 0.6 | 1 | 1 |
| 2 | 90.43 | 11 | 0.001 | 0.003 | 0.008 | 0.01 | 0.5 | 1 | 1 |
| 3 | 90.36 | 12 | 0.001 | 0.001 | 0.01 | 0.02 | 0.5 | 1 | 1 |
| 4 | 90.25 | 26 | 0.001 | 0.001 | 0.002 | 0.005 | 0.6 | 1 | 1 |
| 5 | 90.17 | 5 | 0.001 | 0.003 | 0.008 | 0.01 | 0.3 | 1 | 2 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Avgerinos, C.; Vretos, N.; Daras, P.
Less Is More: Adaptive Trainable Gradient Dropout for Deep Neural Networks. *Sensors* **2023**, *23*, 1325.
https://doi.org/10.3390/s23031325
