# Empirical Evaluation of the Effect of Optimization and Regularization Techniques on the Generalization Performance of Deep Convolutional Neural Network


## Abstract


## 1. Introduction

## 2. Optimization

- **Speed of convergence**: the time the algorithm needs to reach the optimal value.
- **Generalization**: the model's performance on new data.

#### 2.1. Classical Iterative Optimization Algorithms

#### 2.1.1. Stochastic Gradient Descent (SGD)

#### 2.1.2. Momentum

The **momentum** term $\mathit{m}$ in classical stochastic gradient descent helps to accelerate learning in relevant directions and to reduce oscillations during training by slowing down updates along dimensions where the gradient is inconsistent, i.e., dimensions in which the sign of the gradient changes often. The momentum [22] update rule is given by $m_t = \gamma m_{t-1} - \eta \nabla J(\theta_{t-1})$, $\theta_t = \theta_{t-1} + m_t$, where $\gamma$ is the momentum coefficient and $\eta$ the learning rate.
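In code, the momentum update amounts to maintaining one extra velocity variable per parameter. A minimal Python sketch on the toy objective $J(\theta)=\theta^2$ (the objective and hyperparameter values are illustrative assumptions, not the models studied here):

```python
def momentum_step(theta, m, grad, lr=0.01, gamma=0.9):
    """One SGD-with-momentum update: the velocity m accumulates past gradients,
    accelerating along consistent directions and damping oscillating ones."""
    m = gamma * m - lr * grad   # m_t = gamma * m_{t-1} - eta * grad
    theta = theta + m           # theta_t = theta_{t-1} + m_t
    return theta, m

# toy objective J(theta) = theta^2 with gradient 2*theta, starting at theta = 5
theta, m = 5.0, 0.0
for _ in range(300):
    theta, m = momentum_step(theta, m, 2.0 * theta)
# theta ends up close to the minimum at 0
```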

#### 2.1.3. Nesterov Accelerated Gradient Descent (NAG)

#### 2.2. Adaptive Learning Rate Optimizers

#### 2.2.1. Adagrad

#### 2.2.2. Adadelta

#### 2.2.3. RMSProp

#### 2.2.4. Adam

- The estimates of the first moment and the second raw moment, i.e., the accumulation variables $\mathit{m}$ and $\mathit{v}$, respectively, used for the parameter update in Adam are computed as exponential moving averages.
- Adam includes initialization bias-correction terms for the first- and second-moment estimates, which, because they are initialized to the zero vector, are biased towards 0 in the initial iterations.

#### 2.2.5. AdaMax

**AdaMax**, a variant of Adam that uses the ${L}_{\infty}$ norm instead of the ${L}_{2}$ norm, is presented.
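A minimal single-parameter sketch of the resulting update, under the same toy assumptions as before (note that only the first moment needs bias correction; practical implementations often add a small $\epsilon$ to the denominator):

```python
def adamax_step(theta, m, u, grad, t, lr=0.001, beta1=0.9, beta2=0.999):
    """One AdaMax update: u_t, an exponentially weighted infinity norm of past
    gradients, replaces the square root of Adam's second-moment estimate."""
    m = beta1 * m + (1 - beta1) * grad
    u = max(beta2 * u, abs(grad))   # L_inf accumulator; needs no bias correction
    m_hat = m / (1 - beta1 ** t)    # only the first moment is bias-corrected
    theta = theta - lr * m_hat / u
    return theta, m, u

# toy objective J(theta) = theta^2 with gradient 2*theta, starting at theta = 3
theta, m, u = 3.0, 0.0, 0.0
for t in range(1, 301):
    theta, m, u = adamax_step(theta, m, u, 2.0 * theta, t, lr=0.1)
```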

#### 2.2.6. Nadam

#### 2.3. Batch Normalization

## 3. Regularization

#### 3.1. ${L}_{2}$ Regularization

#### 3.2. ${L}_{1}$ Regularization

#### 3.3. Noise Injection

#### 3.4. Dropout

- adding Bernoulli noise to the input and hidden units, where the noise can be seen as the loss of the information carried by the dropped unit; and
- averaging the outputs of approximately ${2}^{n}$ subnetworks that share weights, obtained from the original network with $n$ units by randomly removing non-output units.
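The first of these views maps directly onto the usual inverted-dropout implementation; a minimal NumPy sketch (the activation vector `h` and the drop probability are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    """Inverted dropout: each unit is zeroed with probability p (Bernoulli noise);
    survivors are scaled by 1/(1-p) so the expected activation is unchanged,
    letting test time use the full network without rescaling."""
    if not training:
        return h
    keep = rng.random(h.shape) >= p   # one Bernoulli draw per unit
    return h * keep / (1.0 - p)

h = np.ones(10_000)
out = dropout(h, p=0.5)
# roughly half of the units are zeroed, while the mean stays near 1 in expectation
```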

#### 3.5. Data Augmentation

#### 3.6. Ensemble Learning

- Training multiple neural networks is computationally expensive.
- Constructing an ensemble of neural networks with different architectures requires fine-tuning the hyperparameters of each of them.
- Training one deep neural network already requires a large amount of data; training k networks on entirely different datasets requires k times as much training data.

#### Bagging

#### 3.7. Early Stopping

- stop the training if the validation error has increased in p successive epochs (with respect to the lowest validation error up to that point);
- stop the training if the validation error has not decreased by at least ${\delta}_{min}>0$ in p successive epochs; or
- stop if the validation error exceeds some predefined threshold.
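The first two criteria can be expressed as a single patience-based rule; a minimal sketch (the validation-error sequence is hypothetical):

```python
def stopping_epoch(val_errors, patience=5, min_delta=0.0):
    """Return the epoch at which training stops: the first epoch at which the
    validation error has not improved on the best value by at least min_delta
    for `patience` successive epochs (weights from best_epoch would be restored)."""
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best - min_delta:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_errors) - 1   # criterion never triggered

# the error improves up to epoch 2, then plateaus: training stops at epoch 7
errors = [0.9, 0.7, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55, 0.60, 0.60]
print(stopping_epoch(errors))  # → 7
```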

## 4. Experiments

#### 4.1. Baseline Model Architectures and Dataset Description

#### 4.1.1. Baseline Model Architectures

#### 4.1.2. Datasets

#### 4.2. Results

#### 4.2.1. Evaluation of Optimization Algorithms

- The best test-set results, in terms of accuracy, are obtained using the classical Nesterov optimization algorithm and the adaptive algorithm Adam together with its variant AdaMax.
- Compared to the Nesterov optimization algorithm, Adam and AdaMax show less stable performance (with many “jumps”) on validation data. Among the adaptive optimizers, Adagrad and Adadelta show the most stable, though not necessarily the best, performance on validation data, especially in terms of loss.
- The RMSProp optimization algorithm, in all six cases, has a considerably larger validation loss than the other optimizers, and it keeps growing consistently. Interestingly, despite the great discrepancy between the loss of RMSProp and that of the other optimizers, its validation, and ultimately test-set, accuracy remains reasonably good and comparable with the others.
- In terms of test-set accuracy, the ranking of the classical optimizers stays consistent across all six models: Nesterov ranks best, followed by Momentum, with SGD last. Among the adaptive optimizers, Adagrad ranks last, closely followed by RMSProp.
- Within 350 epochs, most of the optimizers succeed in reaching ≈0 loss and ≈100% accuracy on the training data in all six models. The exceptions are SGD and RMSProp, with the overall worst performance obtained by SGD at 95.43% training accuracy.
- In the early stages, especially on Model 2 and Model 3 (Fashion-MNIST), all optimizers besides SGD on Model 1 show signs of overfitting. A large gap between the accuracies on the training and new data is noticeable throughout the training process.

#### Batch Normalization

#### 4.2.2. Evaluation of Different Regularization Techniques

#### Weight Decay

#### Noise Injection

#### Dropout

#### Data Augmentation

#### Early Stopping

#### Ensemble Learning

- **Bagging.** Let ${\mathcal{D}}_{1},\cdots ,{\mathcal{D}}_{5}$ be datasets of size 40,000 obtained by random sampling with replacement from the CIFAR-10 or Fashion-MNIST training dataset. A Baseline Learner ($BL_i$) with the same architecture as the chosen model is trained on dataset ${\mathcal{D}}_{i}$, $i=1,\cdots ,5$, and $Ens_i$ denotes the ensemble $\{BL_1,BL_2,\dots ,BL_i\}$ of the first $i$ baseline learners. Each model from the bagged ensemble has an accuracy lower than that of the baselines noted in the last rows of Table 11 and Table 12, caused by the less diverse training dataset, which contains multiple identical images; together, however, they outperform the baseline. Each ensemble has better generalization performance than any of its members, and the accuracy of an ensemble increases with its size. The generalization performance of the models that obtained the highest accuracy on the test data increases further when we apply the bagging technique, as shown in Table 13. The downside of the bagging method is the additional time needed to train all of the base learners to obtain the desired enhancement in generalization performance.
- **The ensemble of members trained with different settings.** Below, we examine how ensembling models with different architectures and settings (in terms of the regularization and optimization techniques used) affects the ensemble’s generalization performance compared to the bagging approach. The final accuracies of such ensembles on the CIFAR-10 and Fashion-MNIST data are given in Table 14. The baseline learners used for CIFAR-10 image classification in Table 14a are Model 3 (NAG + Data Augmentation), Model 1 (NAG + Data Augmentation + Batch Normalization), Model 2 (Adam + Data Augmentation + Batch Normalization), Model 1 (NAG + Dropout), and Model 2 (Adam + Dropout + Batch Normalization). For Fashion-MNIST, the baseline learners given in Table 14b are Model 1 (Adam + Dropout + Batch Normalization), Model 2 (Adam + Data Augmentation + Batch Normalization), Model 3 (Data Augmentation), Model 2 (Adam + Data Augmentation + Batch Normalization + Dropout), and Model 1 (Adam + Data Augmentation + Batch Normalization + Dropout). The ensemble formed of “different” members outperforms the bagged ensemble created using the one model with the best generalization performance among the “different” ones.
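However the ensemble is built, one common way to combine its members is to average their class-probability outputs and predict the majority class; a minimal NumPy sketch with hypothetical base-learner outputs:

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average the per-model class-probability outputs (one array of shape
    [n_samples, n_classes] per member) and take the argmax per sample."""
    return np.mean(prob_list, axis=0).argmax(axis=1)

# three hypothetical base learners, two samples, three classes
p1 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p2 = np.array([[0.4, 0.4, 0.2], [0.1, 0.6, 0.3]])
p3 = np.array([[0.5, 0.2, 0.3], [0.3, 0.3, 0.4]])
print(ensemble_predict([p1, p2, p3]))  # → [0 1]
```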

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A. Optimizers Update Rules Table Summary

Classical optimizers | ||
---|---|---|

SGD (1951, [21]) | $\theta_t = \theta_{t-1} - \eta \nabla J(\theta_{t-1})$ | inputs: •$\eta$ |

Momentum (1964, [22]) | $m_0 = \mathbf{0};\;\; m_t = \gamma m_{t-1} - \eta \nabla J(\theta_{t-1});\;\; \theta_t = \theta_{t-1} + m_t$ | inputs: •$\eta$ •$\gamma$ |

Nesterov (1983, [23]) | $m_0 = \mathbf{0};\;\; m_t = \gamma m_{t-1} - \eta \nabla J(\theta_{t-1} + \gamma m_{t-1});\;\; \theta_t = \theta_{t-1} + m_t$ | inputs: •$\eta$ •$\gamma$ |

Optimizers with adaptive learning rate | ||

Adagrad (2011, [24]) | $v_0 = \mathbf{0};\;\; v_t = v_{t-1} + \left(\nabla J(\theta_{t-1})\right)^2;\;\; \theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{v_t} + \epsilon} \circ \nabla J(\theta_{t-1})$ | inputs: •$\eta$ •$\epsilon$ |

Adadelta (2012, [25]) | $m_0 = \mathbf{0},\; v_0 = \mathbf{0};\;\; v_t = \beta v_{t-1} + (1-\beta)\left(\nabla J(\theta_{t-1})\right)^2;\;\; \Delta\theta_t = -\dfrac{\sqrt{m_{t-1} + \epsilon}}{\sqrt{v_t + \epsilon}} \circ \nabla J(\theta_{t-1});\;\; m_t = \beta m_{t-1} + (1-\beta)\left(\Delta\theta_t\right)^2;\;\; \theta_t = \theta_{t-1} + \Delta\theta_t$ | inputs: •$\beta$ •$\epsilon$ |

RMSProp (2012, [26]) | $v_0 = \mathbf{0};\;\; v_t = \beta v_{t-1} + (1-\beta)\left(\nabla J(\theta_{t-1})\right)^2;\;\; \theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{v_t} + \epsilon} \circ \nabla J(\theta_{t-1})$ | inputs: •$\eta$ •$\beta$ •$\epsilon$ |

Adam & AdaMax (2014, [28]) | Adam: | inputs: •$\eta$ •$\beta_1$ •$\beta_2$ •$\epsilon$ |

$m_0 = \mathbf{0},\quad v_0 = \mathbf{0}$ | ||

$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla J(\theta_{t-1})$ | ||

$v_t = \beta_2 v_{t-1} + (1-\beta_2)\left(\nabla J(\theta_{t-1})\right)^2$ | ||

$\widehat{m}_t = \dfrac{m_t}{1-\beta_1^t},\quad \widehat{v}_t = \dfrac{v_t}{1-\beta_2^t}$ | ||

$\theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{\widehat{v}_t} + \epsilon} \circ \widehat{m}_t$ | ||

AdaMax: | ||

$m_0 = \mathbf{0},\quad u_0 = \mathbf{0}$ | ||

$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla J(\theta_{t-1})$ | ||

$\widehat{m}_t = \dfrac{m_t}{1-\beta_1^t},\quad u_t = \max\left\{\beta_2 u_{t-1},\; \left|\nabla J(\theta_{t-1})\right|\right\}$ | ||

$\theta_t = \theta_{t-1} - \dfrac{\eta}{u_t} \circ \widehat{m}_t$ | ||

Nadam (2015, [29]) | $\begin{array}{l} m_0 = \mathbf{0},\quad v_0 = \mathbf{0} \\ \mu_t = \beta_1\left(1 - 0.5\cdot 0.96^{0.004t}\right) \\ m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla J(\theta_{t-1}) \\ v_t = \beta_2 v_{t-1} + (1-\beta_2)\left(\nabla J(\theta_{t-1})\right)^2 \\ \widehat{m}_t = \dfrac{m_t}{1-\prod_{i=1}^{t+1}\mu_i},\quad \widehat{g}_t = \dfrac{\nabla J(\theta_{t-1})}{1-\prod_{i=1}^{t}\mu_i} \\ \overline{m}_t = (1-\mu_t)\widehat{g}_t + \mu_{t+1}\widehat{m}_t \\ \widehat{v}_t = \dfrac{v_t}{1-\beta_2^t} \\ \theta_t = \theta_{t-1} - \dfrac{\eta}{\sqrt{\widehat{v}_t} + \epsilon} \circ \overline{m}_t \end{array}$ | inputs: •$\eta$ •$\beta_1$ •$\beta_2$ •$\epsilon$ |

## Appendix B. Used Hyperparameters for Optimizer Per Model

Optimizer | Model 1 | Model 2 | Model 3 |
---|---|---|---|

SGD | lr = 0.01 | lr = 0.05 | lr = 0.01 |

Momentum | lr = 0.01, momentum = 0.9 | lr = 0.01, momentum = 0.9 | lr = 0.01, momentum = 0.9 |

Nesterov | lr = 0.01, momentum = 0.95 | lr = 0.01, momentum = 0.9 | lr = 0.005, momentum = 0.95 |

Adagrad | lr = 0.05, $\epsilon={10}^{-7}$ | lr = 0.05, $\epsilon={10}^{-7}$ | lr = 0.01, $\epsilon={10}^{-7}$ |

Adadelta | lr = 0.5, $\rho =0.95$, $\epsilon={10}^{-7}$ | lr = 0.5, $\rho =0.95$, $\epsilon={10}^{-7}$ | lr = 0.5, $\rho =0.95$, $\epsilon={10}^{-7}$ |

RMSProp | lr = 0.001, $\rho =0.9$, $\epsilon={10}^{-7}$ | lr = 0.0005, $\rho =0.9$, $\epsilon={10}^{-7}$ | lr = 0.0001, $\rho =0.9$, $\epsilon={10}^{-7}$ |

Adam | lr = 0.0005, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.0005, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.0001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ |

AdaMax | lr = 0.001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ |

Nadam | lr = 0.0005, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.0005, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.0001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ |

Optimizer | Model 1 | Model 2 | Model 3 |
---|---|---|---|

SGD | lr = 0.05 | lr = 0.05 | lr = 0.05 |

Momentum | lr = 0.01, momentum = 0.95 | lr = 0.005, momentum = 0.95 | lr = 0.01, momentum = 0.9 |

Nesterov | lr = 0.01, momentum = 0.95 | lr = 0.01, momentum = 0.95 | lr = 0.01, momentum = 0.95 |

Adagrad | lr = 0.05, $\epsilon={10}^{-7}$ | lr = 0.05, $\epsilon={10}^{-7}$ | lr = 0.05, $\epsilon={10}^{-7}$ |

Adadelta | lr = 0.5, $\rho =0.95$, $\epsilon={10}^{-7}$ | lr = 0.5, $\rho =0.95$, $\epsilon={10}^{-7}$ | lr = 1, $\rho =0.95$, $\epsilon={10}^{-7}$ |

RMSProp | lr = 0.001, $\rho =0.9$, $\epsilon={10}^{-7}$ | lr = 0.0001, $\rho =0.9$, $\epsilon={10}^{-7}$ | lr = 0.0001, $\rho =0.9$, $\epsilon={10}^{-7}$ |

Adam | lr = 0.0005, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.0005, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.0001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ |

AdaMax | lr = 0.0001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ |

Nadam | lr = 0.001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.0001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ | lr = 0.001, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, $\epsilon={10}^{-7}$ |

## References

1. Liu, Y.; Wang, Y.; Wang, S.; Liang, T.; Zhao, Q.; Tang, Z.; Ling, H. Cbnet: A novel composite backbone network architecture for object detection. arXiv **2019**, arXiv:1909.03625.
2. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
3. Saon, G.; Kurata, G.; Sercu, T.; Audhkhasi, K.; Thomas, S.; Dimitriadis, D.; Cui, X.; Ramabhadran, B.; Picheny, M.; Lim, L.L.; et al. English conversational telephone speech recognition by humans and machines. arXiv **2017**, arXiv:1703.02136.
4. Edunov, S.; Ott, M.; Auli, M.; Grangier, D. Understanding back-translation at scale. arXiv **2018**, arXiv:1808.09381.
5. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv **2019**, arXiv:1901.02860.
6. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv **2017**, arXiv:1611.03530.
7. Dogo, E.; Afolabi, O.; Nwulu, N.; Twala, B.; Aigbavboa, C. A comparative analysis of gradient descent-based optimization algorithms on convolutional neural networks. In Proceedings of the 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), Belgaum, India, 21–22 December 2018; pp. 92–99.
8. Choi, D.; Shallue, C.J.; Nado, Z.; Lee, J.; Maddison, C.J.; Dahl, G.E. On empirical comparisons of optimizers for deep learning. arXiv **2019**, arXiv:1910.05446.
9. Bera, S.; Shrivastava, V.K. Analysis of various optimizers on deep convolutional neural network model in the application of hyperspectral remote sensing image classification. Int. J. Remote Sens. **2020**, 41, 2664–2683.
10. Kandel, I.; Castelli, M.; Popovič, A. Comparative Study of First Order Optimizers for Image Classification Using Convolutional Neural Networks on Histopathology Images. J. Imaging **2020**, 6, 92.
11. Soydaner, D. A Comparison of Optimization Algorithms for Deep Learning. Int. J. Pattern Recognit. Artif. Intell. **2020**, 2052013.
12. Smirnov, E.A.; Timoshenko, D.M.; Andrianov, S.N. Comparison of regularization methods for imagenet classification with deep convolutional neural networks. Aasri Procedia **2014**, 6, 89–94.
13. Nusrat, I.; Jang, S.B. A comparison of regularization techniques in deep neural networks. Symmetry **2018**, 10, 648.
14. Garbin, C.; Zhu, X.; Marques, O. Dropout vs. batch normalization: An empirical study of their impact to deep learning. Multimed. Tools Appl. **2020**, 79, 1–39.
15. Chen, G.; Chen, P.; Shi, Y.; Hsieh, C.Y.; Liao, B.; Zhang, S. Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks. arXiv **2019**, arXiv:1905.05928.
16. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv **2014**, arXiv:1409.1556.
17. Kukačka, J.; Golkov, V.; Cremers, D. Regularization for deep learning: A taxonomy. arXiv **2017**, arXiv:1710.10686.
18. Krizhevsky, A.; Nair, V.; Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). Available online: http://www.cs.toronto.edu/kriz/cifar.html (accessed on 29 August 2020).
19. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv **2017**, arXiv:1708.07747.
20. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv **2015**, arXiv:1502.03167.
21. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. **1951**, 22, 400–407.
22. Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. **1964**, 4, 1–17.
23. Nesterov, Y.E. A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk Sssr **1983**, 269, 543–547.
24. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. **2011**, 12, 2121–2159.
25. Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv **2012**, arXiv:1212.5701.
26. Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. **2012**, 4, 26–31.
27. Graves, A. Generating sequences with recurrent neural networks. arXiv **2013**, arXiv:1308.0850.
28. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv **2014**, arXiv:1412.6980.
29. Dozat, T. Incorporating Nesterov Momentum into Adam. 2016. Available online: http://cs229.stanford.edu/proj2015/054_report.pdf (accessed on 29 August 2020).
30. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1139–1147.
31. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. Available online: http://www.deeplearningbook.org (accessed on 29 August 2020).
32. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. **2014**, 15, 1929–1958.
33. Breiman, L. Bagging predictors. Mach. Learn. **1996**, 24, 123–140.
34. Prechelt, L. Early stopping-but when? In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 1998; pp. 55–69.
35. Springenberg, J.T.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv **2014**, arXiv:1412.6806.
36. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; Lake Tahoe, NV, USA, 2012; pp. 1097–1105.

**Figure 5.** The effect of Batch Normalization on the loss of baseline models trained on the CIFAR-10 dataset.

**Figure 6.** The effect of Batch Normalization on the loss of baseline models trained on the Fashion-MNIST dataset.

**Figure 7.** Loss learning curves of models that incorporate ${L}_{2}$ and ${L}_{1}$ weight decay in baseline models trained on the CIFAR-10 dataset.

**Figure 8.** Loss learning curves of models that incorporate ${L}_{2}$ and ${L}_{1}$ weight decay in baseline models trained on the Fashion-MNIST dataset.

**Figure 9.** Comparison of the baseline’s weights with the weights obtained by models that use weight decay regularization methods with regularization parameter $\lambda =5\times {10}^{-5}$.

**Figure 13.** The effect of Batch Normalization on the accuracy of models trained on the CIFAR-10 dataset that incorporate Dropout regularization.

**Figure 14.** The effect of Batch Normalization on the accuracy of models trained on the Fashion-MNIST dataset that incorporate Dropout regularization.

**Figure 15.** Examples from new CIFAR-10 (**top**) and Fashion-MNIST (**bottom**) test sets with missing pixel values.

Model 1 | Model 2 | Model 3 |
---|---|---|

Conv 96, $3\times 3$ | Conv 64, $3\times 3$ | Conv 96, $3\times 3$ |

Conv 96, $3\times 3$ | Conv 128, $3\times 3$ | MaxPooling |

MaxPooling | MaxPooling | Conv 256, $3\times 3$ |

Conv 192, $3\times 3$ | Conv 128, $3\times 3$ | MaxPooling |

Conv 192, $3\times 3$ | Conv 256, $3\times 3$ | Conv 384, $3\times 3$ |

MaxPooling | MaxPooling | Conv 384, $3\times 3$ |

Conv 192, $3\times 3$ | FC 128 | Conv 256, $3\times 3$ |

Conv 192, $1\times 1$ | FC-Softmax 10 | FC 4096 |

Conv 10, $1\times 1$ | FC 4096 | |

GlobalAveraging | FC-Softmax 10 | |

FC-Softmax 10 | ||

≈955 K params | ≈2.1 M params | ≈56 M params |

Model 1 | Model 2 | Model 3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|

Optimizer | Loss | Accuracy (%) | Loss | Accuracy (%) | Loss | Accuracy (%) | ||||||

Train | Test | Train | Test | Train | Test | Train | Test | Train | Test | Train | Test | |

SGD | $2.25\times {10}^{-4}$ | $1.803$ | $100.00$ | $80.61$ | $7.15\times {10}^{-6}$ | 2.688 | $100.00$ | $76.49$ | $7.73\times {10}^{-5}$ | $2.119$ | $100.00$ | $76.25$ |

Momentum | $2.65\times {10}^{-6}$ | 1.926 | $100.00$ | $83.17$ | $1.36\times {10}^{-6}$ | 2.442 | $100.00$ | $78.37$ | $8.37\times {10}^{-7}$ | 2.171 | $100.00$ | $79.23$ |

NAG | $5.64\times {10}^{-7}$ | 1.516 | 100.00 | 83.68 | $9.79\times {10}^{-7}$ | 2.434 | $100.00$ | 79.11 | $5.87\times {10}^{-7}$ | 2.007 | $100.00$ | 80.40 |

Adagrad | $1.38\times {10}^{-5}$ | 2.105 | 100.00 | 82.20 | $3.65\times {10}^{-6}$ | 2.586 | $100.00$ | $77.49$ | $4.13\times {10}^{-5}$ | 2.269 | $100.00$ | $77.38$ |

Adadelta | $4.94\times {10}^{-7}$ | 2.218 | 100.00 | 83.65 | $1.92\times {10}^{-7}$ | $2.884$ | $100.00$ | $79.09$ | $1.32\times {10}^{-7}$ | $2.567$ | $100.00$ | $78.46$ |

RMSProp | $9.02\times {10}^{-2}$ | 9.545 | 99.34 | 82.04 | $7.59\times {10}^{-2}$ | $35.001$ | 99.79 | $78.99$ | $4.74\times {10}^{-2}$ | $4.064$ | 99.39 | $76.37$ |

Adam | $5.86\times {10}^{-5}$ | 1.528 | 100.00 | 82.93 | $2.00\times {10}^{-10}$ | $3.893$ | $100.00$ | 79.84 | $2.86\times {10}^{-11}$ | $2.517$ | $100.00$ | $79.07$ |

AdaMax | $6.95\times {10}^{-9}$ | 2.341 | 100.00 | 82.83 | $4.33\times {10}^{-4}$ | 3.119 | $99.99$ | $78.44$ | $0.00\times {10}^{0}$ | 3.405 | $100.00$ | $80.61$ |

Nadam | $9.63\times {10}^{-10}$ | 3.807 | 100.00 | 82.24 | $5.00\times {10}^{-3}$ | 4.925 | $100.00$ | $78.65$ | $0.00\times {10}^{0}$ | $2.400$ | $100.00$ | $80.09$ |

Model 1 | Model 2 | Model 3 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|

Optimizer | Loss | Accuracy (%) | Loss | Accuracy (%) | Loss | Accuracy (%) | ||||||

Train | Test | Train | Test | Train | Test | Train | Test | Train | Test | Train | Test | |

SGD | $1.25\times {10}^{-1}$ | 0.353 | $95.43$ | $89.64$ | $1.34\times {10}^{-5}$ | 0.833 | $100.00$ | $92.43$ | $1.13\times {10}^{-5}$ | $0.954$ | $100.00$ | $91.88$ |

Momentum | $1.36\times {10}^{-6}$ | 0.845 | $100.00$ | $92.54$ | $3.58\times {10}^{-6}$ | 0.779 | $100.00$ | $92.81$ | $1.35\times {10}^{-6}$ | 1.022 | $100.00$ | $92.13$ |

NAG | $1.09\times {10}^{-6}$ | 0.842 | 100.00 | 92.66 | $5.82\times {10}^{-7}$ | 0.863 | $100.00$ | 93.34 | $5.38\times {10}^{-7}$ | 0.937 | $100.00$ | 92.27 |

Adagrad | $4.34\times {10}^{-5}$ | 0.940 | 100.00 | 91.46 | $5.57\times {10}^{-6}$ | 0.852 | $100.00$ | $92.62$ | $1.93\times {10}^{-6}$ | 1.139 | $100.00$ | $91.53$ |

Adadelta | $7.82\times {10}^{-7}$ | 0.947 | 100.00 | 93.11 | $5.13\times {10}^{-7}$ | $0.961$ | $100.00$ | $93.20$ | $1.41\times {10}^{-7}$ | $1.142$ | $100.00$ | $92.19$ |

RMSProp | $1.72\times {10}^{-2}$ | 2.965 | 99.81 | 92.32 | $4.06\times {10}^{-4}$ | $1.219$ | 99.99 | $92.60$ | $1.64\times {10}^{-2}$ | $2.211$ | $99.73$ | $91.05$ |

Adam | $5.84\times {10}^{-8}$ | 0.920 | 100.00 | 93.27 | $1.27\times {10}^{-10}$ | $1.029$ | $100.00$ | 93.53 | $1.01\times {10}^{-9}$ | $1.328$ | $100.00$ | $92.56$ |

AdaMax | $9.93\times {10}^{-5}$ | 0.664 | 100.00 | 92.90 | $6.41\times {10}^{-8}$ | 0.964 | $100.00$ | $93.32$ | $0.00\times {10}^{0}$ | $1.790$ | $100.00$ | 92.58 |

Nadam | $7.95\times {10}^{-11}$ | 1.662 | 100.00 | 93.00 | $3.61\times {10}^{-6}$ | 0.755 | $100.00$ | $93.01$ | $2.46\times {10}^{-7}$ | $1.770$ | $100.00$ | $91.62$ |

(a) CIFAR-10 | |||||
---|---|---|---|---|---|

Model | Loss | Accuracy (%) | |||

Train | Test | Train | Test | ||

1. | NAG | $5.64\times {10}^{-7}$ | 1.516 | 100.00 | 83.68 |

+ BatchNorm | $2.34\times {10}^{-5}$ | 0.728 | 100.00 | 86.45 | |

2. | Adam | $2.00\times {10}^{-10}$ | $3.893$ | $100.00$ | 79.84 |

+ BatchNorm | $8.96\times {10}^{-5}$ | 2.203 | 100.00 | 82.89 | |

3. | NAG | $5.87\times {10}^{-7}$ | 2.007 | $100.00$ | 80.40 |

+ BatchNorm | $2.25\times {10}^{-6}$ | 1.633 | $100.00$ | 81.21 | |

(b) Fashion-MNIST | |||||

Model | Loss | Accuracy (%) | |||

Train | Test | Train | Test | ||

1. | Adam | $5.84\times {10}^{-4}$ | 0.920 | 100.00 | 93.27 |

+ BatchNorm | $1.70\times {10}^{-3}$ | 0.405 | 99.96 | 93.25 | |

2. | Adam | $1.27\times {10}^{-10}$ | $1.029$ | $100.00$ | 93.53 |

+ BatchNorm | $6.10\times {10}^{-5}$ | 0.455 | 100.00 | 93.61 | |

3. | AdaMax | $0.00\times {10}^{0}$ | $1.790$ | $100.00$ | 92.58 |

+ BatchNorm | $1.20\times {10}^{-3}$ | 0.724 | $99.96$ | 91.85 |

(a) CIFAR-10 | |||||
---|---|---|---|---|---|

Model | Loss | Accuracy (%) | |||

Train | Test | Train | Test | ||

1. | NAG | $5.64\times {10}^{-7}$ | 1.516 | 100.00 | 83.68 |

${L}_{2}$, $\lambda =5\times {10}^{-6}$ | $1.26\times {10}^{-2}$ | 13.615 | 100.00 | 83.33 | |

${L}_{1}$, $\lambda ={10}^{-6}$ | $2.99\times {10}^{-2}$ | 1.477 | 100.00 | 83.83 | |

2. | Adam | $2.00\times {10}^{-10}$ | $3.893$ | $100.00$ | 79.84 |

${L}_{2}$, $\lambda ={10}^{-5}$ | $2.90\times {10}^{-2}$ | 1.483 | 100.00 | 79.94 | |

${L}_{1}$, $\lambda ={10}^{-6}$ | $9.27\times {10}^{-2}$ | 1.339 | 100.00 | 76.40 | |

3. | NAG | $5.87\times {10}^{-7}$ | 2.007 | $100.00$ | 80.40 |

${L}_{2}$, $\lambda ={10}^{-5}$ | $7.76\times {10}^{-2}$ | 1.569 | 100.00 | 80.70 | |

${L}_{1}$, $\lambda ={10}^{-6}$ | $2.15\times {10}^{-1}$ | 1.650 | 100.00 | 79.99 | |

(b) Fashion-MNIST | |||||

Model | Loss | Accuracy (%) | |||

Train | Test | Train | Test | ||

1. | Adam | $5.84\times {10}^{-4}$ | 0.920 | 100.00 | 93.27 |

${L}_{2}$, $\lambda =5\times {10}^{-6}$ | $2.90\times {10}^{-2}$ | 0.593 | 100.00 | 92.96 | |

${L}_{1}$, $\lambda =5\times {10}^{-6}$ | $5.37\times {10}^{-2}$ | 0.515 | 99.66 | 92.05 | |

2. | Adam | $1.27\times {10}^{-10}$ | $1.029$ | $100.00$ | 93.53 |

${L}_{2}$, $\lambda =5\times {10}^{-5}$ | $3.64\times {10}^{-2}$ | 0.462 | 99.69 | 92.11 | |

${L}_{1}$, $\lambda ={10}^{-6}$ | $1.51\times {10}^{-2}$ | 0.670 | 100.00 | 93.20 | |

3. | AdaMax | $0.00\times {10}^{0}$ | $1.790$ | $100.00$ | 92.58 |

${L}_{2}$, $\lambda ={10}^{-6}$ | $2.50\times {10}^{-2}$ | 0.843 | 100.00 | 92.39 | |

${L}_{1}$, $\lambda ={10}^{-6}$ | $1.61\times {10}^{-2}$ | 0.581 | 99.98 | 91.99 |

${\mathit{L}}_{2}$ | ${\mathit{L}}_{1}$ | NAG | |
---|---|---|---|

min | $4.8\times {10}^{-8}$ | 0 | $5.57\times {10}^{-8}$ |

$1st$ quartile | $7.49\times {10}^{-5}$ | $7.02\times {10}^{-10}$ | $1.34\times {10}^{-4}$ |

median | $1.47\times {10}^{-4}$ | $2.12\times {10}^{-9}$ | $2.73\times {10}^{-4}$ |

$3rd$ quartile | $2.22\times {10}^{-4}$ | $3.54\times {10}^{-9}$ | $2.74\times {10}^{-4}$ |

max | 3.5251 | 6.6516 | 2.3028 |

mean | 0.0216 | 0.0076 | 0.0349 |

std | 0.0275 | 0.0426 | 0.0321 |

(a) CIFAR-10 | |||||
---|---|---|---|---|---|

Model | Loss | Accuracy (%) | |||

Train | Test | Train | Test | ||

1. | NAG | $5.64\times {10}^{-7}$ | 1.516 | 100.00 | 83.68 |

Noise, $\sigma =0.05$ | $1.01\times {10}^{-2}$ | 19.875 | 99.74 | 80.70 | |

2. | Adam | $2.00\times {10}^{-10}$ | 3.893 | $100.00$ | 79.84 |

Noise, $\sigma =0.01$ | $9.30\times {10}^{-3}$ | 4.305 | 99.77 | 78.57 | |

3. | NAG | $5.87\times {10}^{-7}$ | 2.007 | $100.00$ | 80.40 |

Noise, $\sigma =0.01$ | $6.17\times {10}^{-3}$ | 2.111 | $100.00$ | 80.42 | |

(b) Fashion-MNIST | |||||

Model | Loss | Accuracy (%) | |||

Train | Test | Train | Test | ||

1. | Adam | $5.84\times {10}^{-4}$ | 0.920 | 100.00 | 93.27 |

Noise, $\sigma =0.01$ | $1.92\times {10}^{-1}$ | 0.649 | 100.00 | 92.92 | |

2. | Adam | $1.27\times {10}^{-10}$ | 1.029 | $100.00$ | 93.53 |

Noise, $\sigma =0.01$ | $5.40\times {10}^{-3}$ | 11.151 | $99.86$ | 92.63 | |

3. | AdaMax | $0.00\times {10}^{0}$ | $1.790$ | $100.00$ | 92.58 |

Noise, $\sigma =0.05$ | $2.90\times {10}^{-3}$ | 0.910 | $99.95$ | 90.95 |

(a) CIFAR-10 | ||||
---|---|---|---|---|

Model | Loss | Accuracy (%) | ||

Train | Test | Train | Test | |

1. NAG | $5.64\times {10}^{-7}$ | 1.516 | 100.00 | 83.68 |

Dropout | $8.10\times {10}^{-2}$ | 0.754 | 97.31 | 86.54 |

Dropout + inputs | $1.11\times {10}^{-1}$ | 0.700 | 96.51 | 85.38 |

Dropout + BatchNorm | $2.00\times {10}^{-2}$ | 0.748 | 99.34 | 88.25 |

2. Adam | $2.00\times {10}^{-10}$ | $3.893$ | $100.00$ | 79.84 |

Dropout | $3.49\times {10}^{-2}$ | 0.728 | 98.89 | 83.64 |

Dropout + inputs | $3.90\times {10}^{-2}$ | 1.128 | 98.72 | 80.77 |

Dropout + BatchNorm | $9.9\times {10}^{-3}$ | 0.904 | 99.66 | 86.33 |

3. NAG | $5.87\times {10}^{-7}$ | 2.007 | $100.00$ | 80.40 |

Dropout | $1.08\times {10}^{-2}$ | 1.099 | 99.63 | 80.50 |

Dropout + inputs | $1.22\times {10}^{-2}$ | 1.056 | 99.63 | 81.26 |

Dropout + BatchNorm | $1.22\times {10}^{-2}$ | 522.8 | 99.59 | 67.22 |

(b) Fashion-MNIST

| Model | Train Loss | Test Loss | Train Acc. (%) | Test Acc. (%) |
|---|---|---|---|---|
| 1. Adam | $5.84\times {10}^{-4}$ | 0.920 | 100.00 | 93.27 |
| Dropout | $3.41\times {10}^{-2}$ | 0.471 | 98.79 | 93.59 |
| Dropout + inputs | $6.76\times {10}^{-2}$ | 0.346 | 97.55 | 93.32 |
| Dropout + BatchNorm | $1.13\times {10}^{-2}$ | 0.412 | 99.66 | 93.84 |
| 2. Adam | $1.27\times {10}^{-10}$ | 1.029 | 100.00 | 93.53 |
| Dropout | $1.46\times {10}^{-2}$ | 0.485 | 99.55 | 93.88 |
| Dropout + inputs | $2.11\times {10}^{-2}$ | 0.465 | 99.27 | 93.21 |
| Dropout + BatchNorm | $6.60\times {10}^{-3}$ | 0.434 | 99.77 | 94.25 |
| 3. AdaMax | $0.00$ | 1.790 | 100.00 | 92.58 |
| Dropout | $1.18\times {10}^{-2}$ | 0.576 | 99.60 | 92.01 |
| Dropout + inputs | $1.18\times {10}^{-2}$ | 0.638 | 99.62 | 92.04 |
| Dropout + BatchNorm | $8.10\times {10}^{-3}$ | 0.612 | 99.75 | 92.12 |
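The Dropout rows above zero hidden units at random during training (Section 3.4); the "+ inputs" variant also drops input units. A minimal NumPy sketch of the inverted-dropout formulation commonly used in practice (the activation shape, seed, and rate are illustrative, not the experimental settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: each unit is zeroed with probability `rate`,
    and surviving units are scaled by 1/(1 - rate) so the expected
    activation matches the test-time (no-dropout) network."""
    if not training:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

acts = np.ones(1000)
dropped = dropout(acts, rate=0.5)
zero_frac = (dropped == 0).mean()   # roughly `rate` of the units are zeroed
mean_act = dropped.mean()           # stays near 1.0 in expectation
```

Because of the rescaling, inference uses the network unchanged (`training=False`), which approximates averaging the exponentially many shared-weight subnetworks described in Section 3.4.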

(a) CIFAR-10

| Model | Train Loss | Test Loss | Train Acc. (%) | Test Acc. (%) |
|---|---|---|---|---|
| 1. NAG | $5.64\times {10}^{-7}$ | 1.516 | 100.00 | 83.68 |
| DataAugm | $4.51\times {10}^{-2}$ | 0.887 | 98.48 | 87.73 |
| DataAugm + BatchNorm | $1.57\times {10}^{-2}$ | 0.748 | 99.49 | 89.54 |
| DataAugm + Dropout | $1.61\times {10}^{-1}$ | 0.640 | 94.74 | 89.27 |
| DataAugm + BatchNorm + Dropout | $6.38\times {10}^{-2}$ | 0.655 | 97.85 | 89.15 |
| 2. Adam | $2.00\times {10}^{-10}$ | 3.893 | 100.00 | 79.84 |
| DataAugm | $8.06\times {10}^{-2}$ | 0.728 | 97.33 | 86.99 |
| DataAugm + BatchNorm | $3.80\times {10}^{-2}$ | 0.673 | 98.74 | 88.54 |
| DataAugm + Dropout | $1.54\times {10}^{-1}$ | 0.649 | 94.74 | 86.10 |
| DataAugm + BatchNorm + Dropout | $9.06\times {10}^{-2}$ | 0.519 | 96.78 | 88.67 |
| 3. NAG | $5.87\times {10}^{-7}$ | 2.007 | 100.00 | 80.40 |
| DataAugm | $1.88\times {10}^{-2}$ | 0.830 | 99.30 | 86.98 |
| DataAugm + BatchNorm | $1.66\times {10}^{-2}$ | 0.966 | 99.41 | 85.00 |
| DataAugm + Dropout | $5.72\times {10}^{-2}$ | 0.713 | 98.04 | 86.19 |
| DataAugm + BatchNorm + Dropout | $2.48\times {10}^{-1}$ | 12723 | 94.88 | 80.58 |

(b) Fashion-MNIST

| Model | Train Loss | Test Loss | Train Acc. (%) | Test Acc. (%) |
|---|---|---|---|---|
| 1. Adam | $5.84\times {10}^{-4}$ | 0.920 | 100.00 | 93.27 |
| DataAugm | $1.18\times {10}^{-2}$ | 0.634 | 99.59 | 92.58 |
| DataAugm + BatchNorm | $5.30\times {10}^{-3}$ | 0.431 | 99.81 | 93.49 |
| DataAugm + Dropout | $8.93\times {10}^{-3}$ | 0.267 | 96.82 | 93.92 |
| DataAugm + BatchNorm + Dropout | $3.79\times {10}^{-2}$ | 0.283 | 98.62 | 94.53 |
| 2. Adam | $1.27\times {10}^{-10}$ | 1.029 | 100.00 | 93.53 |
| DataAugm | $7.70\times {10}^{-3}$ | 0.726 | 99.77 | 93.88 |
| DataAugm + BatchNorm | $4.00\times {10}^{-3}$ | 0.443 | 99.87 | 94.17 |
| DataAugm + Dropout | $4.12\times {10}^{-2}$ | 0.312 | 98.59 | 94.32 |
| DataAugm + BatchNorm + Dropout | $2.09\times {10}^{-2}$ | 0.291 | 99.26 | 94.60 |
| 3. AdaMax | $0.00$ | 1.790 | 100.00 | 92.58 |
| DataAugm | $4.40\times {10}^{-3}$ | 0.758 | 99.88 | 92.78 |
| DataAugm + BatchNorm | $2.80\times {10}^{-3}$ | 0.508 | 99.91 | 92.77 |
| DataAugm + Dropout | $5.71\times {10}^{-2}$ | 0.372 | 97.96 | 92.54 |
| DataAugm + BatchNorm + Dropout | $3.55\times {10}^{-2}$ | 0.409 | 98.72 | 92.20 |
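The DataAugm rows above enlarge the effective training set with label-preserving transforms of the images. A minimal NumPy sketch of two such transforms, random horizontal flip and a small random translation (the exact transforms and parameters used in the experiments are an assumption here):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, max_shift=2):
    """Label-preserving augmentation: random horizontal flip plus a
    random circular shift of at most `max_shift` pixels per axis."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                       # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(image, dy, axis=0), dx, axis=1)

img = np.arange(64).reshape(8, 8)
aug = augment(img)   # same shape and label, different view of the image
```

Each epoch then sees a slightly different version of every training image, which discourages the memorization visible in the near-zero train losses above.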

**Table 13.** Bagging results for the best models on: (a) the CIFAR-10 dataset (NAG model incorporating Data Augmentation and Batch Normalization); and (b) the Fashion-MNIST dataset (Adam model incorporating Data Augmentation, Batch Normalization, and Dropout).

**Table 14.** Ensemble of models with different architectures and incorporated regularization techniques.
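The ensembles of Tables 13 and 14 combine the predictions of several trained models; one standard combination rule is to average the models' class-probability outputs and predict the highest-scoring class. A minimal NumPy sketch with hypothetical probabilities (the actual ensemble sizes and combination rule of the experiments are not reproduced here):

```python
import numpy as np

def ensemble_predict(model_probs):
    """Average class-probability outputs over models (axis 0) and take
    the argmax per sample, as in bagging / model averaging."""
    return np.mean(model_probs, axis=0).argmax(axis=1)

# Three hypothetical models, predictions for 2 samples over 3 classes
p1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
p2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]])
p3 = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]])
labels = ensemble_predict(np.stack([p1, p2, p3]))
print(labels)  # [0 1]
```

Averaging tends to cancel the uncorrelated errors of the individual models, which is why the ensemble rows outperform their single-model counterparts.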


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Marin, I.; Kuzmanic Skelin, A.; Grujic, T.
Empirical Evaluation of the Effect of Optimization and Regularization Techniques on the Generalization Performance of Deep Convolutional Neural Network. *Appl. Sci.* **2020**, *10*, 7817.
https://doi.org/10.3390/app10217817
