Empirical Evaluation of the Effect of Optimization and Regularization Techniques on the Generalization Performance of Deep Convolutional Neural Network

Abstract: The main goal of any classification or regression task is to obtain a model that generalizes well on new, previously unseen data. Due to the recent rise of deep learning and the many state-of-the-art results obtained with deep models, deep learning architectures have become some of the most widely used model architectures today. To generalize well, a deep model needs to learn the training data well without overfitting. The latter implies a correlation of deep model optimization and regularization with generalization performance. In this work, we explore the effect of the optimization algorithm and regularization techniques used on the final generalization performance of models with the convolutional neural network (CNN) architecture widely used in the field of computer vision. We give a detailed overview of optimization and regularization techniques with a comparative analysis of their performance with three CNNs on the CIFAR-10 and Fashion-MNIST image datasets.


Introduction
The state-of-the-art results in different fields, such as computer vision [1,2], speech recognition [3] and natural language processing [4,5], are obtained using deep neural networks. Deep neural networks have high representative capacity; trained on a large dataset, they can automatically learn complex relations between raw input data and given output. The high representative capacity of deep models comes with the cost of overfitting: fitting the available training data too well, i.e., memorizing the training data along with the noise contained in them and failing to generalize on new, unseen data. Zhang et al. [6] showed that deep neural networks can easily fit data with random labels (achieving 100% accuracy on the training set). In this case, there is no apparent connection between input data and target labels that the model needs to learn, and yet it succeeds in fitting the given training data perfectly.
The main goal is to obtain a model that generalizes well. The generalization error of a deep learning model refers to the expected prediction error on new data. Because the generalization error is not directly accessible, in practice, it is estimated using the misclassification rate on an independent test set not used during training. To obtain a low generalization error, the model needs to learn the available data without overfitting. Optimization and regularization are two significant parts of deep learning research that play an essential role in the final performance of a deep model. Optimization considers the different methods and algorithms used for model training, i.e., for learning the underlying mapping from inputs to outputs by choosing the right set of parameters that will reduce the error on the training data. Regularization, on the other hand, is focused on preventing overfitting to the training data by adding penalties or constraints on a model that incorporate some prior knowledge of the underlying mapping or a preference toward a specific class of models. The term regularization has been used quite freely to denote any technique that aims to enhance model performance on the test data. This work aims to provide the reader with a deeper understanding of commonly used optimization algorithms and regularization techniques by giving the necessary theoretical background and a systematic overview for both, together with empirical evaluations and analysis of their effect on the training process and the final generalization performance of the model.
The performance of convolutional neural networks with respect to optimization algorithms and regularization techniques has been investigated in a number of works. Many variations of the reported results are related to the different optimizers and regularizing approaches taken under consideration, or their combinations, and to different model architectures and datasets. The studies in [7][8][9][10][11] are examples of works where existing optimization algorithms are reviewed, compared and evaluated from different perspectives. The reported results show that the effect of optimization differs not only with the selection of the optimization algorithm but also with the problem under consideration. In parallel, the works in [12][13][14][15][16][17] are some of the representative literature on regularization techniques, ranging from studies of their influence in deep learning models to taxonomy definition and review. Smirnov et al. [12] compared three regularization techniques, Dropout, Generalized Dropout and Data Augmentation, and demonstrated improvements on the ImageNet classification task. Another work that deals with the comparison of regularization techniques in deep neural networks is that of Nusrat and Jang [13], who reported that models using regularization techniques such as Data Augmentation and Batch Normalization exhibit improved performance against the baseline on a weather prediction task. In our work, optimization and regularization are considered complementary techniques worth deeper investigation whenever a network is developed for a particular case. A work with a similar idea to ours is the empirical study of Garbin et al. [14], who investigated the behavior of Dropout and Batch Normalization with respect to two optimizers, SGD and RMSProp, on a single network, reporting favorable results with Batch Normalization but not with Dropout on a CNN.
The difference in our work is that our empirical evaluation studies a broad set of methods: we empirically evaluate the effect of nine optimizers, Batch Normalization and six regularization techniques with three CNNs on two image datasets, CIFAR-10 [18] and Fashion-MNIST [19].
The rest of the paper is structured in three sections as follows. Section 2 gives a theoretical background and systematic overview of different optimization algorithms used for training deep neural networks together with the Batch Normalization [20] technique. In Section 3, an overview of different regularization methods is given. Section 4 provides a comparative analysis of different optimization algorithms and regularization methods on the image classification problem supplemented with appropriate visualizations that give a deeper insight into the effect of each method (or their combination) on the training process and generalization performance. In Section 5, concluding remarks are given.

Optimization
Neural network training is an optimization problem with a non-convex objective function J: the minimization problem min_θ J(θ; D_train). During the training process, model parameters θ are iteratively updated in order to reduce the cost J on the training data D_train. In subsequent sections, we use bold symbols such as θ for vector quantities and regular ones for scalars. The most commonly used stopping criterion for the iterative training process is a predefined number of passes through all available training data, i.e., epochs. One epoch often consists of multiple iterations.
There are various optimization algorithms used for training neural networks, which differ in the way they update network parameters. We describe the most commonly used optimization algorithms in the subsections below. To date, there is no clear answer nor consensus on which optimization algorithm is universally the best. Two metrics often used to evaluate the efficiency of an optimization algorithm are:
• Speed of convergence: the time needed for the algorithm to reach the optimal value
• Generalization: the model performance on new data
The optimization of deep neural networks comes with many challenges. One of them is a highly non-convex objective function J with numerous suboptimal local minima and saddle points. Other challenges include the high dimensionality of the search space (deep models often have millions of parameters to learn) and the choice of appropriate values for the hyperparameters of the model. We overview classical and adaptive optimization algorithms commonly used to optimize neural networks' cost in the following subsections. A summary of the update rules of the overviewed optimization algorithms can be found in Appendix A.

Classical Iterative Optimization Algorithms
The main idea behind all optimization algorithms is to update parameters in the direction of the negative gradient −∇_θ J(θ; D_train), the direction of steepest descent. In each iteration, parameters are updated by

θ ← θ − η∇_θ J(θ; D_train),

where η > 0 is a hyperparameter called the learning rate, which controls the amount of update.
In the sections that follow, we denote the parameters in iteration t ∈ N by θ_t, while θ_0 denotes the initial parameters of the model, which are usually small random numbers drawn from a normal or uniform distribution with zero mean.

Stochastic Gradient Descent (SGD)
In iteration t, an approximation of the gradient ∇_θ J(θ_{t−1}; D_train) is calculated using a mini-batch D ⊆ D_train of training data and then used to modify the parameters from the previous time step t − 1 according to the update rule [21]

θ_t = θ_{t−1} − η∇_θ J(θ_{t−1}; D).
(In the literature, the term stochastic gradient descent is often used for the variant of gradient descent in which one training example is used for the approximation of the gradient. When the approximation of the gradient is calculated on a mini-batch of training examples, the term mini-batch gradient descent is used. Here, the term SGD refers to mini-batch gradient descent, as is the case in most deep learning frameworks.) In the rest of the article, ∇J(θ_{t−1}) denotes the approximation ∇_θ J(θ_{t−1}; D).
The choice of the learning rate η plays a crucial role in the convergence of SGD. A learning rate that is too small results in slow learning, while one that is too high can lead to divergence. When SGD gets very close to a local optimum, the parameter values sometimes oscillate back and forth around the optimum. It also takes SGD a lot of time to navigate flat regions, which are common around local optima where the gradient is close to zero. These problems led to the development of new optimization algorithms that incorporate a momentum term.
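To make the update rule concrete, here is a minimal NumPy sketch on a toy quadratic cost (an illustration only, not the tf.keras training used in the experiments; the cost J(θ) = 0.5‖θ‖² and its gradient are our assumed toy example):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    # theta_t = theta_{t-1} - eta * grad
    return theta - lr * grad

# Toy cost J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, grad=theta)
```

With a well-chosen learning rate the iterates shrink geometrically toward the minimum at the origin; setting `lr` too large in this toy example makes them diverge instead.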

Momentum
Adding a momentum term m to classical stochastic gradient descent helps to accelerate learning in relevant directions and reduce oscillations during training by slowing down along dimensions where the gradient is inconsistent, i.e., dimensions where the sign of the gradient often changes. The momentum [22] update rule is given by

m_t = γm_{t−1} − η∇J(θ_{t−1}),
θ_t = θ_{t−1} + m_t,

where γ ∈ [0, 1) is the decay constant. By setting γ to 0, we get classical SGD without momentum. In iteration t, the parameter update is equal to

m_t = −η Σ_{i=1}^{t} γ^{t−i} ∇J(θ_{i−1}).

From this expanded form, it can be seen that the update in iteration t takes into account all gradients calculated so far, with more weight put on the recent ones. As t increases, we have less and less trust in the gradients calculated in iterations at the beginning of the training.
The ith component of the vector m, which corresponds to the update made to parameter i of the given network, accumulates speed when the partial derivatives ∂_i J point in the same direction and slows down when they point in different directions. This property helps momentum escape more quickly from flat regions where the gradient is close to zero but often points in the same direction. The accumulated speed sometimes leads to overshooting the local minimum, which results in many oscillations back and forth around the minimum before convergence.
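The two-line update above can be sketched as follows (a toy NumPy example on a 1-D quadratic, chosen by us for illustration; the oscillation around the minimum mentioned in the text is visible in the trajectory):

```python
import numpy as np

def momentum_step(theta, m, grad, lr=0.05, gamma=0.9):
    # m_t = gamma * m_{t-1} - eta * grad;  theta_t = theta_{t-1} + m_t
    m = gamma * m - lr * grad
    return theta + m, m

theta, m = np.array([5.0]), np.zeros(1)
history = []
for _ in range(200):
    theta, m = momentum_step(theta, m, grad=theta)  # J = 0.5 * theta^2
    history.append(theta[0])
```

The accumulated velocity carries the iterate past the minimum at 0 (the trajectory changes sign), after which it spirals in, matching the overshooting behavior described above.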

Nesterov Accelerated Gradient Descent (NAG)
Momentum's update m_t can be interpreted as a two-step movement. First, we move according to the decayed update history γm_{t−1}, and then we make a step in the direction of the current gradient calculated using the parameters θ_{t−1} from iteration t − 1. If we know that we will move in the direction of the history γm_{t−1}, then we can make that movement first and calculate the gradient at the point θ_{t−1} + γm_{t−1} in which we arrive, instead of calculating the gradient at the point θ_{t−1} from the previous iteration. The formal update rule for Nesterov accelerated gradient [23] (NAG) is given by

m_t = γm_{t−1} − η∇J(θ_{t−1} + γm_{t−1}) (look ahead),
θ_t = θ_{t−1} + m_t (apply update).
When overshooting of a local minimum due to accumulated speed happens, the look-ahead gradient evaluation helps NAG correct its course more quickly than regular momentum.
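The only change relative to the plain momentum sketch is where the gradient is evaluated, which a few lines of toy NumPy code make explicit (again an illustrative quadratic, not the experimental setup):

```python
import numpy as np

def nag_step(theta, m, grad_fn, lr=0.05, gamma=0.9):
    # Look ahead along the history term, then evaluate the gradient there:
    # m_t = gamma * m_{t-1} - eta * grad(J at theta_{t-1} + gamma * m_{t-1})
    m = gamma * m - lr * grad_fn(theta + gamma * m)
    return theta + m, m

theta, m = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, m = nag_step(theta, m, grad_fn=lambda t: t)  # J = 0.5 * theta^2
```

Because the gradient already "sees" the overshoot at the look-ahead point, the correction arrives one step earlier than with classical momentum.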

Adaptive Learning Rate Optimizers
While the previously presented optimization algorithms use the same learning rate to modify all parameters of the model, newer optimization algorithms developed since the 2010s seek to upgrade this original behavior of SGD by allowing the algorithm to adaptively change the learning rate per parameter during the training process. A brief overview of the most commonly used adaptive optimizers is given below.

Adagrad
The Adagrad optimization algorithm was first introduced by Duchi et al. [24]. It implements parameter-specific learning rates: the learning rates of parameters that are updated frequently are smaller, and those of parameters that are updated infrequently are larger. The update rule for Adagrad is given by

v_t = v_{t−1} + (∇J(θ_{t−1}))²,
θ_t = θ_{t−1} − (η / (√v_t + ε)) • ∇J(θ_{t−1}),

where • denotes the Hadamard (element-wise) product and (∇J(θ_{t−1}))² denotes the element-wise square of the given gradient. The ith component of the vector η / (√v_t + ε) corresponds to the learning rate used to update parameter i in iteration t. The main weakness of the Adagrad optimizer is the constant growth of the accumulator v during the whole training process: in each iteration, the corresponding squared (and therefore non-negative) partial derivative of the cost function J is added to the ith coordinate, which eventually results in vanishingly small learning rates (≈0) that essentially stop the training process.
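The ever-growing accumulator is easy to see in a toy NumPy sketch (our illustrative quadratic again; hyperparameter values are arbitrary):

```python
import numpy as np

def adagrad_step(theta, v, grad, lr=0.5, eps=1e-8):
    # The accumulator v only ever grows, so the per-parameter learning
    # rates lr / (sqrt(v) + eps) shrink monotonically over training.
    v = v + grad ** 2
    return theta - lr * grad / (np.sqrt(v) + eps), v

theta, v = np.array([3.0]), np.zeros(1)
for _ in range(500):
    theta, v = adagrad_step(theta, v, grad=theta)  # J = 0.5 * theta^2
```

On this convex toy problem the shrinking step size is harmless, but on a long deep-learning run the same monotone growth of `v` is what eventually stalls training.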

Adadelta
The Adadelta [25] optimization algorithm tries to correct the diminishing learning rate problem of the Adagrad algorithm by accumulating the squared gradients over a fixed-size window instead of using gradients from all previous iterations. Instead of inefficiently storing all previous squared gradients from the current window, Adadelta implements the accumulator as an exponentially decaying average of squared gradients.
The updates are computed as

E[g²]_t = ρE[g²]_{t−1} + (1 − ρ)(∇J(θ_{t−1}))²,
∆θ_t = −(RMS[∆θ]_{t−1} / RMS[g]_t) • ∇J(θ_{t−1}),
E[∆θ²]_t = ρE[∆θ²]_{t−1} + (1 − ρ)(∆θ_t)²,
θ_t = θ_{t−1} + ∆θ_t,

where RMS[x]_t = √(E[x²]_t + ε) and ρ is the decay constant. The constant ε is added in the denominator to condition it better, and in the numerator to ensure that the first update ∆θ_1 ≠ 0 and that progress is made when the update accumulator E[∆θ²] becomes small. It should be noted that the Adadelta optimization algorithm does not use the learning rate η. Instead, the size of the update made to parameter i in iteration t is controlled by the ith component of the vector RMS[∆θ]_{t−1} / RMS[g]_t, which can be viewed as the quotient of the RMS of updates and the RMS of gradients up to time t.
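A single step of the rule above, written as a NumPy sketch (our toy example; ρ = 0.95 and ε = 10⁻⁶ are assumed defaults), shows both the absence of a learning rate and the role of ε in the first update:

```python
import numpy as np

def adadelta_step(theta, eg2, ed2, grad, rho=0.95, eps=1e-6):
    # Decaying average of squared gradients (a window, not the full history).
    eg2 = rho * eg2 + (1 - rho) * grad ** 2
    # Step scaled by RMS of past updates over RMS of gradients;
    # note that no global learning rate eta appears anywhere.
    delta = -np.sqrt(ed2 + eps) / np.sqrt(eg2 + eps) * grad
    ed2 = rho * ed2 + (1 - rho) * delta ** 2
    return theta + delta, eg2, ed2

# One step from scratch: the eps in the numerator makes the first update nonzero.
theta, eg2, ed2 = adadelta_step(np.array([3.0]), np.zeros(1), np.zeros(1),
                                grad=np.array([3.0]))
```

With both accumulators starting at zero, the very first step would be exactly zero without ε; with it, a small but nonzero update is made and the accumulators begin to fill.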

RMSProp
The RMSProp algorithm [26], shown on Hinton's slides for his Coursera class, was developed independently of Adadelta around the same time. RMSProp tries to solve Adagrad's indefinite accumulation of squared gradients by replacing the accumulator v with an exponentially weighted moving average, which gradually replaces older squared gradients with newer ones according to the update rule

v_t = βv_{t−1} + (1 − β)(∇J(θ_{t−1}))²,
θ_t = θ_{t−1} − (η / (√v_t + ε)) • ∇J(θ_{t−1}).

Hinton suggests 0.9 as a good default value for β and 0.001 for η. A version of RMSProp with added momentum was used in [27]; with momentum, the RMSProp update rule becomes

m_t = γm_{t−1} − (η / (√v_t + ε)) • ∇J(θ_{t−1}),
θ_t = θ_{t−1} + m_t.
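The momentum-free variant can be sketched in a few lines of toy NumPy code (illustrative only; we use a larger η than Hinton's default so the toy problem converges in few steps):

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=0.05, beta=0.9, eps=1e-8):
    # Exponential moving average replaces Adagrad's ever-growing sum,
    # so old squared gradients are gradually forgotten.
    v = beta * v + (1 - beta) * grad ** 2
    return theta - lr * grad / (np.sqrt(v) + eps), v

theta, v = np.array([3.0]), np.zeros(1)
for _ in range(500):
    theta, v = rmsprop_step(theta, v, grad=theta)  # J = 0.5 * theta^2
```

Because `v` tracks the recent gradient magnitude, the effective step is roughly η regardless of the gradient's scale, so the iterate settles into a small band around the minimum rather than converging exactly.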

Adam
The Adam (Adaptive Moment Estimation) optimizer introduced in [28] can be viewed as a "tweaked" RMSProp optimizer with added momentum. There are two main differences between RMSProp with momentum and Adam:

• Estimates of the first moment and second raw moment, i.e., accumulation variables m and v, respectively, used for the parameter update in Adam are calculated using an exponential moving average.
• Adam includes initialization bias-correction terms for the first and second moment estimates, which in early iterations are biased towards 0 due to their initialization to the zero vector.
Adam update rule is given below.
m_0 = 0, v_0 = 0 (initialize 1st and 2nd moment estimates)
m_t = β_1 m_{t−1} + (1 − β_1)∇J(θ_{t−1})
v_t = β_2 v_{t−1} + (1 − β_2)(∇J(θ_{t−1}))²
m̂_t = m_t / (1 − β_1^t), v̂_t = v_t / (1 − β_2^t) (bias correction)
θ_t = θ_{t−1} − η m̂_t / (√v̂_t + ε)

Adam and classical momentum are the two most widely used optimizers, appearing in many papers that report state-of-the-art results in different fields.
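The full rule, including the bias-correction terms, fits in a short NumPy sketch (our toy quadratic again; β₁ = 0.9, β₂ = 0.999 follow the defaults suggested in [28], while the learning rate is chosen for the toy problem):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # 1st moment estimate
    v = b2 * v + (1 - b2) * grad ** 2        # 2nd raw moment estimate
    m_hat = m / (1 - b1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, m, v = np.array([3.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, m, v, grad=theta, t=t)
```

Without the division by (1 − β₁ᵗ) and (1 − β₂ᵗ), the zero-initialized moment estimates would make the first updates far too small; with it, even the very first step has the intended magnitude of roughly η.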

Batch Normalization
Although it is not an optimization algorithm, the Batch Normalization [20] method is one of the most significant innovations in deep model optimization in recent years. It stabilizes the learning process by reducing changes in the distribution of hidden layers' input data caused by the constant changes made to the parameters of previous layers. The idea is to add new normalizing layers that transform the data during training in order to avoid unwanted changes in distribution.
The input to a given unit is normalized for each mini-batch D. Let z_i be the input to the given unit that corresponds to the ith example of a mini-batch D of size m. Normalization of z_i is done as follows:

ẑ_i = (z_i − μ_D) / √(σ²_D + ε),

where μ_D = (1/m)Σ_{i=1}^{m} z_i and σ²_D = (1/m)Σ_{i=1}^{m} (z_i − μ_D)² are the mean and variance estimates for mini-batch D, and ε is a positive constant added for numerical stability. An additional linear transformation

z̃_i = γẑ_i + β

is applied to keep the expressive power of the hidden units. The new parameters γ and β, which are also learned during training (initialized with γ = 1, β = 0), enable the normalized data to have any mean and variance. During the test phase, an exponential moving average of the mean and variance values calculated during training is used.
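The training-mode forward pass can be sketched as follows (a NumPy illustration of the formulas above only; the test-phase moving averages and the backward pass are omitted, and the batch statistics here are our synthetic example):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                 # per-feature mini-batch mean
    var = z.var(axis=0)                 # per-feature mini-batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta         # learned scale and shift

rng = np.random.default_rng(0)
z = rng.normal(5.0, 2.0, size=(128, 4))    # mini-batch of 128, 4 features
out = batch_norm_forward(z, gamma=np.ones(4), beta=np.zeros(4))
```

With γ = 1 and β = 0 the output has (approximately) zero mean and unit variance per feature; during training the network is free to learn other values of γ and β, recovering any mean and variance it needs.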

Regularization
To prevent overfitting of the model to the training data, different regularization techniques are used. In [31], regularization is defined as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error". There is a wide range of methods that are considered regularization methods. Some of the most commonly used ones are L2 weight decay, Dropout, Data Augmentation and Early Stopping.
L2 Regularization

L2 regularization, also known as weight decay, is a regularization technique that adds a parameter norm penalty to the cost function J. The new, regularized cost function J̃ used for training is given by

J̃(θ; D) = J(θ; D) + λ‖θ‖₂²,

where λ ∈ [0, ∞) is the regularization parameter that controls the strength of regularization and D is a mini-batch of training data. During the training process, the minimization of J̃ results in a decrease of both the original cost J and the ‖·‖₂ norm of the model parameters. The gradient of the regularized cost is ∇J̃(θ) = ∇J(θ) + 2λθ, so the gradient descent update in iteration t becomes

θ_t = θ_{t−1} − η(∇J(θ_{t−1}) + 2λθ_{t−1}) = (1 − 2ηλ)θ_{t−1} − η∇J(θ_{t−1}),

which decays the parameters by a constant factor, proportionally to their size, in each iteration. Penalizing parameters proportionally to their size results in a model with smaller, more dispersed parameters. In this way, the model is encouraged to use all input values a little bit instead of focusing only on the few with large corresponding weights. The other intuition behind L2 regularization is that the penalty imposes a prior on the complexity of the model. By penalizing large parameters, we obtain a less complex model that reduces overfitting due to its inability to memorize all training data.

L1 Regularization
Another, less common type of weight penalty is the ‖·‖₁ penalty used in L1 regularization, which results in a model with sparse parameters. Incorporating the norm penalty term into the cost function J gives the regularized cost function

J̃(θ; D) = J(θ; D) + λ‖θ‖₁,

with gradient ∇J̃(θ) = ∇J(θ) + λ sign(θ), where the sign function is applied element-wise to the parameter vector θ. The update in iteration t is therefore given by

θ_t = θ_{t−1} − η∇J(θ_{t−1}) − ηλ sign(θ_{t−1}).

The decay here is constant: if the ith parameter θ_i is greater than 0, we subtract ηλ > 0 from it; if it is negative, we add ηλ, in both cases pushing it towards 0. Due to the resulting sparsity, L1 regularization is often considered a feature selection method; we ignore features with 0 coefficients.
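The two decay behaviors, multiplicative for L2 and constant for L1, can be contrasted in a minimal NumPy sketch (toy code with an assumed zero data gradient, so only the penalty terms act; λ and η are arbitrary):

```python
import numpy as np

def l2_step(theta, grad, lr=0.1, lam=0.01):
    # grad of lam * ||theta||_2^2 is 2 * lam * theta:
    # multiplicative shrinkage, proportional to each weight's size.
    return theta - lr * (grad + 2 * lam * theta)

def l1_step(theta, grad, lr=0.1, lam=0.01):
    # grad of lam * ||theta||_1 is lam * sign(theta):
    # a constant-size decay of lr * lam toward zero.
    return theta - lr * (grad + lam * np.sign(theta))

theta0 = np.array([1.0, -0.5])
zero_grad = np.zeros(2)
l2_theta = l2_step(theta0, zero_grad)  # each weight shrinks by factor 0.998
l1_theta = l1_step(theta0, zero_grad)  # each weight moves 0.001 toward 0
```

The constant L1 decay drives small weights all the way to (and past) zero, producing sparsity, while the proportional L2 decay only pulls weights toward zero without ever reaching it.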

Noise Injection
To improve the robustness of the network to variations of the inputs, random noise such as Gaussian noise can be added to the inputs during the training process. In this way, the network is no longer able to memorize the training data because they are continually changing. Noise added to the input data can be viewed as a form of data augmentation.
Noise is usually added to the inputs, but it can also be added to the weights, gradients, hidden layer activations (to improve robustness of optimization process) and labels (to assure robustness of the network to incorrectly labeled data). The Dropout regularization technique is one way of adding the Bernoulli noise into the input and hidden units.
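Input noise injection amounts to one line per mini-batch, sketched here in NumPy (an illustration with a synthetic batch; σ = 0.1 is one of the values evaluated later in the experiments):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian_noise(x, sigma=0.1):
    # Fresh noise is drawn on every call, so across epochs the network
    # never sees exactly the same input twice.
    return x + rng.normal(0.0, sigma, size=x.shape)

batch = np.zeros((16, 28, 28))   # stand-in for a mini-batch of images
noisy = add_gaussian_noise(batch)
```

Because a new noise sample is drawn each time, applying this to every mini-batch effectively presents an ever-changing training set, which is exactly why it acts as a form of data augmentation.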

Dropout
To prevent complex co-adaptations of model units to the training data that can lead to overfitting, the Dropout [32] regularization technique drops units randomly during training, in each iteration, with some probability p given in advance. The probability p of dropping units can be defined layer-wise, with different probabilities for different layers. During the test phase, all units are kept, with the corresponding weights multiplied by the probability p_keep of keeping the given unit during the training phase. Dropout during training and testing is illustrated in Figure 1. Dropout can also be interpreted as:
• adding Bernoulli noise into the input and hidden units, where the noise can be seen as the loss of the information carried by the dropped unit; and
• averaging the outputs of approximately 2^n subnetworks that share weights, obtained from the original network with n units by randomly removing non-output units.
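The train/test asymmetry described above can be sketched in a few lines of NumPy (an illustration of the original Dropout scheme with test-time scaling, applied to a synthetic activation vector):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.5, training=True):
    if training:
        # Drop each unit independently with probability p (Bernoulli mask).
        return a * (rng.random(a.shape) >= p)
    # Test phase: keep every unit, scaled by the keep probability.
    return a * (1.0 - p)

activations = np.ones(1000)
train_out = dropout(activations, p=0.5)
test_out = dropout(activations, p=0.5, training=False)
```

At training time roughly half the units are zeroed out; at test time all units fire, scaled by p_keep = 1 − p so that the expected activation matches the training phase.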

Data Augmentation
Overfitting can be addressed by adding new data to the training set. Acquiring new useful training data and the required labeling is a "painful" task, often infeasible in practice. Data Augmentation is a regularization technique used to artificially enlarge the training set by generating new training data through different transformations applied to the existing data. When working with labeled data, one must be careful not to apply a transformation that can change the correct label. New data can be generated before (preferred when a smaller dataset is used) or during the training process. Examples of transformations that can be applied to image data are resizing, scaling, random cropping, rotation and illumination changes.
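Two of the label-preserving transformations mentioned above, horizontal flipping and random cropping, can be sketched in NumPy (an illustrative single-channel example; the pad size of 4 pixels is our assumption, not a setting from the experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(img, pad=4):
    # Random horizontal flip plus a random crop from a zero-padded image:
    # two label-preserving transformations commonly used on CIFAR-10.
    if rng.random() < 0.5:
        img = img[:, ::-1]                       # horizontal flip
    padded = np.pad(img, ((pad, pad), (pad, pad)))
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + img.shape[0], left:left + img.shape[1]]

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
out = augment(img)
```

Applied afresh to every mini-batch during training, this produces a different variant of each image in each epoch while keeping the image size and the correct label unchanged. (For flips in particular, one must check that the label survives: a flipped shoe is still a shoe, but a flipped "6" is not a "6".)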

Ensemble Learning
Ensemble methods combine predictions from several models to reduce generalization error. Prediction of the ensemble is obtained by averaging predictions from ensemble members (weighted or unweighted average) or using majority vote for classification tasks. The averaging "works" because different models often make different mistakes.
Because neural networks incorporate a significant amount of randomness (parameter initialization, mini-batch choice, etc.), one neural network model trained multiple times using the same training data can be used to construct an ensemble. The most significant improvements in generalization ability are obtained when ensemble members are either trained on different data or have different architecture. With deep neural networks, the two mentioned approaches are challenging to implement for several reasons:

• Training multiple neural networks is computationally expensive.
• Constructing an ensemble of neural networks with different architectures requires fine-tuning the hyperparameters for each of them.
• Training one deep neural network requires a large amount of data; training k networks on entirely different datasets requires k times more training data.

Bagging
Bootstrap aggregating (bagging) [33] is an ensemble method focused on reducing the variance of an estimate. With bagging, an ensemble of neural networks with the same architecture and hyperparameter settings can be constructed.
To construct an ensemble with k members, k new training datasets are created from the available dataset D_train by sampling with replacement. The bagging scheme is illustrated in Figure 2. Differences between ensemble members are induced by the random selection of data during sampling.
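The two ingredients, bootstrap sampling and prediction averaging, can be sketched as follows (toy NumPy code; the member probabilities at the end are made-up numbers for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

def bootstrap_indices(n_examples, k):
    # Each ensemble member trains on n_examples indices drawn with
    # replacement, so every bootstrap set has duplicates and omits
    # some of the original examples.
    return [rng.integers(0, n_examples, size=n_examples) for _ in range(k)]

def ensemble_predict(member_probs):
    # Unweighted average of the members' class probabilities, then argmax.
    return np.mean(member_probs, axis=0).argmax(axis=-1)

datasets = bootstrap_indices(1000, k=5)
probs = np.array([[[0.6, 0.4]], [[0.2, 0.8]], [[0.9, 0.1]]])  # 3 members, 1 example
prediction = ensemble_predict(probs)
```

Averaging helps precisely because the members disagree: here two of three members lean toward class 0, and the averaged probabilities settle the vote.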

Early Stopping
Training deep models is challenging. One of the challenges is deciding how long to train the model. If the model is not trained long enough, it will not be able to learn the underlying mapping from inputs to outputs; it will underfit. On the other hand, if it is trained too long, there will be a point during training when the model stops learning generalizable features and starts to learn the statistical noise in the training data, i.e., starts to overfit. Early Stopping is a regularization method that terminates the training algorithm before overfitting occurs. During training, the generalization error is empirically estimated using a validation set. The training algorithm stops when an increase in validation error is observed, and the parameters with the lowest validation error are returned rather than the latest ones. However, the real validation error curve is not "smooth"; it can still go down again after it has begun to increase. Because of that, it is not ideal to stop the training immediately after an increase in validation error is observed; the stopping is often delayed for some predefined number of epochs called the patience. Stopping criteria involve a trade-off between training time and final generalization performance. The results of experiments made by Prechelt [34] show that criteria which stop training later on average lead to improved generalization compared to criteria that stop training earlier; however, the additional training time required by the "slower" criteria is rather large on average and varies significantly.
The Early Stopping method can be used to find the optimal number of epochs to train the model. After the hyperparameter number of epochs has been tuned in that way, the model can be retrained using all data (including validation set) for the obtained number of epochs.
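A patience-based stopping criterion of the kind described above can be sketched in plain Python (an illustrative helper of our own, not the tf.keras `EarlyStopping` callback; the loss sequence at the end is made up):

```python
class EarlyStopper:
    """Stop when validation loss has not improved for `patience` epochs,
    remembering the best epoch so its parameters can be restored."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def should_stop(self, epoch, val_loss):
        # Returns True once `patience` consecutive epochs fail to improve.
        if val_loss < self.best_loss:
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate([1.00, 0.80, 0.90, 0.95, 0.91, 0.93]):
    if stopper.should_stop(epoch, loss):
        stopped_at = epoch
        break
```

Note that the small rebound at epoch 4 (0.95 → 0.91) does not reset the counter, since it is still worse than the best loss seen so far; training stops after the third non-improving epoch, and `best_epoch` points at the model one would restore and, following the retraining recipe above, at the epoch budget for the final run on all data.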

Baseline Model Architectures and Dataset Description
The goal of this experimental study was to quantify the effect of the used optimizer and regularization technique on the training process and final generalization performance of a given model. Experiments were performed using three baseline convolutional neural network (CNN) model architectures and two datasets. For implementation, the TensorFlow framework, specifically tf.keras, was used.

Baseline Model Architectures
For Model 1, we used the CNN-C architecture from [35]. The Model 2 architecture was inspired by VGG-16 [16], consisting of stacked convolutional layers followed by a pooling layer, with dense layers incorporated before the output layer. Model 3, the largest model we used (in terms of the number of learnable parameters), has an AlexNet-like architecture [36], consisting of stacked convolutional layers followed by a pooling layer, with 3 × 3 receptive fields and without the last pooling layer. Detailed descriptions of the architectures are given in Table 1. The same seed was used for parameter initialization across all models.

Datasets
As training data, we used: (i) standard benchmark CIFAR-10 [18] dataset, which consists of 60,000 32 × 32 colored images divided into ten categories; and (ii) Fashion-MNIST [19] dataset, comprising 70,000 28 × 28 grayscale images of fashion products (clothes and shoes) from 10 different categories. The original training data were split into two parts: training data and validation data; 20% of original training data were used for validation and the rest for training. All models were trained with mini-batches of size 128. Models which use CIFAR-10 dataset were trained for 350 epochs, while the ones that use Fashion-MNIST were trained for 250 epochs. To obtain an unbiased estimate of the generalization error, validation data were used for tuning of hyperparameters and analysis of the learning process, while test data were used only for final evaluation.

Results
In this subsection, we give a comparative analysis of different optimization and regularization techniques based on empirical evaluations of generalization performance and on visualizations of the models' learning curves, i.e., the behavior of the loss and accuracy during training, both on the data used for learning and on new data. In the experimental part, we use the term loss instead of the term cost to denote the value of the function J that is minimized during training, as is the case in most deep learning frameworks.

Evaluation of Optimization Algorithms
The following observations about the influence of the used optimization algorithm on the behavior and final generalization performance of the CNN model are based on empirical evaluations on three different model architectures, each trained on two datasets with the nine distinct optimizers reviewed in Section 2. The hyperparameters of the optimizers used for training each model are given in Appendix B. Figures 3 and 4 show the loss and accuracy learning curves of the given models, and the final results on the test and training sets are reported in Tables 2 and 3. The following observations are made: (i) The best test set results, in terms of accuracy, are obtained using the classical Nesterov optimization algorithm and the adaptive optimization algorithm Adam and its variant AdaMax. (ii) Compared to the Nesterov optimization algorithm, Adam and AdaMax show less stable performance (with many "jumps") on validation data. The most stable, though not necessarily the best, performance on validation data, especially in terms of loss, among the adaptive optimizers is shown by the Adagrad and Adadelta optimization algorithms.
(iii) The RMSProp optimization algorithm, in all six cases, has a considerably larger validation loss than the other optimizers, and it keeps growing consistently. Interestingly, despite the great discrepancy between the loss of RMSProp and that of the other optimizers, its validation and, finally, test set accuracy remains reasonably good and comparable with the others. (iv) In terms of test set accuracy, the ranking of the classical optimizers stays consistent across all six models: Nesterov ranked best, followed by Momentum, with SGD last. The ranking of the adaptive optimizers places the Adagrad optimizer in last place, closely followed by the RMSProp optimizer.
In the rest of the article, we examine how incorporating different regularization methods and the Batch Normalization technique affects the generalization performance of a given model. For further investigation, we used one optimizer and model architecture for each dataset. Namely, on CIFAR-10 data, we used the Nesterov optimizer with the first model architecture, Adam with the second and again Nesterov with the third one, and we refer to them as baseline model architectures. Analogously, for further research on Fashion-MNIST data, as baseline model architectures we used Model 1 and Model 2 with the Adam optimizer and the Model 3 architecture with the AdaMax optimization algorithm.

Batch Normalization
Incorporating Batch Normalization into the baseline model architectures, as can be seen in Table 4, showed beneficial effects on their final generalization performance. In all cases, the test set loss is significantly reduced, while the accuracy on test data increased in four out of six baseline models. In Figures 5 and 6, we can see how the validation loss learning curve in all cases drops significantly below the original one. In models that use the Adam optimization algorithm for training (Model 2 on CIFAR-10 and Models 1 and 2 on Fashion-MNIST), we can see jumps in the values of both training and validation loss ("spikes" on the learning curves). Even with these kinds of instabilities, the validation loss is still improved over the baseline's original one. On the training loss curve for the first model architecture, we can see how Batch Normalization can accelerate convergence. From the given learning curves, we also notice that overfitting is reduced in all cases. Because of that, Batch Normalization is sometimes referred to as an optimization technique with a regularizing effect.

Evaluation of Different Regularization Techniques
In this section, we add different types of regularization into the chosen baseline model architectures to examine their effect on the model's generalization performance.

Weight Decay
Adding L2 and L1 regularization to the baseline models did not, in general, improve generalization performance. As we can see in Table 5, on the Fashion-MNIST dataset, neither L2 nor L1 regularization increases the test set accuracy. However, applying L1 regularization on CIFAR-10 increases accuracy and decreases loss on the test data for baseline Model 1, while adding L2 regularization has a beneficial effect on the performance of baseline Models 2 and 3. For the parameter λ, for both L2 and L1 regularization, we chose the value from the predefined set {10−2, 10−3, 10−4, 10−5, 10−6} that performed best on the validation set. If two neighboring values performed similarly on the validation data, we additionally evaluated their midpoint on the validation data as a candidate value for λ. Figures 7 and 8 show that both L2 and L1 regularization reduce all six models' validation loss during training. For the third model in Figure 7, penalizing the model's weights notably slows down convergence; the model needs more than 300 epochs to reach the loss that the baseline model reached before epoch 50. However, after epoch 200, the penalized model's validation loss falls below the baseline's despite the slower learning process.
From the results in Figure 9 and Table 6, we can gain insight into the effect that the added weight penalties have on the final model weights. The obtained results justify the name weight decay: in both cases, the resulting weights of the regularized models are closer to 0 than the weights of the baseline (Model 1 with the NAG optimizer). Most weights of the model that uses L1 regularization are ≈0; the obtained model has sparse weights.

Figure 8. Loss learning curves of models that incorporate L2 and L1 weight decay in the baseline models trained on the Fashion-MNIST dataset.

Figure 9. Comparison of the baseline's weights with the weights obtained by models that use weight decay regularization methods with regularization parameter λ = 5 × 10−5.

Interestingly, the L1 model has the most dispersed weights and the widest range of absolute weight values. This can be seen as feature selection: in each layer, weights that correspond to irrelevant features are set to ≈0, while the more important features are emphasized by increasing their absolute value and, in that way, also increasing their influence on the final output. L2 regularization has a similar effect; the main difference is that with L2 regularization the less important weights are "pulled" towards 0 but not actually set to 0.
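As a minimal sketch (plain NumPy rather than the exact training code used in our experiments), the penalty terms that L2 and L1 weight decay add to the data loss take the following form; the weight values and the data loss are illustrative:

```python
import numpy as np

def l2_penalty(weights, lam):
    # L2 weight decay: lam * sum of squared weights;
    # pulls weights towards 0 without zeroing them out.
    return lam * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights, lam):
    # L1 weight decay: lam * sum of absolute weights;
    # tends to produce sparse weights (many set to ~0).
    return lam * sum(np.sum(np.abs(w)) for w in weights)

# Illustrative per-layer weights and the regularization strength from Figure 9.
weights = [np.array([[0.5, -0.2], [0.1, 0.0]])]
lam = 5e-5
data_loss = 0.42  # hypothetical cross-entropy value
total_loss = data_loss + l2_penalty(weights, lam)
```

Minimizing the total loss therefore trades off fitting the data against keeping the weights small, which is exactly the shrinkage effect visible in Figure 9.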

Noise Injection
Adding (Gaussian) noise to input images did not, in general, improve generalization performance. We experimented with different amounts of added noise: Gaussian noise with standard deviation σ ∈ {0.01, 0.05, 0.1, 0.2}. The final model was trained with the parameter σ that performed best on the validation data. Examples of noise-injected images from the CIFAR-10 and Fashion-MNIST data are shown in Figure 10. The results are given in Table 7. Only in one case do we observe a slight improvement in the test set accuracy, while accuracy on the training set remains close to 100%. Adding noise makes the real classes harder to separate. The models have the capacity to learn the available training data (training accuracy is, in the worst case, reduced to 99.74%), but the learned separation criterion also captures the injected noise.

Dropout
Adding Dropout to the baseline models improves generalization performance. Dropping units during training introduces a significant amount of regularization into the model and greatly reduces signs of overfitting. As we can see in Figures 11 and 12, the validation loss of all models on both CIFAR-10 and Fashion-MNIST decreases at the expense of slower convergence and slightly worse final performance on the training data. In Table 8, we compare the results obtained with models using Dropout to the baseline models' results. All models use Dropout with parameter p = 0.5 on the hidden layers and p = 0.1 on the input layer; both values were chosen using the validation data. Although the original paper [20] states that Batch Normalization fulfills some of the goals of Dropout and therefore removes the need for the Dropout regularization method, the results reported in Table 8 and the accuracy learning curves in Figures 13 and 14 (Model 1 and Model 2) show that the combination of these two techniques can benefit the final performance of a given model.
In all four cases, both validation and training accuracy increase compared to the Dropout-only model. Including Batch Normalization in Model 1 and Model 2 also speeds up the learning process. However, in the case of Model 3, which has the largest number of parameters, the Dropout-Batch Norm combination indeed harms the model's final classification performance. In Figure 13, the validation accuracy learning curve of Model 3 drops significantly when we introduce Batch Normalization together with Dropout. Although Table 8 reports the accuracy results on the original test sets, the effect of dropping input units becomes most apparent on the new test sets. On the new CIFAR-10 test set, (NAG) Model 1, which used Dropout without dropping input units, achieves 13.47% accuracy, while the final evaluation of the model that dropped inputs with probability p = 0.1 during training yields 81.28% accuracy on the new data. On the new Fashion-MNIST test set, (Adam) Model 1, which applies Dropout only to hidden units, achieves 83.21% accuracy, while the accuracy of the one that also drops inputs is 93.17%.
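The Dropout scheme used above (p = 0.5 on hidden layers, p = 0.1 on the input layer) can be sketched as "inverted dropout" in NumPy; frameworks such as TensorFlow implement the same idea internally, so this is an illustration rather than the code we trained with:

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Inverted dropout: during training, zero each unit with probability p
    and rescale the survivors by 1/(1-p) so the expected activation is
    unchanged; at test time the layer is the identity."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)

activations = np.ones((4, 8))
hidden = dropout(activations, p=0.5)   # roughly half the units zeroed, rest scaled to 2.0
```

Because of the 1/(1-p) rescaling, no weight scaling is needed at test time, which is why the test-time network can simply use all units.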

Data Augmentation
During training, we augmented the CIFAR-10 images using horizontal flipping, width and height shifting, random zooming, and shearing. For Fashion-MNIST augmentation, we used horizontal flipping and random zooming. Figure 16 shows examples of augmented images for a given image from the CIFAR-10 and Fashion-MNIST data. Table 9 gives the results of models that incorporate Data Augmentation compared with the initial baseline results obtained without regularization. Training with augmented data leads to enhanced model performance in all cases. The positive effect of Data Augmentation on generalization performance is more noticeable on CIFAR-10 than on Fashion-MNIST, owing to CIFAR-10's large variations in object position and background clutter. Including Data Augmentation alone in the training pipeline increases the test set accuracy on CIFAR-10 by 5.86 percentage points in the worst case. Combining Data Augmentation with Batch Normalization and Dropout in Model 1 and Model 2 further improves generalization performance. On CIFAR-10, Dropout with parameter p = 0.25 is combined with Data Augmentation, while p = 0.5 is used in the case of Fashion-MNIST. Figures 17 and 18 show the effect of Data Augmentation on the learning curves of models trained on CIFAR-10 and Fashion-MNIST; augmentation reduces validation loss and increases validation accuracy at the expense of slower convergence and worse results on the training data.

Early Stopping
Figure 3 shows how the validation loss of the models trained with the different optimizers (all except the slower-converging SGD optimizer) reaches its minimum value even before epoch 50 and afterward only increases. Moreover, Figure 4 shows that there is also no notable improvement in validation accuracy after epoch 100 for most optimizers. Therefore, we could stop the training earlier and obtain a model with roughly the same generalization ability.
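Returning briefly to the Data Augmentation transforms described above: in practice we relied on library routines (e.g., Keras' ImageDataGenerator), but two of the CIFAR-10 transforms, horizontal flipping and shifting, can be sketched in NumPy as follows; the image values and shift range are illustrative:

```python
import numpy as np

def random_flip_shift(image, rng, max_shift=2):
    """Randomly flip an HxWxC image horizontally and shift it by up to
    max_shift pixels along each axis (vacated pixels filled with zeros)."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                       # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.zeros_like(image)
    h, w = image.shape[:2]
    # Copy the overlapping region between the original and shifted positions.
    shifted[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        image[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return shifted

rng = np.random.default_rng(42)
img = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
augmented = random_flip_shift(img, rng)
```

Applying such label-preserving transforms on the fly effectively enlarges the training set, which is the mechanism behind the accuracy gains reported in Table 9.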
To reduce the training time and prevent possible overfitting to the training data, we used the Early Stopping method with patience 30, stopping the training if there was no improvement in validation accuracy for 30 consecutive epochs. The model with the weights that correspond to the best observed validation accuracy is returned as the result of the training algorithm. Another metric that could be monitored during training and used to decide when to end the training is the validation loss; it would also be reasonable to stop the training (with some patience) when an increase in the validation loss is observed. Because we are primarily interested in the accuracy of the final model, we decided to monitor validation accuracy (during training, we minimize the loss instead of maximizing accuracy because the loss function has some "nice" properties, such as differentiability).
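The stopping rule described above corresponds to Keras' EarlyStopping callback with monitor='val_accuracy', patience=30, and restore_best_weights=True; its core logic can be sketched in plain Python, using a hypothetical per-epoch validation accuracy sequence:

```python
def early_stopping(val_accuracies, patience=30):
    """Return (best_epoch, stop_epoch): the epoch with the best validation
    accuracy (whose weights would be restored) and the epoch at which
    training stops after `patience` epochs without improvement."""
    best_acc, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch          # patience exhausted: stop early
    return best_epoch, len(val_accuracies) - 1  # training ran to completion

# Hypothetical curve: accuracy peaks at epoch 2, then plateaus; with
# patience=3 the training stops at epoch 5 and epoch 2's weights are kept.
best, stop = early_stopping([0.70, 0.80, 0.85, 0.84, 0.83, 0.82], patience=3)
```

With patience 30, as in our experiments, short plateaus do not trigger a stop; only a sustained lack of improvement ends the training.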
The final accuracies obtained by models that use Early Stopping and those that do not are compared in Table 10. Although model accuracy on new data improves in some cases, the test performance of models that use Early Stopping often declines; however, the training time is significantly reduced. To achieve better final accuracy, larger values of the patience parameter can be used. In a sense, the Early Stopping method can be seen as a trade-off between training time and the final performance of the model. For example, the accuracy of Model 1 trained on CIFAR-10 data with Dropout regularization decreases from 87.73% to 84.51% when Early Stopping is used, but the training time is reduced by more than half. If training time is not an issue, the model can be trained for longer while saving the parameters that yielded the best values of the monitored quantity. Table 10 also shows that the Data Augmentation technique yields the best accuracy among the previously mentioned well-performing models that incorporate only one regularizer.
Ensemble Learning
Each model from the bagged ensemble has lower accuracy than the baselines noted in the last rows of Tables 11 and 12, caused by the less diverse training dataset, which contains multiple identical images; together, however, they outperform the baseline. Each ensemble has better generalization performance than any of its members, and the accuracy of an ensemble increases with its size. The generalization performance of the models that obtained the highest accuracy on the test data increases further when we apply the bagging technique, as shown in Table 13. The downside of the Bagging method is the additional time needed to train all of the base learners to obtain the desired improvement in generalization performance.
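Combining the bagged members' outputs can be sketched as averaging their class-probability predictions; the two "member" probability matrices below are hypothetical stand-ins for the CNNs' softmax outputs:

```python
import numpy as np

def ensemble_predict(member_probs):
    """Average the class-probability outputs of all ensemble members
    (shape: n_members x n_examples x n_classes) and return the
    predicted class index for each example."""
    avg = np.mean(member_probs, axis=0)   # shape: (n_examples, n_classes)
    return np.argmax(avg, axis=1)

# Two hypothetical members; they disagree on example 0, and averaging
# their probabilities resolves the disagreement.
m1 = np.array([[0.6, 0.4], [0.2, 0.8]])
m2 = np.array([[0.3, 0.7], [0.1, 0.9]])
preds = ensemble_predict(np.stack([m1, m2]))
```

Because each member is trained on a different bootstrap sample, their errors are partly independent, and averaging reduces the variance of the combined prediction.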
• The ensemble of members trained with different settings

Below, we examine how ensembling models with different architectures and settings (in terms of the used regularization and optimization techniques) affects the ensemble's generalization performance compared to the bagging approach.
Final accuracies of such ensembles on CIFAR-10 and Fashion-MNIST data are given in Table 14.

Conclusions
In this paper, we summarize different optimization algorithms and regularization methods commonly used for training deep model architectures. An empirical analysis was conducted to quantify and interpret the effect of the employed optimization algorithm and regularization techniques on the model's generalization performance on an image classification problem. The provided theoretical background, accompanied by experimental results of the learning process, can be beneficial to anyone who seeks deeper insight into the fields of optimization and regularization in deep learning. Where possible, visualizations are used together with experimental evaluations to corroborate claims and intuitions about the effect of the mentioned methods on the learning process and the model's final performance on new data.
The empirical evaluations suggest that the optimization algorithm alone can positively affect a model's generalization performance. The Nesterov and Adam optimization algorithms performed best on new data in most of our settings. However, no optimization algorithm should be discarded a priori; an evaluation is advisable to select the most appropriate one for the given architecture and dataset at hand. Generalization performance can be notably enhanced with proper regularization. The regularization techniques from which the implemented CNN architectures gained the most significant improvement in generalization performance were Data Augmentation and Dropout. An appropriate combination of regularization techniques can lead to an even greater boost in the model's final generalization performance. Batch Normalization, an optimization method with a regularizing effect, seems to work well in combination with the Data Augmentation technique. In our experimental settings, the largest generalization gain was obtained using the combination of Batch Normalization and Data Augmentation together with the Dropout regularization method. However, one should combine Batch Normalization and Dropout with caution, since their combination can result in underperformance. To improve generalization performance further, training multiple models to form an ensemble can be beneficial (given the availability of computational resources). To speed up the training and still obtain a model with reasonable generalization performance, the Early Stopping method can be used.
It is important to mention some limitations of the conducted evaluations. In this work, regularization is used to complement the optimizer's performance in order to explore the extent to which generalization performance can be improved, thus focusing the evaluation of regularization techniques on the best optimizer for each of the three CNN architectures and two benchmark datasets. It would be interesting to expand the experimental evaluations to examine the extent to which the regularizers would yield favorable results with other, lower-performing optimizers. Most of the mentioned techniques are applicable to a wide range of problems; therefore, it would be interesting to extend the experimental evaluations to different neural network architectures and problems from different domains. Further research within this scope could include a more detailed examination of techniques associated with optimization, such as learning rate schedules and different weight initialization schemes, and their effect on generalization performance.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix B. Used Hyperparameters for Optimizer Per Model
The list of all used hyperparameters, denoted in accordance with the TensorFlow documentation, can be found in Table A2 for models trained on the CIFAR-10 dataset and in Table A3 for models trained on the Fashion-MNIST data.

Table A2. Hyperparameters used on CIFAR-10 data.

The optimizers' hyperparameters were tuned using the grid search technique; for each model, the best-performing value on the validation data was chosen. The learning rates were chosen from a predefined set of values on a logarithmic scale, {1, 10−1, 10−2, 10−3, 10−4, 10−5}, while the momentum hyperparameter was chosen from {0.9, 0.95, 0.99}. When two neighboring values performed similarly, their midpoint was also considered as a candidate. In the following, we highlight the hyperparameter values that, in our settings, differ from the defaults in the TensorFlow documentation. Adadelta's learning rate that yielded the best results on the validation data was mostly 0.5 (once even 1), which differs significantly from the default value of 0.001. Similarly, for the slower-converging Adagrad, a larger learning rate of 0.05 or 0.01 was used instead of the default 0.001. On the other hand, the chosen Nadam and Adam learning rates were often smaller than the default value of 0.001.

Table A3. Hyperparameters used on Fashion-MNIST data.
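The selection procedure described above (pick the best value from a predefined grid and additionally try the midpoint of two similarly performing neighbors) can be sketched as follows; `validation_score` stands in for a full train-and-evaluate run and is purely hypothetical:

```python
def grid_search(candidates, validation_score, close=0.005):
    """Pick the candidate with the best validation score; whenever two
    neighboring candidates score within `close` of each other, also
    evaluate their midpoint as an extra candidate."""
    candidates = sorted(candidates)
    scores = {c: validation_score(c) for c in candidates}
    for a, b in zip(candidates, candidates[1:]):
        if abs(scores[a] - scores[b]) < close:
            mid = (a + b) / 2
            scores[mid] = validation_score(mid)
    return max(scores, key=scores.get)

# Hypothetical score peaking near lr = 0.05, between grid points 0.01 and 0.1,
# so the midpoint refinement finds a better value than the grid itself.
score = lambda lr: -abs(lr - 0.05)
best = grid_search([1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5], score, close=0.1)
```

In our actual tuning, each score came from training the model and measuring validation performance, so the grid and the `close` threshold were kept small to limit the number of training runs.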