A Comparison of Regularization Techniques in Deep Neural Networks

Abstract: Artificial neural networks (ANNs) have attracted significant attention from researchers because many complex problems can be solved by training them. If enough data are provided during the training process, ANNs are capable of achieving good performance. However, if the training data are insufficient, the predefined neural network model suffers from overfitting and underfitting problems. To solve these problems, several regularization techniques have been devised and widely applied to applications and data analysis. However, it is difficult for developers to choose the most suitable scheme for an application under development because there is no information regarding the performance of each scheme. This paper describes comparative research on regularization techniques by evaluating the training and validation errors in a deep neural network model, using a weather dataset. For comparison, each algorithm was implemented using TensorFlow, a recent neural network library. The experiment results showed that the autoencoder had the worst performance among the schemes. When the prediction accuracy was compared, data augmentation and batch normalization showed better performance than the others.


Introduction
Accurate weather forecasting plays a significant role in the development of several industrial sectors, such as agriculture and transportation. Many companies use weather prediction techniques to analyze consumer demand. In addition, accurate forecasts help people organize and plan their days. However, it is very difficult to predict the weather precisely because the atmosphere changes dynamically. For a long time, physical simulation was the most widely used scheme. With this method, the current atmospheric condition is sampled, and future conditions are predicted by comparing thermodynamic characteristics. In recent years, artificial neural networks (ANNs) have been widely used for weather prediction because they achieve better performance by learning from data.
The human brain is composed of approximately 100 billion interconnected neurons. These neurons are the core cells responsible for transmitting information using electrochemical signals. ANNs were modeled on a mechanism inspired by the human brain's information processing. This scheme was first introduced in 1943 by Warren McCulloch and Walter Pitts [1], and it is currently used in almost every scientific area to solve complex problems. Williams [2] presented the efficiency of machine learning algorithms and showed that they could be applied to many applications. Nicholas [3] proposed an enhanced scheme to train neural network algorithms, in which a statistical computation scheme was used to reduce the training errors. Zhang [4] [...] most important object, which expresses the most outstanding characteristic from an image. To do this, they used additional metadata together with a CNN. Yue [30] applied neural networks to detecting collisions between cars. The video stream captured from the traffic system was very complex and dynamic, which made collision detection more difficult. To solve this problem, they presented an enhanced DNN algorithm based on feature enhancement. Huang [31] used a neural network to improve the detection accuracy of traffic monitoring systems. One of the problems in moving object detection is that the accuracy drops when there are too many moving objects in a video stream. To solve this problem, they used a DNN to detect moving objects accurately. Akcay [32] used a deep convolutional neural network (DCNN) to classify and detect objects in X-ray images scanned in an airport. By using this, they could save the time and expense spent on investigating and detecting dangerous objects. Sevo [33] used a CNN to detect objects in images captured from the air. It is very difficult to detect an object from the air because all scenes in the air are expressed as three-dimensional (3D) images. By using a CNN, they could decrease the complexity of aerial object detection. Image processing is one of the most widely known areas in which neural networks are applied.
Woźniak [34] presented an enhanced object detection method in which convolutional neural networks were combined with the analysis of clustered numbers. To determine the points of clusters, they used fuzzy logic. Vieira [35] presented methods and applications of deep learning applied to neuroimaging. Neuroimaging is used to image the structure of the human brain in order to treat mental disease. In their paper, they argued that deep learning could be an efficient method for improving brain image quality by training the neural network. Polap [36] described a practice in which an ANN was applied to detect potential diseases from body skin. In the method, skin data were collected using motion sensors and a camera, and an ANN model was trained using the data. Then, using the model, they determined whether the skin was diseased. Heaton [37] described an application of deep-learning stochastic models in finance, where deep learning was used to predict and classify financial data. Most doctors in hospitals use X-rays to classify carcinomas in chest organs. However, this can cause wrong diagnoses because it is difficult for radiologists to interpret the X-ray results exactly. To solve this problem, Woźniak [38] applied neural networks to improve the accuracy of carcinoma classification. The experimental results showed that they reached a 92% classification accuracy. Litjens [39] described recent advancements of deep learning applications in analyzing images in the medical industry. In the paper, they reviewed more than 300 studies in this field. To give a concise review, they divided the application areas into 10 medical areas. Woźniak [40] presented a new method based on neural networks to detect defects of fruit peels, which was very different from the classical scheme. They devised an enhanced ANN algorithm called an adaptive artificial neural network (AANN). By using this method, they could improve calculation accuracy because it adapted to the input data and their characteristics. Wang [41] presented an overview of machine learning applications for manufacturing. With the help of widespread sensors and the Internet of Things (IoT), huge amounts of data can be collected in manufacturing systems. Deep learning can be used to improve system performance and product quality by analyzing the collected big data.
As described earlier, active research is being conducted on neural networks and on solutions to overfitting. However, there is no research that compares regularization schemes. Therefore, it is difficult for developers to choose the most suitable scheme for an application, because there is no information about the performance of each scheme. To solve this problem, this study presents comparative research on regularization techniques by evaluating the training and validation errors in a DNN model using weather datasets. In particular, choosing an appropriate regularization scheme is very important for managing huge augmented objects in an intelligent mobile augmented reality (IMAR) system.
The remainder of this paper is organized as follows. Section 2 describes the research methodology and experiment setup. In Section 3, experiment results are described and analyzed. Section 4 presents a discussion of the results. Finally, Section 5 concludes this work.

Methodology
The methodology of the comparative research is represented in Figure 1. The entire methodology consisted of 9 steps. To compare performance, several powerful regularization methods, including an autoencoder, data augmentation, batch normalization, and L1 regularization, were implemented and studied. First, we built a DNN model without any regularization method. We then trained and validated the model and calculated the errors. Next, we applied each regularization scheme and measured the errors of the same model without changing its settings. Each regularization scheme was analyzed by comparing the experiment results. Each step is described in more detail below.
First, the datasets used to train and validate the neural network model were collected. In our experiment, the Korean weather dataset was collected from the government website. It included 35 features: seven measurements (average temperature, maximum temperature, minimum temperature, average wind speed, average humidity, cloudiness, and daylight hours) for each of the previous five days. The average temperature of the next day was the target feature to predict. We let $x^{(i)}$ be the 35-dimensional feature vector for the ith set of five consecutive days, and let $y^{(i)}$ be the one-dimensional vector that contains the target feature for the ith single day. The prediction of $y^{(i)}$ for a given $x^{(i)}$ is expressed by Equation (1), in which $\theta$ refers to a subset of the 35-dimensional vector and $\sigma$ is an activation function. The cost function to be minimized is given by Equation (2), in which m is the number of training examples.
For supervised machine learning, the data are typically divided into two types, training and testing data. However, to tune the model better, validation data were used in addition. These validation data are referred to as the development dataset, or dev dataset. The goal of this dataset was to fine-tune the hyperparameters (architecture) of the ANN model. The model was evaluated on this dataset frequently but did not learn from it. This set was used to obtain the optimal number of hidden units. The dataset was divided using the scikit-learn library as follows. First, the data were split into training and temporary data: approximately 80% of the entire dataset was used as training data, and 20% was used as temporary data. The temporary dataset was then split into two equal parts, test and validation. Using the entire dataset for both training and verification would make the prediction less accurate and would increase the training errors, so it is better to use the divided dataset. We think that cross-sampling methods could be a good way to decrease the errors. However, we did not use these methods because applying them would have required changing the input algorithms. They can be applied in future work.
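The 80/10/10 split described above can be sketched in a few lines. The source performed the split with scikit-learn; the version below uses only the Python standard library so that it is self-contained, and the function name `split_dataset`, the seed, and the placeholder rows are assumptions made for illustration rather than the authors' code.

```python
import random


def split_dataset(rows, train_frac=0.8, seed=0):
    """Shuffle the rows, keep train_frac for training, and halve the remainder."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    train, temp = shuffled[:n_train], shuffled[n_train:]
    n_val = len(temp) // 2
    # The temporary part is divided equally into validation and test data.
    return train, temp[:n_val], temp[n_val:]


rows = list(range(1000))  # placeholder for 1000 weather samples
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))  # 800 100 100
```

Because the three parts are disjoint, the validation and test errors are measured on samples the model never saw during training.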
Next, we defined an appropriate architecture for the neural network. This process usually requires significant experience because many factors must be decided. One factor is the number of layers in the model. The basic model included one input layer, two hidden layers, and one output layer, as illustrated in Figure 2. The input layer contained 35 neurons, each hidden layer contained 50 neurons, and the output layer had 1 neuron; that is, the model had a 35-50-50-1 topology.
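To make the 35-50-50-1 topology concrete, the following minimal sketch builds such a network in plain Python and counts its trainable parameters. The weight initialization range, the linear output neuron, and all function names are assumptions for illustration; only the layer sizes and the ReLU activation come from the text.

```python
import random

LAYER_SIZES = [35, 50, 50, 1]  # input, two hidden layers, output


def init_params(sizes, seed=0):
    """Create one (weights, biases) pair per pair of adjacent layers."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def relu(values):
    return [max(0.0, v) for v in values]


def forward(params, x):
    """Propagate one sample; ReLU on hidden layers, linear output (assumed)."""
    for layer, (weights, biases) in enumerate(params):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if layer < len(params) - 1:
            x = relu(x)
    return x


params = init_params(LAYER_SIZES)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in params)
prediction = forward(params, [0.5] * 35)
print(n_params)  # 4401
```

The count follows from (35·50 + 50) + (50·50 + 50) + (50·1 + 1) = 4401, which shows how quickly the number of parameters grows with the hidden layer width.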
Symmetry 2018, 10, 648
After defining the model, we set the parameters, which are listed in Table 1. The columns list the basic DNN model and the regularization schemes, and the rows list the parameters for each model. The first parameter was the number of input neurons, which was set to 35 for most models, as previously described. The next parameter was the number of hidden layers, which was 2. In practice, this value can be increased or decreased according to the capability of the central processing unit (CPU): with a more capable CPU, the number can be increased because processing takes less time. In our experiment, 2 was the most appropriate value because the processing time increased exponentially when the value was greater than 3. The third parameter was the number of neurons in each hidden layer. A higher number gave better results, but there was a trade-off between the number and the processing time; in our experiment, the value was set to 50. The number of output neurons was set to 1 because there was only one target feature. The learning rate was set to 0.0001. Although training took a long time because this rate is low, the results were more reliable. The proximal Adagrad optimizer was used to optimize the model. The batch size was 100, and the maximum number of epochs was 100,000. The rectified linear unit (ReLU) was used as the activation function.
In the third step, the defined neural network model was trained using the original data, to which no regularization method was applied. During training, the root mean square errors (RMSEs) were captured and saved into a separate file. The RMSE value is obtained using Equation (3):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\mathrm{Answer}(X_i) - \mathrm{Predict}(X_i)\bigr)^{2}} \qquad (3)$$

In Equation (3), $\mathrm{Answer}(X_i)$ is the real answer data at time i, $\mathrm{Predict}(X_i)$ is the value predicted by the trained neural network model, and n is the number of compared values. In the fourth step, the defined neural network model was validated and the validation errors were captured using Equation (1). In the third and fourth steps, overfitting and underfitting were checked. In the fifth step, regularization methods were applied to the original weather data. The applied methods are described in more detail in Section 2.2. In steps 5 and 6, the defined model was trained and validated using the datasets. In step 7, the future temperature was predicted using the trained neural network model to which regularization methods were applied. Finally, the schemes were compared by analyzing the training, validation, and prediction errors.
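As a small worked example of the RMSE used throughout the experiments, the sketch below computes it for four hypothetical temperature values. The function name and the sample numbers are illustrative assumptions, not data from the study.

```python
import math


def rmse(answers, predictions):
    """Root mean square error between real values and predicted values."""
    if len(answers) != len(predictions) or not answers:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(answers, predictions)]
    return math.sqrt(sum(squared) / len(squared))


actual = [20.0, 22.0, 19.0, 21.0]      # hypothetical real temperatures
predicted = [21.0, 20.0, 19.0, 23.0]   # hypothetical model outputs
error = rmse(actual, predicted)
print(error)  # 1.5
```

The squared differences are 1, 4, 0, and 4; their mean is 2.25, and the square root gives 1.5, expressed in the same unit as the temperature itself.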

Applied Regularization Methods in the Experiment
This section describes the regularization methods used in the experiment. A widely used scheme for regularization is the autoencoder, a neural network with the same number of neurons in the input and output layers. This scheme uses unsupervised learning, because labels are not required when training the model. The scheme compresses the data received from the input neurons into a short code, and then decompresses this code into outputs that are very close to the input data. One of the goals of this scheme is to remove noise from the input data. The architecture is similar to the multilayer perceptron (MLP) structure, and it has at least one input, hidden, and output layer. This type of neural network consists of two parts, an encoder and a decoder. The encoder is the network component that compresses the input, and the decoder reconstructs the encoded input. In a simple autoencoder with a single layer, the encoder takes the input $x \in X$ and compresses it to $z \in Z$. The compressed data $z$ is calculated by Equation (4) [42], in which $z$ is known as the latent space representation; it is sometimes called a code or latent variable. Here, $\sigma$ is the activation function, such as the ReLU, sigmoid, or Leaky ReLU function, $W$ is the weight of the nodes, and $b$ is the bias vector. In the reconstruction process, the same operation is repeated, as shown in Equation (5) [43]. In the decoding process, the compressed data $z$ is mapped to $x'$, where $x'$ represents the transformed input data with the same dimension as the input $x$, and $\sigma'$ is the activation function used to decompress the data. $W'$ is the weight of the transformed nodes, and $b'$ is the bias in the decoder. To obtain satisfactory performance using the autoencoder scheme, the decoding loss should be minimized.
Sum squared errors (SSEs) or an RMSE function were used to measure the loss, as in Equation (6). From Equations (3) and (4), we derived Equation (7). By replacing $z$ using Equation (7), the final loss function was obtained as Equation (8) [30]. There are many types of autoencoder schemes. In our research, the stacked autoencoder was used to diminish the noise in the input data and to simplify the tuning of hyperparameters. The architecture of the scheme is illustrated in Figure 3. It contained one input layer, three hidden layers, and one output layer.
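The encode, decode, and loss steps can be illustrated with a tiny single-layer autoencoder. This is only a sketch under stated assumptions: it uses the common forms $z = \sigma(Wx + b)$ and $x' = \sigma'(W'z + b')$ with a sum-of-squares reconstruction loss, a ReLU encoder, an identity decoder activation, and hand-picked weights; none of these values are taken from the study.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def affine(weights, bias, v):
    """Compute W v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def encode(x, W, b):
    # z = sigma(W x + b), with sigma = ReLU
    return relu(affine(W, b, x))


def decode(z, W_dec, b_dec):
    # x' = sigma'(W' z + b'), with sigma' = identity (assumption)
    return affine(W_dec, b_dec, z)


def reconstruction_loss(x, x_rec):
    """Sum of squared errors between the input and its reconstruction."""
    return sum((a - r) ** 2 for a, r in zip(x, x_rec))


# Hypothetical 4 -> 2 -> 4 autoencoder that averages neighbouring inputs.
W = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
b = [0.0, 0.0]
W_dec = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
b_dec = [0.0, 0.0, 0.0, 0.0]

x = [1.0, 3.0, 2.0, 2.0]
z = encode(x, W, b)
x_rec = decode(z, W_dec, b_dec)
loss = reconstruction_loss(x, x_rec)
print(z, x_rec, loss)  # [2.0, 2.0] [2.0, 2.0, 2.0, 2.0] 2.0
```

The two-value code cannot represent the difference between the first two inputs, so that information is lost in reconstruction; training an autoencoder means adjusting the weights to make this loss as small as possible.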
Data augmentation is one of the most popular regularization techniques. The main idea of the scheme is to expand the training dataset by applying transformations, which decreases overfitting. This technique is commonly used in image processing, since image operations such as rotating, shifting, scaling, mirroring, or random cropping are easy to implement [44]. For data augmentation, it is important to control noise effectively. Several types of noise are available for the scheme; among them, Gaussian noise is the most widely used. It is expressed by Equation (9) [45], in which $\mu$ is the noise vector. This technique is effectively used for RNNs, whereas it is seldom used in feed-forward neural networks [46]. In this study, two augmentation techniques were implemented. The first type of data augmentation was to sum up partial datasets. Let $L_j^{(i)}$ represent the input feature data for weather prediction, where $j \in \{1, 2, \ldots, n\}$ indexes the features and $i \in \{1, 2, \ldots, m\}$ indexes the identical categorical features (in our case, the days). The final input using the augmented data for the jth categorical data is expressed by Equation (10), where $L_j$ is, for example, the average humidity or the average temperature. The number of input features is seven (n = 7), and the number of identical features is five (m = 5). The second augmentation summed the identical categorical features, as shown in Equation (11). Similar to the first data augmentation technique, the number of input neurons became 7. The number of operations in the model decreased as the number of input data decreased, thereby preventing overfitting.
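The reduction from 35 inputs to 7 can be sketched as follows. The exact form of the two augmentation equations is not recoverable here, so this example simply sums each of the seven measurements over the five days; the day-major ordering of the input vector and the function name `sum_over_days` are assumptions made for illustration.

```python
N_FEATURES = 7  # measurements per day (temperature, humidity, ...)
N_DAYS = 5      # previous days contained in one sample


def sum_over_days(x):
    """Collapse a 35-value sample into 7 values by summing each feature.

    Assumes day-major order: [day1 f1..f7, day2 f1..f7, ..., day5 f1..f7].
    """
    if len(x) != N_FEATURES * N_DAYS:
        raise ValueError("expected 35 values per sample")
    return [sum(x[d * N_FEATURES + j] for d in range(N_DAYS))
            for j in range(N_FEATURES)]


# Hypothetical sample in which every feature equals the day number.
sample = [float(d + 1) for d in range(N_DAYS) for _ in range(N_FEATURES)]
reduced = sum_over_days(sample)
print(len(reduced), reduced[0])  # 7 15.0
```

With seven inputs instead of 35, the first layer of a network with 50 hidden neurons needs 7·50 weights rather than 35·50, which is the decrease in operations that the text refers to.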
The third scheme for regularization was batch normalization, which was proposed by Sergey Ioffe and Christian Szegedy in 2015. With batch normalization applied to a DNN, regularization techniques such as dropout [47] or L2 regularization were not required to tune the model. Instead, this method focuses on reducing internal covariate shift [48]. In addition, the method reduces the training time of the model.
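A minimal sketch of the training-time batch normalization transform may help here: each feature is normalized with the mean and variance of the current mini-batch and then scaled and shifted. The scalar `gamma` and `beta`, the `eps` value, and the sample batch are simplifying assumptions; a real layer learns one scale and shift per feature and keeps running statistics for inference, which this sketch omits.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature column of a mini-batch (training mode only)."""
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(n_features)]
    return [[gamma * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta
             for j in range(n_features)]
            for row in batch]


# Two features on very different scales (hypothetical values).
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch)
for row in normalized:
    print([round(v, 3) for v in row])
```

After the transform, both columns have approximately zero mean and unit variance regardless of their original scale, so the next layer sees inputs with a stable distribution from batch to batch.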
The fourth scheme applied in the experiment was L1 regularization. L1 regularization is known as the least absolute shrinkage and selection operator (LASSO) and was introduced by Robert Tibshirani [49]. The main idea behind the scheme is to regularize the loss function by completely removing the irrelevant features from the model [36]. The scheme is expressed by Equation (12), in which $L(\hat{y}_i, y_i)$ is a loss function, m is the number of observations, $\hat{y}_i$ is the predicted value (whereas $y_i$ is the actual value), and $\lambda$ is a non-negative regularization parameter. The main objective is to minimize the function $f(w, b)$ by penalizing the weights in proportion to the sum of their absolute values. As $\lambda$ increases, the weights $w$ decrease. As $\lambda$ decreases, the variance increases.
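The effect of the L1 penalty can be illustrated with a short sketch. It assumes one common form of the objective, the mean loss over m observations plus $\lambda$ times the sum of absolute weights, with squared error as the per-sample loss; the function name and all numbers are hypothetical.

```python
def l1_regularized_loss(y_true, y_pred, weights, lam):
    """Mean squared error plus lam times the sum of absolute weights."""
    m = len(y_true)
    data_loss = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / m
    penalty = lam * sum(abs(w) for w in weights)
    return data_loss + penalty


y_true = [1.0, 2.0]
y_pred = [1.5, 1.5]
weights = [0.5, -1.5, 0.0]

loss = l1_regularized_loss(y_true, y_pred, weights, lam=0.1)
unregularized = l1_regularized_loss(y_true, y_pred, weights, lam=0.0)
print(round(loss, 4), unregularized)  # 0.45 0.25
```

The data term is 0.25 and the penalty adds 0.1 · (0.5 + 1.5 + 0) = 0.2. A weight that is exactly zero contributes nothing, which is why minimizing this objective tends to drive the weights of irrelevant features to zero.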

Experiment Setup
The hardware specification for the experiment was as follows. The desktop used was a Gigabyte Z97X-UD3H personal computer running Windows 10 with an Intel Core i7-4790K CPU and 8 GB of RAM. The simulations were implemented using the Python programming language and the TensorFlow library. This library is very popular among machine learning application developers. A neural network application can run on several CPUs and graphics processing units (GPUs) in parallel; supporting parallelism is one of the key features of the library. In addition, the library is available for multiple programming languages, such as Python, C++, and Java. There are many higher-level application programming interfaces (APIs) that work with TensorFlow. For instance, the Keras API, TFLearn, and Sonnet are provided to train models easily. In our study, we used the estimator, one of TensorFlow's newer higher-level constructs. There were many advantages to the estimator:
• Without changing the model, it could be run on local or distributed servers. In addition, without the need to recode the model, the estimator-based model could run on CPUs, GPUs, or tensor processing units (TPUs).
• It was much easier to develop a model with an estimator than with low-level TensorFlow APIs.
• It could build the graph for us.
The estimator API provided a common interface to the train, evaluate, and predict functions. To build our model, we used the DNNRegressor class in the tf.estimator package. There were many parameters in this class, but we focused on the major ones. Not all regularization techniques were available when using the DNNRegressor class. For example, for L1 regularization and the autoencoder, we had to use another type of API. For these, we used Keras, an open-source neural network library written in Python. As it can run on top of TensorFlow, it was easy to use the two libraries together.
For visualization, we used TensorBoard, a very powerful graph visualization tool released by Google's TensorFlow team. This tool is used not only for graph visualization but also to plot quantitative metrics on the execution of a graph and to show additional data (e.g., images) that pass through it. Moreover, using this tool, a programmer can debug a model easily. Another tool used for visualization in this research was the Matplotlib plotting library, which provides a very wide range of visualization techniques for the Python programming language.

Results
This section discusses the results. First, the results of the DNN model trained without any regularization technique are described. Next, the results of the DNN models with regularization techniques are presented. Since overfitting problems are much more visible in pictures, our final results are visualized as graphs whose axes are the error value and the epoch number. After training, very low RMSE values may appear to indicate very good accuracy; however, in some cases, they indicate a problem such as overfitting.
We first present the experiment results of the DNN without regularization. As previously discussed, our model was established with the settings shown in Table 1. After training for 100,000 epochs, the RMSEs for the training and validation data were plotted, as shown in Figures 4 and 5, respectively. Figure 4 shows that, as the number of epochs increased, the error for the training data changed rapidly; however, the overall training error decreased. By comparing the results, we concluded that the errors for the validation and training data decreased in the same range. The closeness between the validation and training errors meant that good generalization was achieved. In some cases, it showed that the neural network model needed more training. However, since further training was time-consuming, we continued our research by implementing regularization methods based on this model.

Experiment Results for Each Regularization Method
First, a model where the autoencoder method was applied with the same settings was tested. The results of the training errors are illustrated in Figures 6 and 7. As can be seen from Figure 6, the training errors did not increase as epoch number increased. However, validation errors became higher when the epoch number increased, as illustrated in Figure 7. Through the results, it is clearly seen from the graphs that the model suffered from an underfitting problem and could not learn anything. When we conducted an experiment several times using an autoencoder, the results were not so good. Thus, we could not continue learning. In our analysis, the basic model of stacked encoder was not appropriate. It needed some changes of structure. By comparing the results, we concluded that the errors for validation and training data decreased in the same range. The closeness between the validation and training data errors meant that good generalization was achieved. In some cases, it showed that the neural network model needed more training. However, since it was time-consuming, we continued our research by implementing regularization methods based on this model.

Second, we investigated the experiment results of the DNN model to which the batch normalization method was applied. The results of this technique are given in Figures 8 and 9. As shown in the figures, the results were more acceptable than those obtained with the autoencoder. The training errors decreased initially; however, the overall trend fluctuated constantly after approximately 3000 epochs. The validation errors alternately decreased and increased from the beginning. Even though there was a small decrease around 10,000 epochs, the overall trend increased slightly. Comparing these two graphs made it clear that the DNN model using batch normalization was overfitted within a small range, because the validation errors increased in spite of the constant fluctuation in the training errors.
Third, we discuss the DNN model to which the L1 regularization method was applied. The experiment results are presented in Figures 10 and 11. Figure 10 shows that the training errors diminished smoothly as the epoch number increased. However, the validation errors rose from the beginning of training. This demonstrates that, even though L1 regularization is one of the most popular techniques for preventing overfitting, the model still suffered from an overfitting problem. In reality, a failure occurred during the experiments for all three types of regularization techniques. However, the cause of the failures was undetermined.
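The overfitting diagnosis applied to these curves (validation error rising while training error keeps falling or merely fluctuates) can be sketched in code. The following is a minimal illustration under stated assumptions, not the procedure used in the experiments: the helper names, the `tolerance` parameter, and the sample error curves are all hypothetical.

```python
def best_validation_epoch(val_errors):
    """Index of the lowest validation error (the first one if tied)."""
    return min(range(len(val_errors)), key=lambda i: val_errors[i])


def overfits_after_best(train_errors, val_errors, tolerance=0.0):
    """Return True if, after the best validation epoch, the final validation
    error exceeds its minimum by more than `tolerance` while the final
    training error is no higher than it was at that epoch."""
    best = best_validation_epoch(val_errors)
    val_rose = val_errors[-1] - val_errors[best] > tolerance
    train_not_worse = train_errors[-1] <= train_errors[best]
    return val_rose and train_not_worse


# Hypothetical error curves sampled at successive checkpoints.
train_curve = [5.0, 4.0, 3.5, 3.0, 2.8]
val_curve = [5.2, 4.5, 4.3, 4.6, 4.9]

print("best checkpoint:", best_validation_epoch(val_curve))
print("overfitting:", overfits_after_best(train_curve, val_curve))
```

Comparing only the best and final checkpoints is a deliberately simple criterion; on noisy curves such as those reported for batch normalization, a smoothed curve or a nonzero `tolerance` would likely be needed.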
Fourth, we investigated the experiment results using data augmentation. As previously discussed, we implemented two types of data augmentation. The first scheme, which summed the features, performed better than the preceding regularization methods. As shown in Figures 12 and 13, the training errors declined as the number of epochs increased. The validation error reached its optimal value at 40K epochs and rose slightly afterward. These graphs show that the DNN model with summing-based data augmentation overfitted slightly after 40K epochs.
Fifth, the experiment results using the data augmentation technique based on an average are illustrated in Figures 14 and 15. The figures show that the overfitting issue was completely overcome with this method. As shown in Figure 14, the training errors diminished considerably throughout all the epochs. As shown in Figure 15, the validation errors decreased rapidly until 20K epochs and then remained stable. The validation errors then began to fall slowly from 35K epochs and reached their smallest value at 100K epochs. Notably, the difference between the training and validation errors was not large, which indicates that the model achieved good generalization.
Next, we compared the accuracy of each scheme by evaluating the averaged mean square error (MSE) between the estimated temperature and the observed temperature. Statistically, the MSE is an important metric for evaluating the performance of a predictor. By comparing the values, we could evaluate the precision and accuracy of the predictors. The formula is given below:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $n$ is the number of predictions, $y_i$ is the observed temperature, and $\hat{y}_i$ is the estimated temperature. The statistical results are represented in Figure 16.
As shown in Figure 16, the MSE of the autoencoder scheme was fairly high, whereas the values for the other schemes were not. These results indicate that L1 regularization and the autoencoder still encountered overfitting and underfitting problems, respectively. Batch normalization showed better performance than these methods, as mentioned earlier. The DNN model with data augmentation showed the best performance.
Finally, we compared the actual and predicted average temperatures for all models. The prediction covered ten days, from 2018.03.01 to 2018.03.10. The results are shown in Table 2. As can be seen from the table, the autoencoder showed the worst performance because its predictions differed greatly from the actual values. For data augmentation and batch normalization, the differences were fairly small, and batch normalization outperformed data augmentation on some days. Overall, data augmentation showed the best performance because its predictions were nearly the same as the real temperatures on some days.
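The MSE-based comparison of schemes described in this section can be sketched as follows. This is an illustrative sketch only: the function names, the neutral scheme labels `scheme_a` and `scheme_b`, and all temperature values are hypothetical and do not reproduce the values in Figure 16 or Table 2.

```python
def mean_squared_error(predicted, observed):
    """Average of squared differences between predictions and observations."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def rank_schemes(predictions_by_scheme, observed):
    """Return (scheme, MSE) pairs sorted from lowest to highest error."""
    scores = {
        name: mean_squared_error(predicted, observed)
        for name, predicted in predictions_by_scheme.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical daily average temperatures, for illustration only.
observed_temps = [4.0, 6.0, 5.0]
predicted_temps = {
    "scheme_a": [4.5, 5.5, 5.0],
    "scheme_b": [7.0, 3.0, 8.0],
}

ranking = rank_schemes(predicted_temps, observed_temps)
for name, score in ranking:
    print(f"{name}: MSE={score:.3f}")
```

Sorting by MSE gives a single ranking, but the day-by-day view still matters: a scheme with a higher overall MSE can win on individual days, as was observed for batch normalization.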

Discussion
The study showed that the models using regularization techniques demonstrated better performance than those without regularization in terms of training errors. When the schemes were compared quantitatively, the autoencoder scheme exhibited higher errors than the others. This was because it encountered underfitting due to the lack of data caused by removing some of the training data. This result suggests that the portion of removed data must be decreased for an autoencoder when the training data are insufficient. In addition, L1 regularization and the autoencoder scheme still encountered overfitting and underfitting, respectively. Batch normalization and data augmentation showed better performance than the others in terms of both errors and prediction accuracy. Of the two schemes, data augmentation performed better overall, although batch normalization outperformed it on some days. This was because much more training data was added to the original data instead of being removed.
However, if too much data was used for training, the models required too much time to train, demonstrating a tradeoff between the amount of training data and processing time. In our study, only one CPU was used to train the neural network, so the training time became too long when there was too much data. In future work, it is necessary to analyze how the training time varies and to compare the results using big data. One approach to considerably decreasing the training time is to use a compute unified device architecture (CUDA) GPU, where the experimental data are stored in a distributed manner and processed in parallel on one or more GPUs. However, this scheme requires the installation of proprietary applications and software. It also requires a change in the basic architecture of the experimental software.

Conclusions
The main contribution of this work is to help developers choose the most suitable scheme for their neural network application through comparative research that assesses the training and validation errors of a model under different regularization methods. The existing neural network literature contained no comparison of regularization methods. Our study shows that regularization methods could mitigate overfitting and underfitting problems, but even with some regularization algorithms applied, the neural network models still suffered from the same problems during training. This indicates that the problems are not easy to solve and that a more enhanced solution needs to be devised to solve them completely. One remaining aspect to consider is a comparison of the processing times of each regularization scheme. Stacked autoencoders take longer to finish training and validation, and the reason for this has not yet been analyzed clearly.