Handwritten Digit Recognition: Hyperparameters-Based Analysis

Abstract: Neural networks have several useful applications in machine learning. However, benefiting from the neural-network architecture can be tricky in some instances, due to the large number of parameters that can influence performance. In general, given a particular dataset, a data scientist cannot do much to improve the efficiency of the model. However, by tuning certain hyperparameters, the model's accuracy and time of execution can be improved. Hence, it is of utmost importance to select the optimal values of hyperparameters, which requires experience and mastery of the machine learning paradigm. In this paper, neural network-based architectures are tested by altering the values of hyperparameters for handwritten digit recognition. Various neural network-based models are used to analyze different aspects of the same task, primarily accuracy, based on hyperparameter values. The extensive experimentation setup in this article should, therefore, provide the most accurate and time-efficient solution models. Such an evaluation will help in selecting the optimized values of hyperparameters for similar tasks.


Introduction
For computer scientists, the complexity of recognizing handwriting is becoming increasingly demanding due to significant variation across languages. The problem remains even if the variable of language is kept constant by grouping handwriting from the same language together. Every piece of writing is different, though there are similarities that can be classified together. The task for a data scientist, in this case, is to group the digits involved in writing a given language into categories that indicate related groups of the same numbers or characters. The optimization of this process is of utmost importance, as one wants to recognize handwriting as it is written. The classifiers used in these cases vary in their methods and recognition speeds, along with the accuracy of recognition.
Neural networks (NN) have proved useful in a vast range of fields, classification being one of the most basic. NNs show excellent performance for the classification of images by extracting features from images and making predictions based on those features. Many different types of NNs can be used for classification, e.g., convolutional neural networks (CNN), recurrent neural networks (RNN), etc. Each type has its own architecture, but some properties remain the same: each layer consists of neurons, and each neuron is assigned weights and an activation function. First, we will look at the architecture of each neural network.
Convolutional Neural Network (CNN)
(1) The convolutional layer first takes in the input and then extracts features from it. It conserves pixel relationships by learning from them, using a series of small squares. A convolution is a mathematical operation between the filter and the part of the image under consideration. The filter performs various functions, such as edge detection, blurring and sharpening the image. (2) The pooling layer downsamples the images when they are too large. This helps retain important features while reducing image size. The most crucial pooling type is spatial pooling, also called subsampling or downsampling, which reduces the dimension of the image while retaining its essential features. Pooling is further categorized as follows: (a) max pooling; (b) average pooling; (c) sum pooling. Layers (1) and (2) can be repeated multiple times to help the network learn increasingly abstract features.
(3) The flatten layer flattens all the values obtained from the pooling layer into a single-dimension tensor. The dense, fully connected layers follow this layer. (4) The fully connected layer is a dense feed-forward neural network; these layers are fully connected to the flattened output.
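To make the pooling variants above concrete, the following is a minimal NumPy sketch (an illustration under our own naming, not the paper's implementation) of non-overlapping max, average and sum pooling on a 2D feature map:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows."""
    h, w = feature_map.shape
    # Crop so the dimensions divide evenly by the window size.
    cropped = feature_map[:h - h % size, :w - w % size]
    blocks = cropped.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # max pooling: keep strongest activation
    if mode == "avg":
        return blocks.mean(axis=(1, 3))  # average pooling
    return blocks.sum(axis=(1, 3))       # sum pooling

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = pool2d(x, mode="max")  # a 4x4 map shrinks to 2x2
```

Each mode halves the spatial dimensions here while retaining a summary of every 2 × 2 region, which is exactly the size-reduction role described above.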

Recurrent Neural Network
The term recurrent neural network was first coined by Pearlmutter [4] in 1989, where he explained the learning process of RNNs for state-space trajectories. Recurrent neural networks (RNN) are a class of artificial neural network which, unlike a feedforward neural network, can retain essential features about the past, as they contain a memory unit [5]. A basic feedforward network can also "remember", but only the information acquired during training. An RNN retains training information as well, but it can additionally learn from every input while generating output. An RNN can take one or more inputs and generate one or more outputs. These outputs are affected not only by the input data alone, but also by the hidden state vector, which remembers the outputs generated from earlier inputs. Accordingly, the same input can give a different output depending on the previous inputs in the series.
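The hidden-state mechanism described above reduces to the recurrence h_t = tanh(W_x x_t + W_h h_{t-1} + b). A minimal NumPy sketch (toy sizes and random weights, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
Wx = rng.normal(size=(n_hidden, n_in)) * 0.1      # input-to-hidden weights
Wh = rng.normal(size=(n_hidden, n_hidden)) * 0.1  # hidden-to-hidden: the "memory"
b = np.zeros(n_hidden)

def rnn_forward(xs):
    """Run a sequence through a vanilla RNN cell, returning all hidden states."""
    h = np.zeros(n_hidden)
    states = []
    for x in xs:
        # h carries information from every earlier input in the series,
        # so the same x can yield different outputs at different positions.
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

seq = [rng.normal(size=n_in) for _ in range(5)]
states = rnn_forward(seq)
```

Because h feeds back into the next step, the output at time t depends on the whole prefix of the sequence, which is the property the paragraph above describes.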
The basic architecture of an RNN is shown in Figure 1. In this figure, y1, y2 and y3 are the outputs generated from the x1, x2 and x3 nodes, and h is the hidden state.

Multi Layer Perceptron (MLP)
MLPs are simply neural networks in their classical form: a type of feed-forward neural network. An MLP consists of at least three layers: an input layer, a hidden layer and an output layer. The intermediate layers, stacked upon each other, are called hidden layers [6]. Since a neural network must be nonlinear, activation functions such as ReLU are used to add nonlinearity to the layers [7].
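The three-layer structure with a ReLU nonlinearity can be sketched as a forward pass in NumPy (toy weights, MNIST-sized input; illustrative only):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)  # the nonlinearity added between layers

def mlp_forward(x, W1, b1, W2, b2):
    """Input layer -> one hidden layer (ReLU) -> output layer."""
    hidden = relu(W1 @ x + b1)
    return W2 @ hidden + b2

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(16, 784)) * 0.01, np.zeros(16)  # 784 = 28 x 28 pixels
W2, b2 = rng.normal(size=(10, 16)) * 0.01, np.zeros(10)   # 10 digit classes
logits = mlp_forward(rng.normal(size=784), W1, b1, W2, b2)
```

Without the ReLU, the two matrix multiplications would collapse into a single linear map, which is why the nonlinearity is required.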

Siamese Neural Network
Chopra et al. [8] introduced the idea of Siamese neural networks in 2005. A Siamese NN is an artificial neural network that has two or more identical subnetworks. These subnetworks are assigned the same weights, and any parameter update is mirrored across both networks. Since these subnetworks share parameters, they have only a few parameters to compute and are faster than comparable networks [9]. Siamese NNs are widely used for checking the identity of two inputs, such as documents, photographs and images.
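The weight-sharing idea can be sketched in a few lines: both branches call the same encoder with the same weight matrix, and similarity is the distance between the two embeddings (a toy illustration, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 784)) * 0.01  # one weight matrix shared by both branches

def encode(x):
    """Both subnetworks apply the *same* weights, so any update stays mirrored."""
    return np.tanh(W @ x)

def similarity(x1, x2):
    """L1 distance between embeddings; a small distance suggests the same class."""
    return float(np.abs(encode(x1) - encode(x2)).sum())

a = rng.normal(size=784)
b = a + 1.0  # a perturbed copy, standing in for a second image
```

Because there is literally one `W`, the network has half the parameters of two independent branches, which is the speed advantage noted above.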

Knowledge-Transfer Model
Transfer learning, or knowledge-transfer modeling, is an approach in which previously trained offline models are adapted to new, similar problems. A base network is first trained on a standard dataset. The trained model is then adapted as needed in preparation for a specific task. In most cases, this helps enhance accuracy and reduce the training and execution times of the new models.
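The core mechanic, freezing pretrained feature layers and updating only the new task-specific head, can be sketched as follows (toy NumPy illustration with a made-up gradient, not the paper's training code):

```python
import numpy as np

rng = np.random.default_rng(3)
W_feat = rng.normal(size=(16, 784)) * 0.01  # pretrained feature layers (frozen)
W_head = rng.normal(size=(10, 16)) * 0.01   # new task-specific head (trainable)

def train_step(x, grad_head, lr=0.01):
    """Only the head is updated; the frozen feature weights are reused as-is."""
    global W_head
    features = np.tanh(W_feat @ x)    # forward pass through the frozen layers
    W_head = W_head - lr * grad_head  # gradient step applied to the head only
    return features

before = W_feat.copy()
train_step(rng.normal(size=784), rng.normal(size=(10, 16)))
```

Since only `W_head` changes, each step costs far less than retraining the full network, which is where the reduced training time comes from.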

Hyper Parameters
While defining the structure of a NN, the factors that matter most are the hyperparameters. If neglected and not selected properly, they can cause high error, resulting in wrong predictions [10]. We deal with the following parameters: (1) Learning rate dynamics: the learning rate is a crucial factor in machine learning, determining the extent to which new data may change the model relative to old data; here, the "data" predominantly signify the weights. When the learning rate is considerably high, the model may jump over or escape the minima while converging fast. This causes low accuracy and an ill-fitting model. Conversely, with a very low learning rate, the model takes much time to converge and can get stuck in undesirable local minima. This increases the model's training time and sometimes leads to overfitting.

Multi Layer Perceptron (MLP)
MLPs are nothing, but neural network in classical form. It is a form of the feed-forward neural networks. It consists of at least three layers, an input layer, a hidden layer and an output layer. The layers of neural networks are stacked upon each other; they are called hidden layers [6]. Since a neural network must be nonlinear, activation like ReLU is used to add nonlinearities in the layer [7].

Siamese Neural Network
Chopra et al. [8] gave the idea of Siamese neural networks in 2005. A Siamese NN is part of an artificial neural network which has two or more identical subnetworks. These subnetworks are assigned the same weights and outputs, and any parameter update is mirrored across both networks. Since these subnetworks are similar, they have only a few parameters to compute and are faster than the other networks [9]. Siamese NN can be used for checking the identity of two documents-photographs and images-and are widely used.

Knowledge-Transfer Model
Transfer learning or knowledge-transfer modeling is an approach in which previously trained offline models are augmented to new similar problems. A base network is first trained for a standard dataset. The trained model is then augmented with the latest models needed in preparation for a specific task. In most cases, this helps in enhancing accuracy and reducing the training and execution times of the new models.

Hyper Parameters
While defining the structure of a NN, the factors that matter most are the hyperparameters. If neglected and not selected properly, they can be the cause of high error, resulting in wrong predictions [10]. We deal with the following parameters: (1) Learning rate dynamics: learning rate is a crucial factor in machine learning, determining the extent to which new data may be changed by considering old data. Here the data predominantly signify the weight. When the learning rate is considerably high, it may jump or escape the minima and converge fast. This causes low accuracy and the model to be unfit. Similarly, without the lower learning rate, the model takes much time to converge and get stuck in undesirable local minima. This increases the model's compilation time and sometimes leads to overfitting.
Deep learning generally relies on training using the stochastic gradient descent method. The stochastic gradient descent (SGD) approach updates the parameters after each example, whereas batch gradient descent (BGD) updates the parameters only after processing a whole batch; per-example updates speed up learning when the training set is extensive. This updating of weights is done by the back-propagation algorithm. This optimization procedure has many hyperparameters, and the learning rate is one of them. (2) Momentum dynamics: in the stochastic gradient descent optimization algorithm, weights are updated after every iteration. This is efficient when the dataset is quite large, but sometimes this progression is not that fast. To accelerate the learning process, momentum smooths the progression, thereby reducing training time. We adapt the rate by varying the momentum parameter of SGD while fixing the learning rate, and then check model performance and accuracy. (3) Learning rate (α) decay: as discussed earlier, SGD is an excellent optimization algorithm. In SGD, an important hyperparameter called the decay rate is used. Decay causes the learning rate to be reduced at every iteration, making each learning rate smaller than the previous one. Generally, the decayed learning rate is computed as α_t = α_0 / (1 + d · t), where α_0 is the initial learning rate, d is the decay rate and t is the iteration number. (4) Drop learning rate on plateau: this parameter constantly monitors a metric, and if the metric does not change over a given number of epochs, it drops the learning rate. We explored different values for the "patience", which determines how many epochs to wait while monitoring for change before dropping to a new learning rate. This new learning rate is lower than the previous one. For this experiment, we use a constant learning rate of 0.01 and then drop the learning rate by a factor of 0.1. These values can be tuned for better performance.
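The time-based decay schedule and the drop-on-plateau rule above, with the stated settings (initial rate 0.01, drop factor 0.1), can be sketched as follows (a simplified stand-in, not the exact library callback):

```python
def decayed_lr(alpha0, decay, t):
    """Time-based decay: alpha_t = alpha_0 / (1 + d * t)."""
    return alpha0 / (1.0 + decay * t)

def drop_on_plateau(metric_history, lr=0.01, factor=0.1, patience=3):
    """Drop lr by `factor` if the metric has not improved for `patience` epochs."""
    best, wait = float("inf"), 0
    for m in metric_history:
        if m < best:
            best, wait = m, 0   # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                lr *= factor    # plateau detected: reduce the learning rate
                wait = 0
    return lr

losses = [0.9, 0.5, 0.5, 0.5, 0.5, 0.4]  # four flat epochs trigger one drop
final_lr = drop_on_plateau(losses)
```

With patience = 3, the four stagnant epochs in `losses` trigger exactly one drop, taking the rate from 0.01 to 0.001.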
(5) Adaptive learning rates: We know the learning decay rate is very critical and hard to implement; there is no hard and fast rule to get the best values. To deal with this, previously best-known algorithms are used for the learning rate, as follows: (a) Adaptive gradient algorithm (AdaGrad); (b) Root mean square propagation (RMSprop); (c) Adaptive moment estimation (Adam).
Each employs a different methodology. Every optimization algorithm has its drawbacks and tradeoffs: no algorithm is perfect. We use these algorithms to find the better one for our task and accordingly use it for our metric calculation.
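For reference, the single-parameter update rules of the three adaptive methods listed above can be sketched as follows (textbook forms, simplified to scalars; not the paper's code):

```python
import math

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    """AdaGrad: the per-parameter rate shrinks as squared gradients accumulate."""
    cache += g * g
    return w - lr * g / (math.sqrt(cache) + eps), cache

def rmsprop_step(w, g, cache, lr=0.01, rho=0.9, eps=1e-8):
    """RMSprop: exponential moving average of squared gradients."""
    cache = rho * cache + (1 - rho) * g * g
    return w - lr * g / (math.sqrt(cache) + eps), cache

def adam_step(w, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected moving averages of the gradient and its square."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)   # bias correction for the first moment
    v_hat = v / (1 - b2 ** t)   # bias correction for the second moment
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```

The tradeoff the text mentions is visible here: AdaGrad's accumulated cache monotonically shrinks its steps, while RMSprop and Adam forget old gradients and keep adapting.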
In this paper, we compare the five neural networks described above for the classification of the MNIST dataset [11]. Then we check the alteration in accuracy and execution time produced by changing the hyperparameters described above. The MNIST database was compiled by the National Institute of Standards and Technology. The database comprises 70,000 images of handwritten digits, divided into 60,000 for training and 10,000 for testing. The dimension of each image is 28 × 28 (784 pixels). We used MNIST because it is compact and easy to use. Figure 2 shows a sample of images in the MNIST dataset. The algorithms work at different scales, and no single one completely supersedes the others, so the potential for comparison remains. At the same time, each of these algorithms has its strengths and weaknesses, which makes it relevant to compare them all on a single task and elaborate their suitability in terms of accuracy and speed.

Previous Work
There has been much work in the field of classification, especially for MNIST datasets. Many types of neural networks have been used for this purpose, and sometimes a combination of different algorithms has been used. Many new advanced forms of underlying neural networks were also tested on MNIST data.
Kessab et al. [12] used a multilayer perceptron (MLP) for the recognition of handwritten letters to build an optical character recognition (OCR) system. Koch et al. [13] used Siamese neural networks for one-shot image recognition, i.e., only one image of each class is used for training. Shuai et al. [14] built the independently recurrent neural network (IndRNN), an advanced form of RNN, and tested its performance on MNIST datasets. Chen et al. [15] built a probabilistic Siamese network and tested it on MNIST data. Chen used a CNN for the image classification of MNIST [16].
Niu et al. [17] combined a CNN with a support vector machine (SVM), a different approach to MNIST data classification. Similarly, Agarap et al. [18] combined CNN and SVM for image classification. Zhang et al. [19] combined SVM and k-nearest neighbors (KNN) for image classification. Vlontzos [20] used random neural networks with extreme learning machines (ELM) to build a classifier that does not require large training datasets.
In the current deep-learning era, much work has also been done on hyperparameter tuning. Mostly, hyperparameters are selected through grid search, but there are now many algorithms that can tune them for better results. Loshchilov et al. [21] proposed the covariance matrix adaptation evolution strategy (CMA-ES) for hyperparameter optimization. Smith et al. [22] proposed a disciplined approach to tuning the learning rate, batch size, momentum and weight decay. Domhan et al. [23] extrapolated learning curves to speed up automatic hyperparameter optimization of deep neural networks. Bardenet et al. [24] merged optimization techniques with surrogate-based ranking for surrogate-based collaborative tuning (SCoT). They verified SCoT in two different experiments in which it outperformed standard tuning methods, but their approach still suffers from limited meta-features. Koutsoukas et al. [25] compared hyperparameters for modeling bioactivity data.
As seen from the above references, much work exists in the field of classification, but no work has been done on comparing different neural networks on the basis of changing hyperparameters. This paper provides the following: • results for the efficiency of each neural network (NN) based on different hyperparameters; • a comparison of different NNs based on the hyperparameters selected after testing; • the best model and hyperparameter values.
With the advent of deep learning, there has been more focus on developing techniques for the recognition of handwritten letters and digits using deep neural networks instead of hand-engineered features. The literature on methods to classify and test handwritten data using deep-learning techniques is summarized in Table 1. This section gives a basic overview of work in the field to support our work, whereas the next section describes the methodology used to prove our claims. Our primary contributions are as follows: • a comprehensive evaluation of neural network-based architectures for handwritten digit recognition; • an investigation of the hyperparameters of neural network-based architectures for improving the accuracy of handwritten digit recognition.

The rest of the paper is organized as follows: Section 3 describes the methodology, pipeline and experimental environment used. Section 4 describes and analyzes the results, on the basis of which the optimal hyperparameter values are selected. Finally, Section 5 concludes the study with some suggestions for future work.

Methodology
We used five neural networks for comparison purposes: CNN, RNN, multilayer perceptron, the Siamese model and the knowledge-transfer model. The hyperparameters tuned for each classification algorithm are: learning rate dynamics, momentum dynamics, learning rate decay, drop of the learning rate on plateau and adaptive learning rates. Each hyperparameter was varied to check its effect on the accuracy and execution time of the models. Table 2 shows the different values of hyperparameters taken for testing the models. The dataset used for checking the performances is the MNIST dataset.

Preprocessing of Data
Each image in the MNIST dataset has dimensions of 28 × 28 (784 pixels). All the images are of the same size; hence no resizing was required. The images are split into two sets: x_train and y_train as the training set, and x_test and y_test as the testing set. Hyperparameter values were selected according to the accuracy obtained on the test set. The images are normalized by dividing each pixel value by 255. Since each channel is 8 bits, with values from 0 to 255, this division confines the values to the range 0-1, which makes the computation more precise and efficient, based on the following equation: x_normalized = x / 255.
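The normalization step above amounts to a single division; a minimal sketch with a toy stand-in for the MNIST pixel array:

```python
import numpy as np

def normalize(images):
    """Scale 8-bit pixel values (0-255) into the range [0, 1]."""
    return images.astype("float32") / 255.0

x_train = np.array([[0, 128, 255]], dtype=np.uint8)  # toy stand-in for MNIST pixels
x_norm = normalize(x_train)
```

Casting to float32 before dividing avoids integer division and keeps memory use modest for the full 60,000-image training set.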

Architecture of Models
The models are trained on the training set, and the selection of hyperparameter values is made on the basis of the accuracy obtained by testing the models on the test set. The factors for measuring performance are: (1) test loss; (2) test accuracy; (3) execution time. The architectures of CNN, RNN and MLP are shown in Tables 3-5, respectively. The base model is shown in Table 6; this model is used as the two branches of the Siamese network. Table 7 shows the architecture of the model used for knowledge transfer. The model was first trained on the first five digits (0-4). Then the feature layers were frozen and the model was trained on the remaining five digits (5-9). The number of trainable parameters in the base model is 600,165, while in the knowledge-transfer model the number of trainable parameters is 590,597. Each model is defined in its table according to the parameters that were kept constant throughout the experiment.

Type         Function   Output
Simple RNN   -          50
Dense        -          10
Activation   Softmax    10

In these networks, we changed the values of all the hyperparameters one by one, i.e., we started from the first value of each parameter in Table 2 and changed the value of the learning rate first, while keeping the other parameters constant and measuring performance. Then we fixed the learning rate at the value at which performance was maximal and changed the value of the momentum. This process was repeated for all hyperparameter values and for each model. For each parameter value, the network was observed for 10 epochs or until it converged. After the values of all the parameters were monitored, a conclusion was drawn, as described in the results section.
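The one-at-a-time tuning procedure above can be sketched as follows, where `evaluate` is a hypothetical placeholder standing in for training a model for 10 epochs and returning its test accuracy (not the paper's code):

```python
def tune_one_at_a_time(grid, evaluate):
    """Fix the best value found for each hyperparameter before tuning the next."""
    best = {name: values[0] for name, values in grid.items()}  # start at first values
    for name, values in grid.items():
        scores = {}
        for v in values:
            trial = dict(best, **{name: v})  # vary one parameter, hold others fixed
            scores[v] = evaluate(trial)
        best[name] = max(scores, key=scores.get)  # lock in the best value found
    return best

# Toy evaluate: score peaks at lr=0.1, momentum=0.9 (for illustration only).
def evaluate(params):
    return -abs(params["lr"] - 0.1) - abs(params["momentum"] - 0.9)

grid = {"lr": [1.0, 0.1, 0.01, 0.001], "momentum": [0.0, 0.5, 0.9, 0.99]}
chosen = tune_one_at_a_time(grid, evaluate)
```

This coordinate-wise search is cheaper than a full grid search (it trains len(values) models per parameter rather than the product over all parameters), at the cost of possibly missing interactions between parameters.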

Results
We performed rigorous experiments covering all the hyperparameters. All tests were run on Google Colab with the number of epochs fixed at 10.

Testing on CNN
First, we tested the CNN model. The results are shown in Table 8. We conclude from this table that choosing a learning rate of 0.001 yields the highest accuracy, but with the tradeoff of the highest execution time. Then we changed the momentum for CNN. The results are shown in Table 9. We see that momentum does not provide a noticeable performance boost, so tuning it would be of no use and a waste of computational power and time. After this, we changed the learning rate decay. The results are shown in Table 10. We conclude that choosing a decay rate of 0.001 yields the highest accuracy with the minimum test loss. The effect of changing patience was also observed on CNN. The results are shown in Figure 3. We found that patience does not contribute much, and tuning it would be of no significant value. Finally, different optimizers were used to check which works best for CNN. The results are shown in Table 11. From this table, we conclude that accuracy is highest for the "adagrad" optimizer, with the test loss being the least of all the values.

Testing on RNN
We followed the same routine and tested the RNN by changing the values of our hyperparameters. Tables 12-15 show the effect of changing the learning rate, momentum, decay rate and optimizer, respectively, on the RNN. We see that even with increased momentum, the execution time remains above 540 s, i.e., 9 min and above. The decay rate does affect accuracy, but in most cases it does not significantly affect the execution time. Moreover, there is no change in execution time with a change in learning rate, and the same is true for all the other hyperparameters, including patience, as shown in Figure 4. There is also no significant change in test accuracy from changing the hyperparameters. Consequently, we conclude that in the case of the RNN, tuning the hyperparameters is not worthwhile.

Testing on Multilayer Perceptron
We tested the multilayer perceptron by changing the values of our hyperparameters. Tables 16-19 show the effect of changing the learning rate, momentum, decay rate and optimizer, respectively, on the multilayer perceptron model. From the results, we can see that the optimal learning rate is 0.1, giving the least loss and maximum accuracy. The best test accuracy was achieved with a momentum value of 0.9. A decay rate of 0.0001 yields maximum test accuracy. The patience value shows hardly any effect in boosting the test accuracy, as shown in Figure 5. "Adagrad" outperforms all other optimizers in every criterion.

Testing on Siamese Neural Network (SNN)
We tested the SNN by changing the values of our hyperparameters. Tables 20-23 show the effects of changing the learning rate, momentum, decay rate and optimizer, respectively, on the SNN model. From the results, we can see that a learning rate of 1.0 yields the maximum result; hence using it will be beneficial. A momentum of 0.9 or 0.99 yields maximum accuracy, hence it should be used. Moreover, a decay of 0.0001 yields maximum accuracy, and the "adagrad" optimizer outperforms all optimizers, yielding high accuracy and converging in less execution time.

Testing on Knowledge Transfer Model
In the case of the knowledge transfer model, we have a base model and a transfer model, as shown in Table 7.
For the base model, we observed the following results: • training and execution time: 18.433 s; • test accuracy: 0.996. Figure 6a shows the accuracy and loss curves for the base NN model of the knowledge-transfer model.
In the case of the transferred model, we observed the following results: • training and execution time: 11.73 s; • test accuracy: 0.991. Figure 6b shows the accuracy and loss curves for the transferred NN model of the knowledge-transfer model.
We can clearly see that the knowledge-transfer model outperforms all the models in speed, taking significantly less time. We did not tune the patience, taking it to be 0. The learning rate was 0.01, with no decay rate. We can see from Table 24 that the CNN outperforms all models in accuracy, but its execution time is very high. The knowledge-transfer model offers the best balance, with significantly less time and high accuracy. The results of the NNs in Table 24 were obtained by choosing the best parameter values.

Parameters Influence and Discussion
The hyperparameters show a different impact on the convergence of the final trained model. Despite the selection of optimal hyperparameters, the model trained from the ground truth will not match the ground truth data precisely. The reason is that the dataset is just a proxy for the data distribution. With larger training samples, the trained model parameters will be close enough to the parameters used to generate the data. In general, if the batch size and the learning rate are correctly tuned, the model will quickly converge. In addition, an optimal and proper seeding of the model will help achieve optimization, thus enabling the model to converge to the best minimum out of several minima.
The learning rate plays a significant role in the convergence of the model towards the optimum, as it counterbalances the influence of the cost function. If the chosen learning rate is comparatively large, the algorithm typically does not converge; if it is very low, the algorithm needs considerable time to reach the desired accuracy. The execution times in Tables 8, 12, 16 and 20 show that a lower learning rate takes more execution time. If the learning rate and its update schedule are chosen well, the model should converge to a good set of parameters. From the influence of the learning rate versus the cost function on an algorithm's convergence, we observed that the shape of the cost function determines what a good learning rate is. If the cost function is steep or highly curved, a large learning step will overshoot, and the algorithm will not converge correctly and optimally. Taking small learning steps diminishes the overshoot problem; however, it also drastically slows the learning process and the convergence of the algorithm.
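The overshoot effect described above can be made concrete with a minimal sketch, not taken from the paper's experiments: plain gradient descent on the one-dimensional quadratic f(x) = x², whose gradient is 2x, run with three illustrative learning rates.

```python
# Minimal illustration (our own toy example, not the paper's setup):
# gradient descent on f(x) = x**2, gradient f'(x) = 2*x.
# The update is x <- x - lr * f'(x), i.e. x is scaled by (1 - 2*lr)
# each step, so the run diverges as soon as |1 - 2*lr| > 1.

def gradient_descent(lr, x0=5.0, steps=50):
    """Run `steps` gradient-descent updates and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # f'(x) = 2x
    return x

small = gradient_descent(lr=0.01)  # converges, but slowly
good = gradient_descent(lr=0.4)    # converges quickly
large = gradient_descent(lr=1.1)   # overshoots and diverges
```

With `lr=1.1` the iterate is multiplied by -1.2 each step, so it oscillates with growing magnitude, exactly the non-convergence described above; `lr=0.01` shrinks the iterate by only 2% per step, mirroring the slow-but-safe regime.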
Generally, in the machine learning community, a comparatively large learning rate (between 0.1 and 1) is selected and decayed while training on the data [30]. As such, we observed that the right decay selection is of immense importance. Decay selection involves two non-trivial questions: how much to decay, and how often? Overly aggressive decay settings slow the convergence of the model towards the optimum extremely, while overly slow decay settings result in chaotic updates to the model with only very small improvements. We conclude from the experiments that an optimal decay schedule is of immense importance. Adaptive approaches such as RMSprop, Adam and momentum help adjust the learning rate during convergence and optimization.
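One common answer to the "how much and how often" questions is a step-decay schedule. The sketch below uses illustrative values (`lr0`, `drop`, `epochs_per_drop` are our own, not the paper's settings) to show the shape of such a schedule.

```python
# Hedged sketch of a step-decay learning-rate schedule; the starting
# rate of 0.1 matches the commonly used range cited in the text, while
# the halving factor and interval are illustrative assumptions.

def step_decay(epoch, lr0=0.1, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return lr0 * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 10, 25):
    print(epoch, step_decay(epoch))
```

A smaller `drop` or a shorter `epochs_per_drop` corresponds to the aggressive regime that stalls convergence; a `drop` near 1 corresponds to the slow-decay regime that keeps updates chaotic.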
All the parameters chosen for tuning relate to the learning rate of the model. First, there is an inverse relationship between the learning rate and momentum: a high momentum value can be compensated with a low learning rate to achieve high accuracy, and vice versa. However, a threshold still has to be respected, i.e., very small momentum values cannot be compensated by an ever-increasing learning rate. Tables 8 and 9 show that maximum accuracy was achieved by keeping the learning rate low, but not too low, and the momentum high, but not too high. The same holds for Tables 12 and 13, although for the RNN model the optimal momentum value is smaller than for the CNN, which is expected, as each model has its own optimal parameter values. In Tables 16 and 17, the learning rate has gone up ten times, but the momentum value remains 0.9. For the knowledge-transfer model (Tables 20 and 21), the learning rate is high while the momentum value is still 0.9 for the highest accuracy; this illustrates that no single learning rate, neither too high nor too low, can be declared appropriate for every model. Second, when comparing the learning rate with the decay rate, it is important to notice that a high decay rate can cause the model never to reach the optimum if the learning rate is small, while a low decay rate can cause the model to pass the optimum if the learning rate is high. It can therefore be concluded that a high decay rate should be compensated with a high learning rate and a low decay rate with a low learning rate. This is supported by Tables 8 and 10, Tables 12 and 14, and Tables 16 and 18, where a low learning rate combined with a low decay rate gives the highest accuracy. For Tables 20 and 22 this also holds, if the learning rate used for the knowledge-transfer model is considered low.
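The learning-rate/momentum interplay can be sketched with the classical SGD-with-momentum update in velocity form. The quadratic objective and all numbers below are our own toy choices, not the paper's models or settings.

```python
# Hedged sketch of SGD with momentum on the toy objective f(x) = x**2:
#   v <- momentum * v - lr * grad ;  x <- x + v
# used only to make the "high momentum + low learning rate" pairing
# concrete; it is not the paper's training procedure.

def sgd_momentum(lr, momentum, x0=5.0, steps=100):
    """Return the final iterate after `steps` momentum updates."""
    x, v = x0, 0.0
    for _ in range(steps):
        grad = 2 * x  # f'(x) = 2x
        v = momentum * v - lr * grad
        x += v
    return x

with_momentum = sgd_momentum(lr=0.01, momentum=0.9)
without_momentum = sgd_momentum(lr=0.01, momentum=0.0)
```

With the same low learning rate of 0.01, adding momentum 0.9 brings the iterate much closer to the optimum at 0 in the same number of steps, which is the compensation effect described above.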
Hence, taken altogether, a low learning rate, a high momentum value and a low learning-rate decay give higher accuracy. However, while decreasing the learning rate at a plateau makes the model more accurate, it also lowers the learning rate itself and thereby affects the other parameters linked to it, as explained above. This gives an overview of how the hyperparameters depend on each other, taking the learning rate as the base.
In the end, an adaptive learning rate helps compensate for a poorly configured learning rate. The choice of optimizer does not affect the other parameters much, nor does it depend on their values, although choosing an appropriate optimizer for each neural network model is still important. In terms of execution time, the Adam optimizer has the highest execution time, followed by RMSprop. This holds for all the cases in Tables 11, 19 and 23, except Table 15, where for the RNN model SGD has the highest execution time. When dealing with execution time, it is also important to mention the effect of batch size: a large batch size yields a small number of batches for training, which in turn reduces training time. There is not much difference in final accuracy, but a higher batch size takes fewer iterations to reach it, whereas a lower batch size takes more [22].
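The batch-size effect on training time is simple arithmetic, sketched below with the 60,000-image MNIST training set as an illustrative dataset size (the paper's exact data splits are not assumed).

```python
# Back-of-envelope sketch: for a fixed training-set size, a larger
# batch size means fewer gradient updates per epoch, which is the
# mechanism behind the reduced training time noted in the text.

def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates in one pass over the data (ceiling division)."""
    return -(-n_samples // batch_size)

print(updates_per_epoch(60_000, 32))   # 1875 updates per epoch
print(updates_per_epoch(60_000, 256))  # 235 updates per epoch
```

Going from batch size 32 to 256 cuts the updates per epoch by roughly a factor of eight, so each epoch finishes sooner even though the per-update cost grows somewhat.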

Conclusions and Future Work
There are many neural network models, each unique in its own way. It has been shown that, when comparing different models on a specific dataset and set of hyperparameter values, the knowledge-transfer model outperforms all other models. It is true that CNN also gives great results, but at the cost of a high execution time. This research is useful in the sense that, while building a specific classification model, one can select the hyperparameter values that performed better according to the evidence presented.
No work is complete, and there is always room for improvement. In our opinion, one could apply an algorithm for automatic hyperparameter tuning to test the performance of the model. This may lead to better hyperparameter value selection and hence better classification results.