Handwritten Devanagari Character Recognition Using Layer-Wise Training of Deep Convolutional Neural Networks and Adaptive Gradient Methods

Abstract: Handwritten character recognition is currently attracting the attention of researchers because of possible applications in assistive technology for blind and visually impaired users, human–robot interaction, automatic data entry for business documents, etc. In this work, we propose a technique to recognize handwritten Devanagari characters using deep convolutional neural networks (DCNN), one of the recent techniques adopted from the deep learning community. We experimented on the ISIDCHAR database provided by the Indian Statistical Institute (ISI), Kolkata, and on the V2DMDCHAR database with six different DCNN architectures to evaluate performance, and we also investigated the use of six recently developed adaptive gradient methods. A layer-wise training technique for DCNN was employed that helped to achieve the highest recognition accuracy and a faster convergence rate. The results of the layer-wise-trained DCNN compare favorably with those achieved by a shallow technique based on handcrafted features and by a standard DCNN.


Introduction
In the last few years, deep learning approaches [1] have been successfully applied to various areas such as image classification, speech recognition, cancer cell detection, video search, face detection, satellite imagery, traffic sign recognition and pedestrian detection. The outcomes of deep learning approaches are also prominent, and in some cases the results have been superior to those of human experts [2,3]. Many problems are also being re-examined with deep learning approaches with a view to improving on existing findings. Different deep learning architectures have been introduced in recent years, such as deep convolutional neural networks, deep belief networks, and recurrent neural networks. All of these architectures have shown proficiency in different areas. Character recognition is one of the areas where machine learning techniques have been extensively tested. The first deep learning approach, which is one of the leading machine learning techniques, was proposed for character recognition in 1998 on the MNIST database [4]. Deep learning techniques are basically composed of multiple hidden layers, and each hidden layer consists of multiple neurons, which compute suitable weights for the deep network. A lot of computing power is needed to compute these weights, and a powerful system was required, which was not easily available at that time. Since then, researchers have turned their attention to techniques that need less computing power by converting images into feature vectors. In the last few decades, many feature extraction techniques have been proposed, such as HOG (histogram of oriented gradients) [5], SIFT (scale-invariant feature transform) [6,7], LBP (local binary pattern) [8] and SURF (speeded up robust features) [9]. These are prominent feature extraction methods, which have been tested on many problems such as image recognition, character recognition and face detection, and the corresponding models are called shallow learning models, which are still popular for pattern recognition.
Feature extraction [10] is a type of dimensionality reduction technique that represents the important parts of a large image as a feature vector. These features are handcrafted and explicitly designed by the research community. The robustness and performance of these features depend on the skill and knowledge of each researcher. There are cases where some vital features go unnoticed by researchers while extracting features from the image, and this may result in a high classification error.
Deep learning replaces the process of handcrafting and designing features for a particular problem with an automatic process that computes the best features for that problem. A deep convolutional neural network has multiple convolutional layers to extract the features automatically. In most shallow learning models the features are extracted only once, whereas deep learning models use multiple convolutional layers to extract discriminating features multiple times. This is one of the reasons that deep learning models are generally successful. LeNet [4] is an example of a deep convolutional neural network for character recognition. More recent examples of deep learning models include AlexNet [3], ZFNet [11], VGGNet [12] and spatial transformer networks [13]. These models have been successfully applied to image classification and character recognition. Owing to their great success, many leading companies have also introduced deep models. Google introduced GoogLeNet, which has 22 layers, with convolutional and pooling layers arranged alternately. Apart from this model, Google has also developed an open source software library named TensorFlow to conduct deep learning research. Microsoft introduced its own deep convolutional neural network architecture, named ResNet, in 2015. ResNet has a 152-layer network architecture, which set a new record in detection, localization, and classification. This model introduced the idea of residual learning, which makes the optimization and the back-propagation process easier than in the basic DCNN model.
Character recognition is a field of image processing where the image is recognized and converted into a machine-readable format. As discussed above, the deep learning approach, and especially deep convolutional neural networks, have been used for image detection and recognition. It has also been successfully applied to Roman (MNIST) [4], Chinese [14], Bangla [15] and Arabic [16] scripts. In this work, a deep convolutional neural network is applied to handwritten Devanagari character recognition.
The main contributions of our work can be summarized in the following points:
1. This work is the first to apply the deep learning approach to the database created by ISI, Kolkata. The main contribution is a rigorous evaluation of various DCNN models.
2. Deep learning is a rapidly developing field, which is bringing new techniques that can significantly improve the performance of DCNNs. Since these techniques have been published only in the last few years, their cross-domain utility is still being validated. We explored the role of adaptive gradient methods in deep convolutional neural network models and showed the resulting variation in recognition accuracy.
3. The proposed handwritten Devanagari character recognition system achieves a high classification accuracy, surpassing the recognition accuracy of existing approaches in the literature.
4. A layer-wise training technique for DCNN is proposed to achieve the highest recognition accuracy and a faster convergence rate.
The remainder of this paper is organized as follows. Section 2 discusses previous work in handwritten Devanagari character recognition, Section 3 introduces deep convolutional neural networks and adaptive gradient methods, Section 4 outlines the experiments and discussions and, finally, Section 5 concludes the paper.

Previous Work
Devanagari handwritten character recognition has been investigated with different feature extraction methods and different classifiers. Researchers have used structural, statistical and topological features. Neural networks, KNN (K-nearest neighbors), and SVM (support vector machine) are primarily used for classification. The first research work was published by I. K. Sethi and B. Chatterjee [17] in 1976. The authors recognized handwritten Devanagari numerals by a structural approach that found the existence and the positions of horizontal and vertical line segments, D-curve, C-curve, left slant and right slant. A directional chain code based feature extraction technique was used by N. Sharma [18]. The bounding box of a character sample was divided into blocks, 64-D direction chain code features were computed from each divided block, and then a quadratic classifier was applied for the recognition of 11,270 samples. The authors reported an accuracy of 80.36% for handwritten Devanagari characters. Deshpande et al. [19] used the same chain code features with a regular expression to generate an encoded string from characters and improved the recognition accuracy by 1.74%. A two-stage classification approach for handwritten characters was reported by S. Arora [20], who used structural properties of characters such as the shirorekha and spine in the first stage and intersection features in the second stage. These features were then fed into a neural network for classification. She also defined a method for finding the shirorekha properly. This approach was tested on 50,000 samples and obtained 89.12% accuracy. In [21], S. Arora combined different features such as chain codes, four side views, and shadow based features. These features were fed into a multilayer perceptron neural network to recognize 1500 handwritten Devanagari characters, obtaining 89.58% accuracy.
A fuzzy model-based recognition approach was reported by M. Hanmandlu [22]. The features were extracted by the box approach, which divided the character into 24 cells (6 × 4 grid), and a normalized vector distance was computed for each box except the empty cells. A reuse policy was also used to speed up learning on 4750 samples, and 90.65% accuracy was obtained. The work presented in [23] computed shadow features and chain code features and classified 7154 samples of handwritten Devanagari characters using two multilayer perceptrons and a minimum edit distance method. They reported 90.74% accuracy. Kumar [24] tested five different features, namely Kirsch directional edges, chain code, directional distance distribution, gradient, and distance transform, on 25,000 handwritten Devanagari characters and reported 94.1% accuracy. During the experiment, he found that the gradient feature outperformed the remaining four features with the SVM classifier, and the Kirsch directional edges feature was the weakest performer. A new kind of feature was also created that computed the total distance in four directions after computing the gradient map and the neighborhood pixels' weight from the binary image of the sample. In [25], Pal applied the mean filter four times before extracting direction gradient features, which were then reduced using a Gaussian filter. They used a modified quadratic classifier on 36,172 samples and reported 94.24% accuracy using a cross-validation policy. Pal [26] further extended this work with SVM and MIL classifiers on the same database and obtained 95.13% and 95.19% recognition accuracy, respectively.
Despite the high recognition rates achieved by existing methods, there is still room for improvement in handwritten Devanagari character recognition.

Deep Convolutional Neural Networks (DCNN)
The deep convolutional neural network can be broadly divided into two major parts, as shown in Figure 1: the first part contains a sequence of alternating convolutional and max-pooling layers, and the second part contains a sequence of fully connected layers. An object can be recognized by its features, which depend directly on the distribution of color intensity in the image. Filters such as the Gaussian and Gabor filters are used to record these color intensity distributions. The kernel values of these filters are predefined, and they record only a specific distribution of color intensity. The kernel values do not change in response to the applied model. In a DCNN, however, the kernel values are updated according to the response of the model, which helps to find the best kernel values for the model. The alternating convolutional and max-pooling layers do this job. The other part of the DCNN consists of fully connected layers, each of which contains multiple neurons, as in a simple neural network; they receive high-level features from the preceding convolutional-pooling layer and compute the weights needed to classify the object properly.

DCNN Notation
The deep convolutional neural network is a neural network specially designed for image processing. Most color images are represented in three dimensions, h × w × c, where h is the height, w is the width and c is the number of channels of the image. However, the DCNN can only take an image whose height and width are equal. So, before the image is fed to the DCNN, a normalization process converts it from size h × w × c to size m × m × c, where m is the height and width of the normalized image. The DCNN directly takes the three-dimensional normalized image/matrix X as input and supplies it to a convolutional layer that has k kernels of size n × n × p, where n < m and p ≤ c. The convolutional layer multiplies the neighbors of each element of X by the weights provided by the kernel to generate k different feature maps of size l × l, where l = m − n + 1. The convolutional layer is often followed by an activation function. The rectified linear unit (Relu), f(x) = max(0, x), was selected as the activation function:

$$Y^{k} = f\left(W^{k} * X + B^{k}\right),$$

where k denotes the feature map, Y^k is a map of size l × l, W^k is a kernel weight of size n × n, B^k represents the bias value and * represents the 2D convolution.
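To make the feature-map size l = m − n + 1 concrete, the following is a minimal NumPy sketch of one single-channel convolution followed by Relu. The 32 × 32 input, the 5 × 5 kernel, the bias value and the function names are illustrative assumptions, not values from this paper; like most deep learning libraries, the sketch slides the kernel without flipping it.

```python
import numpy as np

def conv2d_valid(x, kernel, bias=0.0):
    """Slide an n x n kernel over an m x m image with no padding and stride 1."""
    m = x.shape[0]
    n = kernel.shape[0]
    l = m - n + 1                      # side length of the output feature map
    out = np.empty((l, l))
    for r in range(l):
        for c in range(l):
            # Weighted sum of the n x n neighborhood, plus the bias.
            out[r, c] = np.sum(x[r:r + n, c:c + n] * kernel) + bias
    return out

def relu(z):
    """Rectified linear unit: f(z) = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
image = rng.random((32, 32))           # assumed m = 32, one channel
kernel = rng.standard_normal((5, 5))   # assumed n = 5, random stand-in for learned weights
feature_map = relu(conv2d_valid(image, kernel, bias=0.1))
print(feature_map.shape)               # (28, 28), i.e. l = 32 - 5 + 1
```

In a trained DCNN the kernel values would be learned by back-propagation rather than drawn at random.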
The next pooling layer reduces the feature maps by applying a mean, max or min operation over a pl × pl local region of the feature map, where pl generally varies from 2 to 5. DCNNs have multiple consecutive convolutional layers, each followed by a pooling layer, and each convolutional layer introduces many unknown weights. The back-propagation algorithm, one of the well-known techniques used in simple neural networks to find weights automatically, is used to find the unknown weights during the training phase. Back-propagation updates the weights to minimize a loss or error J(w) with an iterative gradient descent process that can be expressed as

$$\nu_{t+1} = \mu\,\nu_t - \alpha\,\nabla J(w_t), \qquad w_{t+1} = w_t + \nu_{t+1}.$$

By updating the weights, back-propagation follows the direction in which the cost function gives the minimum loss or error. The value α, called the learning rate, determines the step size, i.e., the change applied to the previous weight. Back-propagation can sometimes get stuck at a local minimum, which can be overcome by the momentum µ, which accumulates a velocity vector ν in the direction of continuous reduction of the loss function. The error or loss of a network can be computed by various functions. The sum of squares function used to calculate the loss or error can be expressed as

$$J(w) = \sum_{i} (t_i - y_i)^2 + \lambda \sum_{j} w_j^2,$$

where t_i and y_i are the target and observed output values. An L2 regularization with weight λ was applied during the computation of the loss to prevent the parameters from growing large during the minimization process.
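The momentum update and the L2-regularized sum-of-squares loss described above can be sketched on a toy linear model. This is an illustration under stated assumptions, not the training code of this paper: the linear model, the synthetic data, the function names and the hyper-parameter values (alpha, mu, lam) are all chosen only to keep the example small.

```python
import numpy as np

def loss_and_grad(w, X, t, lam):
    """Sum-of-squares loss with an L2 penalty, and its gradient, for y = X @ w."""
    residual = X @ w - t
    loss = np.sum(residual ** 2) + lam * np.sum(w ** 2)
    grad = 2.0 * X.T @ residual + 2.0 * lam * w
    return loss, grad

def sgd_momentum(X, t, alpha=0.001, mu=0.9, lam=0.01, steps=500):
    """Gradient descent with a velocity vector v accumulated by momentum mu."""
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(w, X, t, lam)
        history.append(loss)
        v = mu * v - alpha * grad      # accumulate velocity
        w = w + v                      # move the weights along the velocity
    return w, history

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))      # synthetic inputs (assumption)
true_w = np.array([1.0, -2.0, 0.5])    # synthetic target weights (assumption)
t = X @ true_w
w, history = sgd_momentum(X, t)
print(history[0] > history[-1])        # the loss decreases over training
```

With a small λ the recovered weights stay close to `true_w`; a larger λ pulls them toward zero, which is the intended effect of the regularization term.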
The complete DCNN involves multiple layers of different types: convolutional, pooling, Relu, fully connected and Softmax. Each layer type has its own specification within a particular network. In this paper, we use a special convention to express the network of a DCNN.

• xINy: An input layer where x represents the width and height of the image and y represents the number of channels.

Different Adaptive Gradient Methods
Neural network training updates the weights in each iteration, and the final goal of training is to find the weights that give the minimum loss or error. One of the important parameters of a deep neural network is the learning rate, which decides the change in the weights. Selecting a value for the learning rate is a challenging task: if the learning rate is too low, the optimization can be very slow and the network will take a long time to reach the minimum loss or error; if the learning rate is too high, the optimization can diverge and the network will not reach the minimum loss or error. This problem can be addressed by adaptive gradient methods, which help in faster training and better convergence. The Adagrad [27] (adaptive gradient) algorithm was introduced by Duchi in 2011. It automatically applies small updates to frequently occurring features and large updates to infrequently occurring features. This method improves convergence performance compared with standard stochastic gradient descent for sparse data. It can be expressed as

$$Av_t = Av_{t-1} + \left(\nabla Q(w_t)\right)^2, \qquad w_{t+1} = w_t - \frac{\alpha}{\sqrt{Av_t + \epsilon}}\,\nabla Q(w_t),$$

where Av_t is the accumulated adjustment of the previous gradients and ε is used to avoid divide-by-zero problems. The Adagrad method divides the learning rate by the square root of the sum of the squared gradients, which eventually produces a very small learning rate. This problem is addressed by the Adadelta method [28], which accumulates only a few past gradients instead of all past gradients. The running average used by Adadelta can be expressed as

$$E[Av^2]_t = \gamma\,E[Av^2]_{t-1} + (1 - \gamma)\left(\nabla Q(w_t)\right)^2,$$

where E[Av^2] represents the average of the past squared gradients. It depends on the current gradient and the previous average of the gradient. The problem of Adagrad is also addressed by Hinton [29] with the technique called RMSProp, which was designed for stochastic gradient descent. RMSProp is an updated version of Rprop, which did not work with mini-batches. Rprop is the same as using the gradient, but it also divides by the size of the gradient. RMSProp keeps a moving average of the squared gradient for each weight and divides the gradient by the square root of this mean square value. The moving average of the squared gradient is given by

$$Av_t = \gamma\,Av_{t-1} + (1 - \gamma)\left(\nabla Q(w)\right)^2,$$

where γ is the forgetting factor, ∇Q(w) is the derivative of the error and Av_{t−1} is the previous adjustment value. The weights are updated as per the following equation:

$$w_{t+1} = w - \frac{\alpha}{\sqrt{Av_t}}\,\nabla Q(w),$$

where w is the previous weight, w_{t+1} is the updated weight and α is the global learning rate. Adam (adaptive moment estimation) [30] is another optimizer for DCNN that requires only first-order gradients and little memory, and it computes adaptive learning rates for different parameters. This method has been reported to perform better than the RMSProp and Rprop optimizers. The magnitudes of its parameter updates are invariant to rescaling of the gradient. Adam does not need a stationary objective and works with sparse gradients. In addition to the decaying average of past squared gradients V_t, it also keeps a decaying average of past gradients M_t:

$$M_t = B_1 M_{t-1} + (1 - B_1)\,\nabla Q(w), \qquad V_t = B_2 V_{t-1} + (1 - B_2)\left(\nabla Q(w)\right)^2,$$

where M_t and V_t are estimates of the first and second moments of the gradients. These values are biased towards zero when the decay rates are small, and therefore bias-corrected first and second moment estimates are computed and used in the weight update:

$$\hat{M}_t = \frac{M_t}{1 - B_1^{\,t}}, \qquad \hat{V}_t = \frac{V_t}{1 - B_2^{\,t}}, \qquad w_{t+1} = w_t - \frac{\alpha}{\sqrt{\hat{V}_t} + \epsilon}\,\hat{M}_t.$$

As per the authors of Adam, the default values of B_1 and B_2 were fixed empirically at 0.9 and 0.999. They showed that it works well in practice and compares favorably with other adaptive learning methods. Adamax is an extension of Adam in which the L_2 norm-based update rule is generalized to an L_p norm-based rule; letting p tend to infinity gives the Adamax update.
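As a rough illustration of how these update rules differ, the sketch below applies Adagrad, RMSProp and Adam to a two-dimensional quadratic objective Q(w). The toy objective, the starting point, the step counts and the learning rates are assumptions made for this example only; the experiments in this paper relied on the optimizers of a deep learning library with their default parameters.

```python
import numpy as np

SCALES = np.array([1.0, 10.0])          # curvature of the toy objective (assumption)

def q(w):
    """Toy objective Q(w) = 0.5 * sum(SCALES * w^2), minimized at w = 0."""
    return 0.5 * np.sum(SCALES * w ** 2)

def grad_q(w):
    return SCALES * w

def adagrad(w, steps, alpha=0.5, eps=1e-8):
    av = np.zeros_like(w)
    for _ in range(steps):
        g = grad_q(w)
        av = av + g ** 2                             # sum of all past squared gradients
        w = w - alpha / np.sqrt(av + eps) * g
    return w

def rmsprop(w, steps, alpha=0.01, gamma=0.9, eps=1e-8):
    av = np.zeros_like(w)
    for _ in range(steps):
        g = grad_q(w)
        av = gamma * av + (1.0 - gamma) * g ** 2     # moving average of squared gradients
        w = w - alpha / (np.sqrt(av) + eps) * g
    return w

def adam(w, steps, alpha=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_q(w)
        m = b1 * m + (1.0 - b1) * g                  # first moment estimate
        v = b2 * v + (1.0 - b2) * g ** 2             # second moment estimate
        m_hat = m / (1.0 - b1 ** t)                  # bias correction
        v_hat = v / (1.0 - b2 ** t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, 1.0])
results = {name: fn(w0.copy(), 500) for name, fn in
           [("Adagrad", adagrad), ("RMSProp", rmsprop), ("Adam", adam)]}
for name, w in results.items():
    print(name, q(w) < q(w0))            # each method lowers the objective
```

Note that all three methods take steps of similar size along both coordinates even though the gradients differ by a factor of ten, which is the per-parameter scaling the text describes.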

Layerwise Training DCNN Model
The goal of training is to find the weights at which the deep neural network produces high accuracy, or a very small error rate. The outcome of any deep neural network model depends on how the model was trained and on the number of layers. Usually, the model is created with a certain number of layers, and all layers are involved in the training phase from the start. In this work, we propose a layer-wise training model of DCNN, instead of involving all layers at once during the training phase, to recognize handwritten Devanagari characters. The layer-wise training model starts by adding one pair of convolutional and pooling layers, followed by the fully connected layer, and applies the back-propagation algorithm to find the weights. In the next phase of the layer-wise training model, the next pair of convolutional and pooling layers is added, and the back-propagation algorithm is applied, starting from the previously found weights, to calculate the weights for the added layer.
After all layers have been added, fine-tuning is performed on the complete network to adjust all the weights of the network with a very low learning rate. The back-propagation algorithm normally starts with random weights and refines them in each epoch during training. The layer-wise training model provides good rough initial weights, because the network starts with the first layers and then adds the remaining layers to find their weights. The layer-wise training model is shown in Figure 2: the training starts with only one pair of convolutional and pooling layers, and further pairs are then added. Algorithm 1 shows the stepwise procedure to create the layer-wise DCNN model.
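The stage-by-stage procedure can be outlined as a schedule. The sketch below only enumerates which layers are present at each stage and how many convolutional-pooling pairs are initialized from the previous stage; it does not train anything. The layer tokens, the three-pair depth and the function name are hypothetical and are not taken from Table 1 or Algorithm 1.

```python
def layerwise_stages(cp_pairs, head):
    """List the training stages of a layer-wise schedule.

    cp_pairs: list of convolutional-pooling pairs, each a list of layer tokens.
    head:     fully connected and output layers appended at every stage.
    """
    stages = []
    for depth in range(1, len(cp_pairs) + 1):
        layers = [layer for pair in cp_pairs[:depth] for layer in pair] + head
        stages.append({
            "stage": depth,
            "layers": layers,
            "reused_pairs": depth - 1,   # pairs initialized from the previous stage
            "fine_tune": False,
        })
    # Final stage: the complete network is fine-tuned at a very low learning rate.
    stages.append({
        "stage": len(cp_pairs) + 1,
        "layers": stages[-1]["layers"],
        "reused_pairs": len(cp_pairs),
        "fine_tune": True,
    })
    return stages

# Hypothetical three-pair network written in the xCy / xPy / xFC / xOU convention.
pairs = [["32C5", "Relu", "2P2"], ["64C3", "Relu", "2P2"], ["128C3", "Relu", "2P2"]]
head = ["1000FC", "47OU"]
for stage in layerwise_stages(pairs, head):
    print(stage["stage"], stage["reused_pairs"], stage["fine_tune"], stage["layers"])
```

In a real implementation, each stage would build the listed network, copy the weights of the reused pairs from the previous stage, and run back-propagation before moving on.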

Experiments and Discussions
Experiments were carried out on two databases, ISIDCHAR and V2DMDCHAR, using the DCNN, the layer-wise DCNN and different adaptive gradient methods. As it is hard to determine in advance the number of DCNN layers that will produce the best result, we considered six different network architectures (NA) of DCNN, as shown in Table 1. NA-1 contains only a single convolutional-pooling (C-P) layer and 500 fully connected neurons, to observe the first response of the DCNN. Next, NA-2 has double the number of fully connected neurons, with the aim of observing the impact of this enlargement. Further, NA-3 and NA-4 have two C-P layers, with variation in the number of kernels, to analyze the impact of two C-P layers. Finally, NA-5 and NA-6 have three C-P layers.
Initially, the different DCNN network architectures were applied to each database to find the best model for that particular database, and then the proposed layer-wise DCNN was applied to observe its impact on that model. The models were also tested with different adaptive gradient methods, since these methods are themselves still under evaluation, to observe their performance. Our work also shows the impact of the different adaptive gradient methods on recognition accuracy.

Network Model Architectures
The experiments were all executed on the ParamShavak supercomputer system, which has two multicore CPUs, each with 12 cores, along with two accelerator cards. This system has 64 GB RAM and runs the CentOS 6.5 operating system. The deep neural network model was coded in Python using Keras, a high-level neural network API that uses the Theano Python library. The basic pre-processing tasks, such as background elimination, gray-normalization and image resizing, were done in Matlab.

ISIDCHAR and V2DMDCHAR Databases
ISIDCHAR [26] was prepared by researchers of the Indian Statistical Institute, Kolkata. They collected the samples from persons of different age groups to accommodate the maximum variation of written characters. In addition, samples were collected from filled-in job forms and postcards, which makes this database realistic. This database consists of 36,172 grayscale images of 47 different Devanagari characters. Because the samples were gathered from many writers, this database offers a variety of samples in each class, and the background of the samples is also highly non-uniform. V2DMDCHAR [31] was prepared by Vikas J. Dongre and Vijay H. Mankar in 2012. This database has 20,305 samples of handwritten Devanagari characters.

Experimental Setup
The experiments were performed to investigate the effects of different network architectures, optimizers, and layer-wise training. The first phase of experiments was performed to find the best network architecture for the database, and then the best-observed network architecture was tested with six different optimizers to find the best optimizer. A total of 12 (6 + 6) different experiments were performed on the database. The second phase of experiments aimed to observe the effect of layer-wise training. The layer-wise training was performed only with the best network architecture and the best optimizer selected in the first phase.
Each optimizer had its own set of parameters. In our experiments, the optimizer parameters were kept at their default values or as suggested by their authors. The rectified linear activation function was used for all experiments to mitigate the vanishing gradient problem. The sum of squares of the difference between target and observed values was calculated to estimate the loss of the deep network. Each network was trained for 100 epochs using mini-batches of size 200.
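To illustrate what training with mini-batches of size 200 means in terms of weight updates per epoch, here is a small sketch that partitions sample indices into consecutive batches. The training-set size of 1000 and the function name are hypothetical; in practice the samples would usually be shuffled before each epoch.

```python
def minibatch_slices(n_samples, batch_size=200):
    """Return (start, stop) index pairs that cover n_samples in consecutive batches."""
    return [(start, min(start + batch_size, n_samples))
            for start in range(0, n_samples, batch_size)]

batches = minibatch_slices(1000)       # 1000 is a hypothetical training-set size
print(len(batches))                    # 5 weight updates per epoch
print(batches[-1])                     # (800, 1000)
```

When the number of samples is not a multiple of the batch size, the last batch is simply smaller.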

Results
The first phase of experiments was performed on ISIDCHAR to find the best deep network architecture. We recorded the recognition accuracy of each network architecture, using the Adam optimizer, at each of the 50 epochs. The results, in terms of the maximum, minimum, mean, and standard deviation of the recognition accuracy, are reported in Table 2.
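The four summary values can be computed from a per-epoch accuracy curve as sketched below. The accuracy values are hypothetical placeholders, not results from this paper, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not state which definition was used.

```python
import numpy as np

def summarize_accuracy(per_epoch_accuracy):
    """Maximum, minimum, mean and standard deviation of a per-epoch accuracy curve."""
    acc = np.asarray(per_epoch_accuracy, dtype=float)
    return {"max": acc.max(), "min": acc.min(),
            "mean": acc.mean(), "std": acc.std()}

hypothetical_curve = [80.0, 90.0, 94.0, 96.0]   # placeholder accuracies in percent
summary = summarize_accuracy(hypothetical_curve)
print(summary["max"], summary["min"], summary["mean"])   # 96.0 80.0 90.0
```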
The best recognition accuracy was obtained with the network architecture NA-6, and the lowest recognition accuracy was obtained with the network architecture NA-1. Figure 3 shows the recognition accuracy obtained at each epoch. The network NA-1 produced 85% recognition accuracy because it has only one convolutional layer. The networks NA-3 and NA-5 produced higher recognition accuracies of 91.53% and 93.24%, respectively, because these networks have more convolutional layers. This improvement indicates that adding convolutional layers to a deep convolutional neural network produces better results. In our experiments, we also observed an improvement in recognition accuracy when the number of kernels in the convolutional layers was increased. The network architectures NA-2, NA-4 and NA-6 had more kernels than NA-1, NA-3 and NA-5, and they produced higher recognition accuracy, as observed in Table 2. The number of trainable parameters for each network architecture is shown in Table 3. All network architectures were also tested using the RMSProp optimizer, and the results are reported in Table 4. The NA-6 network produced 96.02% recognition accuracy with RMSProp, compared with 95.58% with Adam. The behavior of NA-6 with RMSProp at each epoch can be seen in Figure 4.
The best recognition accuracy on the ISIDCHAR database was obtained with the NA-6 network architecture and the RMSProp optimizer. However, it is possible that this network could perform better with other optimizers. To investigate further, we performed experiments with six different optimizers. Table 5 shows the recognition accuracy obtained with NA-6 for the different optimizers. The highest recognition accuracy, 96.02%, was recorded with NA-6 and the RMSProp optimizer. The Adam optimizer outperformed the SGD and Adagrad optimizers, while the AdaDelta, AdaMax, and RMSProp optimizers outperformed the Adam optimizer. Figure 5 shows the performance of each optimizer.
We found that the NA-6 network architecture with the RMSProp optimizer produced the highest recognition accuracy. This network was then trained again with the layer-wise model described in Section 3.3.
This network was tested with the ISIDCHAR, V2DMDCHAR, and combined databases. The results are reported in Table 6. The layer-wise training model produced a clear improvement in recognition accuracy: 97.30% was obtained on the ISIDCHAR database and 97.65% on the V2DMDCHAR database. The layer-wise training model was also applied after combining both databases and obtained 98% recognition accuracy when 70% of the samples were used for training and the rest were used for testing. The current work is compared with previous works on the ISIDCHAR database in Table 7.
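A 70%/30% division of samples into training and test sets can be sketched as a random split of indices. The random shuffling, the seed, the sample count and the function name are assumptions for illustration; the text does not describe how the samples were actually assigned to the two sets.

```python
import numpy as np

def train_test_split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

train_idx, test_idx = train_test_split_indices(1000)   # hypothetical sample count
print(len(train_idx), len(test_idx))                    # 700 300
```

A stratified split, which keeps the class proportions equal in both parts, would be a natural refinement for a 47-class problem.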

Conclusions
Deep learning is one of the prominent technologies that have been experimentally studied in all major areas of computer vision and document analysis. In this paper, we experimentally evaluated deep convolutional neural networks (DCNN) and adaptive gradient methods for recognizing unconstrained handwritten Devanagari characters. The deep convolutional neural network helped us to find the best features automatically and also to classify them. We experimented on a handwritten Devanagari character database with six different DCNN network architectures as well as six different optimizers. The highest recognition accuracy, 96.02%, was obtained using the NA-6 network architecture and RMSProp, an adaptive gradient method (optimizer). Further, we trained the DCNN again layer-wise, an approach also adopted by many researchers to enhance recognition accuracy, using the NA-6 network architecture and the RMSProp adaptive gradient method. Using the layer-wise DCNN training model, we obtained 98% recognition accuracy, which is the highest recognition accuracy on the database.
layers of convolutional and pooling layers alternatively. Apart from this model, Google has also developed an open-source software library named TensorFlow to conduct deep learning research. Microsoft also introduced its own deep convolutional neural network architecture, named ResNet, in 2015. ResNet has a 152-layer network architecture, which set a new record in detection, localization, and classification. This model introduced the new idea of residual learning, which makes the optimization and back-propagation process easier than in the basic DCNN model.

Figure 3. Recognition accuracy obtained with different network architectures on the ISIDCHAR database at each epoch. The Adam optimizer was used.

Figure 4. Recognition accuracy obtained with different network architectures on the ISIDCHAR database at each epoch. The RMSProp optimizer was used.
• xCy: A convolutional layer, where x is the number of kernels and y is the kernel size (y*y).
• xPy: A pooling layer, where x is the pooling size (x*x) and y is the pooling stride.
• Relu: A rectified linear unit.
• xDrop: A dropout layer, where x is the probability value.
• xFC: A fully connected (dense) layer, where x is the number of neurons.
• xOU: An output layer, where x is the number of classes or labels.
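As an illustration of how this shorthand can be read mechanically, the sketch below parses layer tokens into structured descriptions. The hyphen separator, the exact token spelling, and the example string are hypothetical; they are not taken from Table 1.

```python
import re

# Token patterns follow the legend above; exact spellings are an assumption.
_PATTERNS = [
    ("conv", re.compile(r"^(\d+)C(\d+)$")),
    ("pool", re.compile(r"^(\d+)P(\d+)$")),
    ("relu", re.compile(r"^Relu$")),
    ("dropout", re.compile(r"^(\d*\.?\d+)Drop$")),
    ("dense", re.compile(r"^(\d+)FC$")),
    ("output", re.compile(r"^(\d+)OU$")),
]


def parse_layer(token):
    """Turn one shorthand token (e.g. '64C3') into a dictionary."""
    for kind, pattern in _PATTERNS:
        match = pattern.match(token)
        if match is None:
            continue
        if kind == "conv":
            return {"type": kind, "kernels": int(match.group(1)),
                    "kernel_size": int(match.group(2))}
        if kind == "pool":
            return {"type": kind, "size": int(match.group(1)),
                    "stride": int(match.group(2))}
        if kind == "dropout":
            return {"type": kind, "rate": float(match.group(1))}
        if kind == "dense":
            return {"type": kind, "neurons": int(match.group(1))}
        if kind == "output":
            return {"type": kind, "classes": int(match.group(1))}
        return {"type": kind}  # Relu carries no parameters
    raise ValueError("unrecognized layer token: %r" % token)


def parse_architecture(spec, sep="-"):
    """Parse a separator-joined string of tokens (separator is hypothetical)."""
    return [parse_layer(tok) for tok in spec.split(sep) if tok]


if __name__ == "__main__":
    # Hypothetical architecture string, for illustration only.
    for layer in parse_architecture("64C3-Relu-2P2-0.5Drop-256FC-10OU"):
        print(layer)
```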

Table 1. Network architectures of the deep convolutional neural network used.

Table 2. Maximum, minimum, mean, and standard deviation of the recognition accuracy obtained with different network architectures on ISIDCHAR when the system was trained for 50 epochs with the Adam optimizer. The best scores are in bold.

Table 3. Number of trainable parameters in each network architecture.
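As a rough guide to how such counts arise, the sketch below applies the standard formulas for convolutional and fully connected layers: each kernel has one weight per input channel and kernel position plus a bias, and each dense neuron has one weight per input plus a bias. The toy network, its 32*32 grayscale input, and its layer sizes are hypothetical and do not correspond to any architecture in Table 1.

```python
def conv_params(in_channels, kernels, kernel_size, bias=True):
    """Trainable parameters of a convolutional layer with square kernels."""
    per_kernel = kernel_size * kernel_size * in_channels + (1 if bias else 0)
    return kernels * per_kernel


def dense_params(inputs, neurons, bias=True):
    """Trainable parameters of a fully connected layer."""
    return neurons * (inputs + (1 if bias else 0))


def count_toy_network():
    """Count parameters of a hypothetical 32C3 -> Relu -> 2P2 -> 64FC -> 10OU net.

    Assumes a 32*32 single-channel input, unpadded stride-1 convolution,
    and 2*2 pooling with stride 2 (all illustrative assumptions).
    """
    conv = conv_params(in_channels=1, kernels=32, kernel_size=3)
    side = (32 - 3 + 1) // 2          # 30 after convolution, 15 after pooling
    flat = side * side * 32           # flattened feature vector length
    dense = dense_params(flat, 64)
    out = dense_params(64, 10)
    # Relu and pooling layers add no trainable parameters.
    return conv + dense + out


if __name__ == "__main__":
    print(count_toy_network())
```

In this toy case almost all parameters sit in the first fully connected layer, which is typical when a large feature map is flattened directly into a dense layer.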

Table 4. Maximum, minimum, mean, and standard deviation of the recognition accuracy obtained with different network architectures on ISIDCHAR when the system was trained for 50 epochs with the RMSProp optimizer. The best scores are in bold.

Table 5. Maximum, minimum, mean, and standard deviation of the recognition accuracy obtained with NA-6 on ISIDCHAR when the system was trained for 50 epochs with different optimizers. The best scores are in bold.

Table 6. Maximum recognition accuracy obtained with NA-6 and the RMSProp optimizer on ISIDCHAR, V2DMDCHAR, and the combination of both when the model was trained layer-wise.

Table 7. Comparison with the recognition accuracy reported by other researchers.