An Effective and Improved CNN-ELM Classifier for Handwritten Digits Recognition and Classification

Abstract: Optical character recognition is gaining immense importance in the domain of deep learning. With each passing day, handwritten digit (0–9) data are increasing rapidly, and plenty of research has been conducted thus far. However, there is still a need to develop a robust model that can fetch useful information and investigate self-build handwritten digit data efficiently and effectively. Convolutional neural network (CNN) models incorporating a sigmoid activation function with a large number of derivatives have low efficiency in terms of feature extraction. Here, we designed a novel CNN model integrated with the extreme learning machine (ELM) algorithm. In this model, the sigmoid activation function is upgraded to the rectified linear unit (ReLU) activation function, and the CNN unit along with the ReLU activation function is used as a feature extractor. The ELM unit works as the image classifier, which makes the perfect symmetry for handwritten digit recognition. A deeplearning4j (DL4J) framework-based CNN-ELM model was developed and trained using the Modified National Institute of Standards and Technology (MNIST) database. Validation of the model was performed with self-build handwritten digits and the USPS test dataset. Furthermore, we observed the variation of accuracy when adding various hidden layers to the architecture. Results reveal that the CNN-ELM-DL4J approach outperforms conventional CNN models in terms of accuracy and computational time.


Introduction
Recognizing handwritten digits from their images has been gaining great importance in the 21st century. Handwritten digits are used in various online handwriting applications like extracting postal zip codes [1], handling bank cheque amounts [2], and identifying vehicle license plates [3]. All these domains deal with large datasets and therefore demand high recognition accuracy with low computational complexity. It has been reported that deep learning models have more merits compared to shallow neural designs [4][5][6][7][8][9]. Differences between DNNs and SNNs are described in Table 1. The objective of a handwritten digit recognition scheme is to transform handwritten character images into machine-understandable formats. Generally, handwritten digits [10] are diverse in terms of orientation, size, distance from the margins, thickness, and strokes, which increases the complexity of recognizing handwritten numbers. Due to this diversity, handwritten numeral recognition is a challenging task for researchers. In the last few years, many machine learning and deep learning algorithms have been developed for handwritten digit recognition (HDR). Boukharouba et al. [11] proposed a novel handwritten feature extraction technique using a support vector machine. In this model, the vertical and horizontal directions of handwritten digits were merged with the Freeman chain code method, and the model does not require any digit normalization process. Mohebi et al. [12] proposed an HDR system using self-organizing maps and obtained improved results compared with former self-organizing map (SOM)-based algorithms. Alwzwazy et al. [13] classified Arabic handwritten digits by implementing a robust deep belief neural network (DBNN) based on the open-source deep learning framework Caffe and achieved 95.7% accuracy, which is not up to the mark.

Table 1 (excerpt). Differences between shallow and deep neural networks:
- Size of hidden layers: a shallow network requires a single hidden layer to fully connect the network, whereas a deep network has multiple hidden layers which may be fully connected.
- Requirements: shallow networks give more importance to the quality of features and their extraction process and are more dependent on human expertise; deep networks automatically detect the significant features of an object (e.g., an image, a handwritten character, or a face) with less human involvement.
Adhesh Garg et al. [14] presented an efficient CNN model with several convolution, ReLU, and pooling layers for a random dataset of English handwritten characters; it was trained on the MNIST dataset, and this work obtained 68.57% testing accuracy. Ayushi Jain et al. [14] proposed a new method that introduces rotational invariance using multiple instances of convolutional neural networks (RIMCNN). They applied the RIMCNN model to classifying handwritten digits and rotated-captcha recognition; this model achieved 99.53% training accuracy. Akhtar et al. [15] presented a novel feature extraction technique whereby both support vector machine (SVM) and K-nearest neighbors (K-NN) classifiers were used for offline handwritten digit recognition; experimental results show 96.18% accuracy for SVM and 97% for K-NN. Alex Krizhevsky et al. [16] presented a two-layer deep belief network (DBN) architecture that was trained on 1.6 million tiny images and achieved high classification performance on CIFAR-10. Arora et al. [17] compared a feedforward neural network (FNN) and a CNN for the classification of handwritten digits; results demonstrated that the CNN classifier performed better than the FNN. Malik and Roy [18] proposed a new approach based on an artificial neural network (ANN) and ELM for MNIST handwritten digits; the test accuracies of the two models were 96.6% and 98.4%, respectively. Ali et al. [19] designed a model for recognizing Sindhi handwritten digits using multiple machine learning approaches; their experimental results illustrate that the random forest classifier (RFC) and decision tree (DT) perform effectively compared to other ML approaches. Bishnoi et al. [20] suggested a novel method for offline HDR using various databases like NIST and MNIST, whereby every digit was divided into four regions (right, left, upper, and lower), and the curves of these four regions were used for identifying images. Cruz et al.
[21] presented a new method of feature extraction for handwritten digit identification and used ensemble classifiers. Overall, six feature sets were extracted, and this novel model was tested using the MNIST database. Nevertheless, most conventional methods discussed above used freely available datasets like MNIST and CIFAR or other self-build datasets, including Arabic, Bangla, Sindhi, etc. However, very few works have been done on self-build datasets, especially handwritten digits (0-9). Moreover, the reported work also has gaps in terms of accuracy and computational time, which need to be further improved.
Deeplearning4j (DL4J) is an open-source, Java-based, distributed deep-learning library. It is also written in Scala and can be integrated with Hadoop and Spark [22]. DL4J is designed to use distributed GPU and CPU platforms. It provides the capability to work with arbitrary n-dimensional arrays and to use CPU and GPU resources. Distinct from various other frameworks, DL4J separates the updater algorithm from the optimization algorithm. This permits flexibility while trying to find the combination that works best for the data and problem.
In the present work, we propose a self-build handwritten digit recognition system based on the symmetrical CNN-ELM algorithm. In comparison with other traditional classification algorithms, such as SVM and backpropagation (BP), ELM has fast training speed as well as high training precision in a short running time [23]. The proposed CNN-ELM architecture is split into two parts, namely feature extraction and classification, as shown in Figure 1. Initially, an input image is given to the convolutional neural network for feature extraction, and then the image is classified into one of the output classes. The whole process of the proposed method is as follows: firstly, the CNN unit with the ReLU activation function is used for feature extraction from handwritten digit images; secondly, ELM replaces the last fully connected layer of the CNN to classify the digits (0-9) based on the feature vector obtained. In addition, this study compared different numbers of hidden layers for handwritten digit recognition to validate the efficiency of the CNN-ELM-DL4J architecture. Moreover, a self-build handwritten digit dataset of 4478 samples and a USPS test dataset are utilized to test the proposed model. The results indicate that the proposed framework recognized the self-build dataset and achieved state-of-the-art test accuracy in a short computational time.
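As a rough illustration of this two-stage design, the following NumPy sketch builds a toy CNN-style feature extractor (random filters, ReLU, 2 × 2 max-pooling) feeding an ELM classifier. The synthetic 8 × 8 images, random filter values, and hidden-layer width are illustrative assumptions only, not the paper's actual MNIST configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def extract_features(images, filters):
    """Toy CNN feature extractor: valid 2-D convolution + ReLU + 2x2 max-pool."""
    feats = []
    for img in images:
        maps = []
        for f in filters:
            h = img.shape[0] - f.shape[0] + 1
            w = img.shape[1] - f.shape[1] + 1
            conv = np.array([[np.sum(img[i:i + f.shape[0], j:j + f.shape[1]] * f)
                              for j in range(w)] for i in range(h)])
            conv = relu(conv)
            # non-overlapping 2x2 max-pooling
            pooled = conv[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
            maps.append(pooled.ravel())
        feats.append(np.concatenate(maps))
    return np.array(feats)

def elm_train(F, T, n_hidden=64):
    """ELM: random input weights/biases, pseudo-inverse output weights."""
    W = rng.normal(size=(F.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = relu(F @ W + b)
    gamma = np.linalg.pinv(H) @ T
    return W, b, gamma

def elm_predict(F, W, b, gamma):
    return relu(F @ W + b) @ gamma

# Hypothetical two-class "digit" images: bright top half vs. bright bottom half.
X0 = rng.normal(0, 0.1, size=(20, 8, 8)); X0[:, :4, :] += 1.0
X1 = rng.normal(0, 0.1, size=(20, 8, 8)); X1[:, 4:, :] += 1.0
images = np.concatenate([X0, X1])
labels = np.eye(2)[[0] * 20 + [1] * 20]       # one-hot targets

filters = rng.normal(size=(3, 3, 3))          # three random 3x3 filters
F = extract_features(images, filters)
W, b, gamma = elm_train(F, labels)
pred = elm_predict(F, W, b, gamma).argmax(axis=1)
acc = (pred == labels.argmax(axis=1)).mean()
```

Because the ELM output weights come from a single least-squares solve rather than iterative backpropagation, the classification stage trains in one shot, which is the speed advantage the text describes.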

Related Work
In the previous two decades, research has been ongoing in deep learning. Using several machine learning algorithms, there have been countless strides in building classifiers on various datasets for image recognition and detection. In particular, deep learning on different datasets has shown improvements in accuracy. Deep learning algorithms like CNNs are broadly used for recognition. The MNIST dataset is a benchmark for handwritten digits that is used by numerous researchers to assess multiple leading-edge machine learning concepts. Several research papers have been published in which the MNIST dataset is used for conducting experiments based on CNN and ELM pattern classification models; they are described below.
Tan et al. [24] proposed stochastic diagonal approximate greatest descent (SDAGD) to train the weight parameters of a CNN. Hamid et al. [25] used three different classifiers, namely KNN, SVM, and CNN, to assess performance on the MNIST dataset. The performance of the multilayer perceptron on that platform was not up to the mark, as it was unable to accurately recognize digits 6 and 9 and became stuck in a local optimum rather than reaching the global minimum. With the implementation of the Keras modality, it was reported that accuracy improved with CNN, as the other classifiers also performed accurately. Xu et al. [26] researched improving overfitting in CNNs. Gosh et al. [27] conducted a comparative study on the MNIST dataset and implemented DBNs ("deep belief networks"), DNNs ("deep neural networks"), and CNNs; it was concluded that, with an accuracy rate of 98.08%, the DNN performed best among them, as the others had higher error rates as well as differences in computation time. Deng [29] presented a detailed survey on deep learning algorithms, architectures, and applications. All types, including generative, hybrid, and discriminative architectures, along with their algorithms, were discussed in detail. CNNs, recurrent neural networks (RNNs), autoencoders, DBNs, and RBMs were discussed with their various applications. Teow [30] presented a minimal, easily understandable CNN model for handwritten digit recognition. Yann LeCun et al. [31] presented a thorough overview of deep learning and its algorithms. Algorithms like RNN, CNN, and backpropagation, along with the multilayer perceptron, were discussed in all aspects with illustrations. These reported studies have demonstrated the trend of unsupervised learning in the field of artificial intelligence (AI).
Ercoli et al. [32] computed hash codes via a multi-k-means technique and used them for the retrieval of visual descriptors. Krizhevsky [16] used a two-layer convolutional deep belief network (CDBN) on the CIFAR-10 dataset; the prototype obtained 78.90% classification accuracy on a GPU unit. Abouelnaga et al. [33] constructed an ensemble classifier based on K-nearest neighbors. They used a combination of KNN and CNN and reduced overfitting through principal component analysis (PCA); an accuracy improvement of 0.7% was obtained by combining these two classifiers. Wang et al. [34] discussed a new optimization approach for building correlations between filters in CNNs. Chherawala et al. [35] proposed a weighted-vote RNN model to determine the significance of feature sets. That model is an application of RNNs, and the significance of features was determined by combinations of weighted votes; features were extracted from word images and then used for handwriting recognition. Katayama and Yamane [36] suggested a CNN architecture trained with rotated and unrotated images, which classifies by assessing the feature map obtained from its convolutional part. Pang and Yang [37] proposed a rapid learning model known as the deep convolutional extreme learning machine (DC-ELM), using the two datasets MNIST and USPS. Results show that the DC-ELM method improves testing accuracy and significantly decreases training time. He et al. [38] presented an effective model based on a combined CNN and regularized extreme learning machine, known as CNN-RELM, utilizing the ORL and NUST face databases; the proposed CNN-RELM model outperforms both CNN and RELM. Xu et al. [39] constructed a sample-selection-based hierarchical extreme learning machine (H-ELM) model for classification tasks, using a combination of FCM with CNN and H-ELM for data classification.
Results reveal that the sample selection method achieves higher prediction results with a small training dataset, in a significantly short training time, for the MNIST, CIFAR-10, and NORB databases. Das et al. [40] evaluated the performance of the ELM model on handwritten character recognition databases such as ISI-Kolkata Bangla characters, MNIST, ISI-Kolkata Odia digits, and a newly established NIT-RKL Bangla dataset. Shifei Ding et al. [41] investigated a novel convolutional extreme learning machine with kernel (CKELM) model based on deep learning to address two issues: KELM is not efficient for feature extraction, and deep learning takes excessive time for training. Results conclude that the performance of CKELM is higher than that of ELM, RELM, and KELM, especially in terms of accuracy.

Convolution Neural Network Framework
Nowadays, machine learning algorithms are popular in the fields of image segmentation and image classification because they do not alter the topological structure of the images. CNN is a deep learning neural network that is applied in various areas like pattern recognition and speech analysis [42].
The conventional structure of the CNN architecture comprises five layers, which are shown in Figure 2. The first layer, the input layer, holds a normalized pattern as an S × S matrix. Each feature map links to the inputs of its prior layer. The convolution features derived from the convolutional layer are the input for the max-pooling layer. Every neuron in a feature map shares the same kernel and the same weights [43]. For example, using a kernel size of 4, a max-pooling ratio of 2, a stride of 2, and zero padding, each feature map layer shrinks its feature size S to (S - 4)/2 from the previous feature size.
There are some special structural features in the CNN architecture, including down-sampling, weight sharing, and the local receptive field. The neurons of each layer are connected only to a certain domain, e.g., a 5 × 5 rectangular area of the network input layer. Because of these special structural attributes, each neuron extracts local structural features of the input image. The training parameters of the CNN can shrink significantly through the weight-sharing feature. Down-sampling is another effective and unique feature of the CNN model, which is suitable for extracting image features, reducing noise, and reducing the feature dimension. The CNN model is constructed as a sequence of input layer, hidden layers, and output layer, with two kinds of hidden layers: the convolution layer extracts features from the image, and the down-sampling layer selects the optimized features from the extracted features.

Extreme Learning Machine Framework
Extreme Learning Machine (ELM) is a feedforward neural network. It is a fast learning algorithm established for a single hidden layer that can recognize images. During training there is no need to iteratively adjust or update the parameters; one simply adjusts the number of hidden-layer nodes to find the best solution [44]. In contrast with conventional classification methods like CNN and SVM [45], ELM offers very fast and efficient learning, robust generalization capability, and few parameter adjustments. Formally, for a single-hidden-layer neural network, suppose that we have a set of N arbitrary distinct samples (x_i, t_i), with x_i = [x_{i1}, \ldots, x_{in}]^T \in R^n and t_i = [t_{i1}, \ldots, t_{im}]^T \in R^m. For L hidden-layer nodes, a single hidden layer can be described as follows:

\sum_{j=1}^{L} \gamma_j G(W_j \cdot x_i + b_j) = o_i, \quad i = 1, \ldots, N,

where the activation function is denoted by G(x), W_j = [W_{j1}, W_{j2}, \ldots, W_{jn}]^T are the input weights, \gamma_j is the output weight, the bias is denoted by b_j, and (W_j \cdot x_i) indicates the inner product of the input weights and the input samples. A single hidden layer is used to reduce the output error, which can be mathematically expressed as

\sum_{i=1}^{N} \| o_i - t_i \| = 0.

That is, if suitable \gamma_j, W_j, and b_j exist, then

\sum_{j=1}^{L} \gamma_j G(W_j \cdot x_i + b_j) = t_i, \quad i = 1, \ldots, N,

which can be expressed in matrix form as

H\gamma = T.

By training the single-hidden-layer neural network, we obtain \hat{W}_j, \hat{b}_j, and \hat{\gamma} which satisfy

\| H(\hat{W}_j, \hat{b}_j)\hat{\gamma} - T \| = \min_{W_j, b_j, \gamma} \| H(W_j, b_j)\gamma - T \|, \quad j = 1, 2, \ldots, L,

which is equivalent to minimizing the loss function

E = \sum_{i=1}^{N} \Big( \sum_{j=1}^{L} \gamma_j G(W_j \cdot x_i + b_j) - t_i \Big)^2.

The ELM algorithm does not require any iterative adjustment of parameters: after randomly determining the input weights W_j and biases b_j, the hidden-layer output matrix H is uniquely determined, and the output weights \gamma are obtained from H and T.
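The procedure above can be sketched in a few lines of NumPy: W_j and b_j are drawn at random and never updated, H is computed once, and the output weights γ come from a single least-squares solve via the Moore-Penrose pseudo-inverse. The sample counts, dimensions, and sigmoid activation below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# N samples, n input features, m output classes, L hidden nodes (names follow the text).
N, n, m, L = 100, 8, 3, 40
X = rng.normal(size=(N, n))
T = np.eye(m)[rng.integers(0, m, size=N)]   # one-hot targets

# Step 1: randomly assign input weights W_j and biases b_j (never updated).
W = rng.normal(size=(n, L))
b = rng.normal(size=L)

# Step 2: compute the hidden-layer output matrix H with activation G = sigmoid.
G = lambda x: 1.0 / (1.0 + np.exp(-x))
H = G(X @ W + b)

# Step 3: solve H @ gamma = T in the least-squares sense via the pseudo-inverse.
gamma = np.linalg.pinv(H) @ T

train_err = np.linalg.norm(H @ gamma - T)
```

Since step 3 is a closed-form solve rather than gradient descent, training cost is dominated by one pseudo-inverse, which is why ELM trains so much faster than backpropagation-based networks.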

Combined and Improved CNN-ELM-DL4J Framework
In the convolution layer of Figure 2, the kernel is convolved over the entire image and produces an output through the activation function. Usually, convolution and subsampling layers alternate in a CNN. All output feature maps of the convolution layer are connected to the input feature maps. The output feature maps of the convolution layer are obtained as follows:

x_j^n = f\Big( \sum_{i} x_i^{n-1} * W_{ji} + \varnothing_j \Big),

where n is the index of the convolution layer, the convolution kernel is denoted by W_{ji}, \varnothing_j is the bias, x_i^{n-1} are the input maps, and the activation function is represented by f(\cdot). The sigmoid function is a typical CNN activation function, but it increases the training time of the network. Therefore, an improved, easy-to-derive, unsaturated, and nonlinear ReLU function [46] is used in each convolution layer. The ReLU function also reduces overfitting and speeds up the convergence of the whole CNN architecture. The ReLU function is given by the equation

f(x) = \max(0, x).

In the present study, classification accuracy was obtained through the implementation of an enhanced CNN framework integrated with the ELM algorithm. The CNN unit extracts features from the handwritten images and gives an output, while the ELM utilizes the output generated from the CNN unit as input and produces results by classifying the images.
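A minimal NumPy definition of the ReLU function and its derivative, matching the equation f(x) = max(0, x); using 0 as the subgradient at x = 0 is a common convention, assumed here.

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x) -- zero for negative inputs, identity for positive."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative: 0 for x < 0, 1 for x > 0 (subgradient 0 chosen at x = 0)."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = relu(x)
g = relu_grad(x)
```

Unlike the sigmoid, the gradient is exactly 1 for all positive inputs, so it does not saturate, which is the convergence-speed advantage the text refers to.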

Used Datasets
For the experimental study, the well-renowned, freely available MNIST dataset is utilized. This database consists of 70,000 images. In our experimental study, we employed 60,000 images for training and 10,000 images for testing. The dataset was already normalized, i.e., there is no need for further pre-processing. Figure 3 represents handwritten images from the MNIST dataset. Moreover, in this project, the USPS test dataset of 2007 samples is also used for testing purposes. USPS [47,48] comprises 7291 training samples and 2007 testing samples in grayscale for the digits 0 to 9.


Own Test Dataset and Preprocessing
A self-build test dataset was created containing 4478 handwritten digit images, with around five hundred images for each numeral (0-9). The dataset was constructed by 5 university students, with participant ages varying from 15 to 30 years. We cannot use our own dataset directly for experimentation because the data items are untidy and of different sizes. First, we resized each grayscale image to 28 × 28, then inverted the colors so that the foreground became white and the background became black. While creating the dataset, care was taken to make each image as natural as ordinary people write standard handwritten digits in their daily routine. We used the self-build dataset only for testing purposes. The whole dataset depicts ten output classes for the digits (0-9).
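The preprocessing steps described above (resize to 28 × 28 grayscale, then invert so the foreground becomes white on a black background) might be sketched as follows. The nearest-neighbour resize and the 0-255 value range are assumptions, since the paper does not specify its exact tooling.

```python
import numpy as np

def preprocess(img, size=28):
    """Resize a grayscale image to size x size (nearest neighbour), invert it so
    digits are white on a black background, and scale pixel values to [0, 1]."""
    img = np.asarray(img, dtype=np.float64)
    rows = (np.arange(size) * img.shape[0] / size).astype(int)
    cols = (np.arange(size) * img.shape[1] / size).astype(int)
    resized = img[np.ix_(rows, cols)]   # nearest-neighbour sampling grid
    inverted = 255.0 - resized          # black background, white foreground
    return inverted / 255.0

sample = np.full((100, 80), 255.0)      # hypothetical white page
sample[30:70, 20:60] = 0.0              # with a dark pen stroke
out = preprocess(sample)
```

After this step every sample matches the 28 × 28 white-on-black convention of MNIST, so the network trained on MNIST can be tested on the self-build data directly.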

CNN-ELM-DL4J Model Details
The proposed CNN-ELM structure mainly comprises the following layers: an input layer, a convolutional layer, a pooling layer, a fully connected layer, a Softmax layer, and an ELM classification layer.

An input layer is the input of the neural network. Before the image enters this layer, it must be decided how much preprocessing it requires. Networks like LeNet-5, for example, work well on images with little preprocessing.
The convolutional layer is the essential layer of a CNN. In this layer, the images are transformed into a set of representative features. The main objective is to reduce the images into something easier to process, without losing their important characteristics, i.e., to create a feature map. The element that carries out the convolution operation in the convolutional layer is named a neuron, filter, or kernel. This element, a square matrix smaller than the image itself, takes square patches of pixels and passes them through the filter with a certain stride until the image is completely parsed and significant patterns in the pixels are found. The convolution consists of taking the dot product of the filter with the patch of the image; the values of the filter can assign importance to some aspects of the image, so that the network can differentiate one image from another. These values are learnable values, called weights, and are reinforced by other learnable values, called biases, which are constant offsets.
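The patch-wise dot product described above can be written directly. This is the cross-correlation form used by most CNN libraries, and the 1 × 2 difference filter in the usage example is purely hypothetical.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and take the dot product of the kernel
    with each patch (cross-correlation, as implemented in most CNN libraries)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])   # hypothetical 1x2 horizontal-difference filter
fmap = conv2d(img, edge)
```

On the 4 × 4 ramp image every horizontal neighbour differs by 1, so every entry of the resulting feature map is -1.0, illustrating how a filter responds uniformly to a uniform pattern.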
There are some commonly used activation functions, such as those in the linear-unit family, for example the ReLU function. The ReLU function gives an output of zero for any negative input (or an input of zero) while passing any positive input through unchanged. A down-sampling (data reduction) operation is performed in the pooling layer, where the size of the feature map is reduced. There are different ways of down-sampling the data; however, max-pooling is the most commonly used option.
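A minimal max-pooling routine of the kind described, assuming the common 2 × 2 window with stride 2:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Max-pooling: keep the largest value in each window, shrinking the
    feature map while retaining the strongest activations."""
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

fmap = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 1., 2., 0.],
                 [5., 6., 3., 1.]])
pooled = max_pool(fmap)
```

Each 2 × 2 block collapses to its maximum, halving each spatial dimension while keeping the strongest response in every region.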
The convolutional and sub-sampling layers are succeeded by one or more fully connected layers, where each neuron connects to all the neurons in the previous layer. The features of the image, extracted by the preceding layers, are combined to recognize larger patterns, and the last ELM layer combines the features to classify the images.
The number of outputs of the layer is equal to the number of classes in the target data (digit recognition is a 10-class recognition problem). The feature vector acquired from the preceding Fc_1 layer serves as the input for the ELM algorithm. ELM uses a training function on the training set; after obtaining the trained parameters, the predict function of ELM is used to classify the test sets. In the end, the recognition accuracy on the training and validation sets is obtained.
The architecture used in this paper is a variation of LeNet-5, and it was decided to implement this type of CNN-ELM architecture due to the nature of the character recognition problem (the LeNet-5 architecture was created to work specifically with handwritten digit recognition). The structure of the improved CNN-ELM for handwritten digit image recognition is depicted in Figure 4, and the detailed setting of each layer is represented in Table 2.

The Training of CNN
In this phase, we trained the CNN model, which is depicted in Figure 5. The associated features and parameters are adjusted using the gradient descent method according to the errors between the actual output and the desired output. The training process stops when the minimum error or the maximum number of iterations is reached, at which point the model is saved for the next step. The feature maps of the CNN are computed as follows:

i. For the n-th convolutional layer, the m-th feature map is derived by

x_m^n = f\left( \sum_{j \in N_m} x_j^{n-1} * k_{jm}^n + b_m^n \right),

where N_m is the set of input maps, f is the nonlinear activation function, k_{jm}^n is the convolutional filter, and b_m^n is the bias.

ii. Similarly, for the n-th subsampling layer, the m-th feature map is obtained by

x_m^n = f\left( w_m^n \cdot \mathrm{down}(x_m^{n-1}) + b_m^n \right),

where w_m^n is the weight, down(·) is a pooling function, and b_m^n is the bias.
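The two update rules above can be illustrated with a direct NumPy sketch. This is a naive reference implementation, not the DL4J code used in the paper; 'valid' cross-correlation, 2x2 mean pooling as down(·), and ReLU as f are assumptions chosen for concreteness.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_feature_map(prev_maps, kernels, bias):
    # prev_maps: list of 2-D arrays x_j^{n-1} in the input set N_m
    # kernels:   list of 2-D filters k_{jm}^n, one per input map
    # bias:      scalar b_m^n
    kh, kw = kernels[0].shape
    H = prev_maps[0].shape[0] - kh + 1
    W = prev_maps[0].shape[1] - kw + 1
    out = np.full((H, W), float(bias))
    for x, k in zip(prev_maps, kernels):
        for i in range(H):
            for j in range(W):
                out[i, j] += np.sum(x[i:i+kh, j:j+kw] * k)  # valid cross-correlation
    return relu(out)  # f( sum_j x_j * k_jm + b_m )

def subsample_feature_map(prev_map, weight, bias, pool=2):
    # x_m^n = f( w_m^n * down(x_m^{n-1}) + b_m^n ), down = pool x pool mean pooling
    H, W = prev_map.shape
    d = prev_map[:H - H % pool, :W - W % pool]
    d = d.reshape(H // pool, pool, W // pool, pool).mean(axis=(1, 3))
    return relu(weight * d + bias)
```

A production framework implements the same arithmetic with vectorized or GPU kernels; the nested loops here exist only to mirror the equations term by term.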


Results and Discussion
In the experimental part, the model was trained on the freely accessible MNIST dataset, while testing/validation of the framework was carried out with the self-build handwritten and USPS digit datasets. The results were analyzed through a confusion matrix. In addition, the architecture was validated by changing the number of hidden layers. Finally, an accuracy comparison of the proposed framework with the reported literature shows that state-of-the-art accuracy has been achieved by combining the CNN architecture with the ELM algorithm.

Digits vs. Error Rate
The error rate plot allows one to draw further statistical inferences. The line graph in Figure 6a represents the error rate versus the numerals (digits) for both the CNN and CNN-ELM networks. One can infer from this diagram that the error rate for digits zero, two, and six is the lowest (0%). This might be attributed to the less cursive handwriting styles for digits 0, 2, and 6. Meanwhile, the highest error rate for our proposed CNN-ELM network is found for digit 8 (0.94%), owing to its resemblance to digits 3 and 5. The bare CNN network, however, shows higher error rates than the extreme-learning-based convolutional neural network. To support this statement, handwritten digits versus correctly and incorrectly classified images are plotted in Figure 6b. According to this plot, digits zero, two, and six have the fewest incorrectly classified images, which directly corresponds to the lowest error rates. The reverse is the case for digit 8.
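Per-digit error rates such as those in Figure 6a are computed from the true and predicted labels. An illustrative NumPy sketch (the label arrays in the usage example are toy placeholders, not the paper's data):

```python
import numpy as np

def per_digit_error_rate(y_true, y_pred, n_classes=10):
    # Percentage of misclassified test images for each digit.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rates = np.zeros(n_classes)
    for d in range(n_classes):
        mask = y_true == d
        if mask.any():
            rates[d] = 100.0 * np.mean(y_pred[mask] != d)
    return rates
```

For example, 1 misclassification out of 100 test images of a digit yields a 1.0% error rate for that digit, matching the scale used in the figure.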



Training and Validation Accuracy
We observed the training and validation accuracy of the CNN and CNN-ELM models simultaneously. From Figure 7a,b, the accuracy varies considerably across digits during both training and validation. The training accuracy of the CNN-ELM model dominates that of the CNN model: for instance, the maximum training accuracy recorded was 98.90% for digit five with CNN and 99.10% for digits zero and five with CNN-ELM, as seen in Figure 7a. Similarly, in Figure 7b, one can observe that the validation accuracy of CNN-ELM surpasses that of the bare CNN for every digit. Altogether, the validation accuracy of the proposed model dominates that of the CNN model. This results from efficient training, achieved by adjusting the hyperparameters and selecting the appropriate ones. These results are concordant with the error rate plot in Figure 6a. Thus, all results indicate that the proposed CNN-ELM model is more efficient than the others, especially the bare convolutional neural network.
The training set size strongly affects network accuracy: accuracy increases as more data become available to train the model. The testing accuracy, in contrast, is not affected as much by the addition of more data to the training set.

Analysis through Confusion Matrix
The results on the self-build handwritten numerals dataset are presented in the confusion matrix shown in Table 3. The self-build dataset comprises around 4500 handwritten images (the total number of images for each digit is highlighted in red) used for testing; only 19 of them were misclassified. For digits 0, 2, and 6, the recognition rate is 100%, because not a single image is misclassified. Among the 473 images of digit 1, only one image is wrongly predicted as 7. Similarly, for digit 3, two images were misclassified as digits 2 and 9. The complete details of incorrectly predicted images are shown in the table, with a total of 19 images wrongly classified across the whole test dataset; the maximum number of wrong predictions is highlighted in pink. The confusion matrix shows into which class each misclassified image falls, and after careful inspection of the patterns of these images, the causes behind the misclassifications are quite understandable.
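A confusion matrix like Table 3, together with the list of off-diagonal misclassifications it encodes, can be assembled in a few lines. An illustrative NumPy sketch (the arrays in the test are toy data, not the self-build test set):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    # cm[t, p] counts images of true digit t predicted as digit p
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def misclassifications(cm):
    # (true digit, predicted digit, count) for every non-zero off-diagonal entry
    return [(t, p, int(cm[t, p]))
            for t in range(cm.shape[0])
            for p in range(cm.shape[1])
            if t != p and cm[t, p] > 0]
```

The diagonal of `cm` gives the per-digit correct counts, so a 100% recognition rate for a digit corresponds to an all-zero row outside the diagonal entry.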


Comparison of Different Number of Hidden Layers
Moreover, additional experiments were performed with an increasing number of layers to check the influence of the number of hidden layers on accuracy. According to the reported literature, increasing the number of hidden layers can yield higher accuracy [24]. On the contrary, our proposed framework showed a decrease in accuracy as hidden layers were added. Figure 8a,b indicate that the accuracy of the architecture with seven hidden layers is lower than that of the five- and six-hidden-layer architectures for both the CNN and CNN-ELM-DL4J models, and that the framework with five hidden layers attains the highest accuracy. We also observed that increasing the number of hidden layers increases the complexity of the network and, with it, the computational time. This unusual behaviour is due to the size of the dataset: when experiments are performed on an enormous dataset, a framework with a larger number of hidden layers may give higher accuracy. In our case, the testing dataset contains only 4478 images, and the accuracy with five hidden layers is the highest.
Therefore, the proposed model needs only a few parameters to accomplish higher recognition accuracy for self-build handwritten images, so it is computationally efficient. The accuracy comparison of various classification approaches is given in Table 4.
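The growth in network complexity with depth can be made concrete by counting the trainable parameters of the fully connected part of the network. A hedged sketch; the 256-unit layer width and 784-dimensional input are assumptions for illustration, not settings from Table 2:

```python
def param_count(layer_sizes):
    # weights + biases for each fully connected transition
    return sum(a * b + b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

# assumed 784-dim input, 10 output classes, 256 units per hidden layer
for depth in (5, 6, 7):
    sizes = [784] + [256] * depth + [10]
    print(depth, "hidden layers:", param_count(sizes), "parameters")
```

Each extra 256-unit hidden layer adds 256*256 + 256 = 65,792 parameters, which illustrates why deeper variants cost more computation without necessarily improving accuracy on a small test set.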


Conclusions
Various deep learning-based models have been employed thus far to recognize handwritten digits from images. However, there is still a need for a model that offers high recognition accuracy, short computational time, and efficient feature extraction. Herein, a state-of-the-art convolutional neural network (CNN) combined with an extreme learning machine architecture was implemented; the MNIST images were used for training, and self-build handwritten numeral images were used for model validation. Experimental results demonstrate that the CNN-ELM-DL4J algorithm is better than conventional CNN models in terms of recognition accuracy and computational time. By using the ELM algorithm, our model is computationally efficient compared with a plain CNN and other machine learning networks. Furthermore, we explored the effect of the number of hidden layers on the model's efficiency. From the results, it is concluded that adding more hidden layers elevates network complexity and computational time; thus, a framework with an optimum number of hidden layers gives higher accuracy. For future work, the experimental results can be improved by increasing or changing the dataset images and/or further tuning the network with appropriate parameters.