A Sustainable Deep Learning Framework for Object Recognition Using Multi-Layers Deep Features Fusion and Selection

Abstract: With an overwhelming increase in the demand for autonomous systems, especially in applications related to intelligent robotics and visual surveillance, come stringent accuracy requirements for complex object recognition. A system that maintains its performance against a change in the object's nature is said to be sustainable, and such systems have become a major area of research for the computer vision community in the past few years. In this work, we present a sustainable deep learning architecture, which utilizes multi-layer deep feature fusion and selection, for accurate object classification. The proposed approach comprises three steps: (1) by utilizing two deep learning architectures, Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG) and Inception V3, it extracts features based on transfer learning; (2) fusion of all the extracted feature vectors is performed by means of a parallel maximum covariance approach; and (3) the best features are selected using the Multi Logistic Regression controlled Entropy-Variances method. For verification of the robustness of the selected features, the ensemble learning method named Subspace Discriminant Analysis is utilized as a fitness function. The experimental process is conducted using four publicly available datasets, Caltech-101, the Birds database, the Butterflies database, and CIFAR-100, with a ten-fold validation process, which yields best accuracies of 95.5%, 100%, 98%, and 68.80% on the respective datasets. Based on a detailed statistical analysis and comparison with the existing methods, the proposed selection method achieves significantly higher accuracy. Moreover, its computational time is low enough for real-time implementation.


Introduction
Object recognition is currently one of the most actively researched areas in computer vision due to its emerging applications in intelligent robotics and visual surveillance [1,2]. Researchers, however, still face problems in this domain in achieving correct object recognition, such as recognizing an object's shape and spotting minor differences among several objects. Therefore, a sustainable system, i.e., one that maintains its performance against a change in the object's nature, is required for the correct recognition of complex objects [3]. Object classification is the key to a sustainable visual surveillance system [4]. Besides the latter, object classification finds application in numerous domains, including intelligent robotics, face and action recognition, video watermarking, pedestrian tracking, autonomous vehicles, semantic scene analysis, content-based image retrieval, and many more. We believe that a genuinely sustainable object recognition system still has to overcome numerous challenges, including complex backgrounds, different shapes and the same color for different objects, continuously moving objects, different viewing angles, and many more, since the conventionally used, unsustainable systems did not prove to work well for complex object classification [5].
Many techniques have been introduced in computer vision to overcome the previously discussed challenges related to complex objects. Most of them sought a single optimal method that would perform equally well for many types of problems, which proved a considerable challenge. Conventional approaches, such as Hand-Crafted Features (HCF), were used over the past few decades; as time passed, however, objects and their backgrounds became more confusing, thereby restricting their use. Hand-crafted features included the Histogram of Oriented Gradients (HOG) [6], geometric features [7], the Scale Invariant Feature Transform (SIFT) [8], the Difference of Gaussians (DoG) [9], Speeded-Up Robust Features (SURF) [10], and Haralick texture features [11]. Recent techniques, in contrast, proposed to exploit a hybrid set of features to get a better representation of an object [12]. Unfortunately, those techniques were unable to cope with the growing complexity of objects and images as well.
In the face of the aforementioned challenges, the concept of deep learning has recently been introduced in this context, and it has shown improved performance at reduced computational time. With this, a large number of pre-trained convolutional neural network (CNN) models have been proposed. These include AlexNet [13], VGG (VGG-16, VGG-19) [14], GoogLeNet [14], ResNet (ResNet-50, ResNet-101, and ResNet-152) [15], and Inception [16]; all these models are trained on the ImageNet dataset. Even with these contributions, however, acceptable accuracy has been difficult to achieve. This has given rise to the concept of features fusion [15,16]: a process of combining several feature populations into a single feature space, which has been adopted in various applications ranging from medical imaging to object classification [17][18][19]. Features fusion does manage to achieve increased classification accuracy, but only at an increased computational cost. In addition, some recent works have shown that the fusion process may add irrelevant features that are not important for the classification task [17,18]. We believe that if the irrelevant features were identified and removed from the fused vector, then the computational time could be minimized with an increased accuracy.
Feature selection can be categorized into three types: filter-based, wrapper-based, and embedded. Filter-based selection evaluates feature subsets independently of any learning algorithm. Wrapper-based methods initially assume a feature subset and then select features based on their predictive power. Embedded selection performs the selection within the training phase, thereby enjoying the advantages of both the filter-based and wrapper-based approaches [19]. Some well-known feature selection techniques include Principal Component Analysis (PCA) [20], Linear Discriminant Analysis (LDA) [21], the Pearson Correlation Coefficient (PCC) [22], Independent Component Analysis (ICA) [22], entropy-controlled selection [23], Genetic Algorithm-based (GA) selection [24], and many more.
In this work, a complete sustainable framework based on a deep learning architecture is proposed. While we summarize the challenges and highlight our contributions in response to them in Section 3, the details of the proposed framework are given in Section 4. Section 5 presents the simulation results before we conclude the manuscript in Section 6. In what follows, however, we first review some of the existing related works in Section 2.

Related Work
Many strategies for image classification have been investigated in the areas of computer vision and machine learning. Object categorization is among the most emergent fields of computer vision because of its enormous applications in video surveillance, auto-assisted vehicle frameworks, pedestrian analysis, automatic target recognition, and so on. In the literature, very few fusion-based techniques have been presented for the classification of complex objects. Features fusion is the process of combining two or more feature spaces into a single matrix. Through fusion, there is a chance to obtain a higher-accuracy vector carrying the properties of multiple feature spaces. Roshan et al. [25] presented a new technique for object classification. They applied the presented algorithm on the VGG-16 architecture and performed training from scratch. Additionally, they used transfer learning (TL) on the top layers. They utilized the Caltech-101 dataset and achieved an accuracy of 91.66%. Jongbin et al. [26] introduced a new DFT-based technique for feature building by discarding the pooling layers between the fully connected and convolutional layers. Two modules were implemented in this technique: the first module, known as DFT, replaced max-pooling in the architecture with pooling of a user-defined size; the second module, known as DFT+, fused multiple layers to get the best classification accuracy. They achieved 93.2% classification accuracy on the Caltech-101 dataset using the VGG-16 CNN network, and 93.6% accuracy on the same dataset using the ResNet-50 model. Qun et al. [27] used a pre-trained network with associative memory banks for feature extraction. They extracted the features using ResNet-50 and VGG-16. Later, K-Means clustering was used on the memory banks to perform unsupervised clustering. Qing et al. [28] presented a fused framework for object classification. They extracted the CNN features and applied three different types of coding techniques to the fused vector.
Two pre-trained models, namely VGG-M and VGG-16, were used for feature extraction from the 5-Conv-Layer. Subsequently, PCA-based reduction was applied, and features were fused into a final vector using the proposed coding techniques. Results showed an improved accuracy of 92.54% on the Caltech-101 database. Xueliang et al. [29] presented a late-fusion-based technique for object recognition. Three pre-trained networks, namely AlexNet, VGGNet, and ResNet-50, were used for the purpose. They first established that the middle-level layers of the CNN architecture contain more robust information for visual representation, and then extracted features from these layers. Features fusion from these three models showed an improved result, with a reported accuracy of 92.2% on the Caltech-101 dataset. Hamayun et al. [30] showed that the most robust features were extracted from fully connected layer 6 (FC-6) instead of FC-8. In the presented approach, they exploited the CNN output and modified it at a middle-level layer instead of the deepest layer. The VGG-16 and VGG-19 pre-trained models were used to illustrate the proposed technique. They extracted 4096 features from the FC-6 layer and then applied reduction using PCA. For the experimental process, they used the Caltech-101 dataset and attained an accuracy of 91.35% using the reduced features from layer FC-6. Mahmood et al. [31] presented an approach for object detection and classification using pre-trained networks (ResNet-50 and ResNet-152). After feature extraction, they performed feature reduction using PCA. The Caltech-101 database was selected for evaluation, and an accuracy of 92.6% was achieved. Emine et al. [32] used the Convolutional Architecture for Fast Feature Embedding (Caffe) for object recognition. About 300 images from the Caltech-101 dataset were used to test the proposed technique. Results showed that 260 images were correctly classified, and 40 were misclassified. Chunjie et al. [33] introduced a new technique, called Contextual Exemplar, to handle the drawbacks caused by local features. The method comprised three phases: in the first, they combined region-based image segments; in the second, they constructed the relationships between those regions; and in the third, they used the connections of those regions for semantic representation. They selected 1000 features and achieved an accuracy of 86.14%. Rashid et al. [8] focused on the fusion of multiple features and the selection of the best of them for efficient object classification. They used VGG and AlexNet pre-trained models for CNN feature extraction and SIFT for point feature extraction. Both types of features were fused by a simple concatenation approach. Moreover, they introduced an entropy-based selection approach within their framework, which achieved an accuracy of 89.7% on the Caltech-101 dataset. Nazar et al. [34] fused HOG and Inception V3 CNN features and improved the existing accuracy up to 90.1% for the Caltech-101 dataset.

Challenges and Contributions
The computer vision research community still faces various challenges in object classification, most of them due to the complex nature of objects. We realize that it is not an easy task to classify objects into their relevant categories efficiently. To tackle the challenges facing the community and achieve the required accuracies, in this work, we propose a deep learning architecture-based framework for object classification with improved accuracy. Among the highlights of the framework, a detailed statistical analysis of the proposed method is conducted and compared with recent techniques to examine the stability of the proposed architecture.

Materials and Methods
The proposed object classification architecture is presented in this section with detailed mathematical formulation and visible results. As shown in Figure 1, the proposed architecture consists of three core steps: deep learning feature extraction using TL, fusion of the features of the various models, and selection of the robust features for final classification. In the classification step, the Ensemble Subspace Discriminant (ESD) classifier is used, and its performance is compared with other learning algorithms. The details of each step, depicted in this figure, are discussed below.

Deep Learning Features Extraction
For the past two decades, deep learning has proven itself as the best approach for image recognition and classification [8,[35][36][37]. A CNN is a deep learning method involving a series of layers. A simple CNN model consists of convolution and pooling layers. A few other layers are the activation layer, named ReLU, and the feature layer, called fully connected (FC). The first layer of a CNN is known as the input layer. This layer takes images as input, and the convolutional layer computes the neurons' responses; the latter are calculated by the dot product of weights and smaller regions. While the ReLU layer provides the activation function, the pooling layer between convolution layers removes the inactive neurons for the next phase. Finally, the high-level features are computed using the FC layers and classified through Softmax [8]. In this work, we use two pre-trained CNN models, namely VGG19 and Inception V3, for feature extraction. In what follows, we present a brief description of each model. VGG19: VGG-19 [38] consists of 16 convolutional layers, 19 learnable-weight layers, which are utilized for transfer learning, 3 FC layers, and an output layer. This model is already trained on the ImageNet dataset. The input size for this model is 224 × 224 × 3, as given in Table A1 (Appendix Section). The learnable weights and bias of the first convolution layer are of dimensions 3 × 3 × 3 × 64 and 1 × 1 × 64, giving 1792 learnable parameters at this layer; for the second convolution layer, the total is 36,928. This layer extracts the local features of an image.
Mathematically, the output of a convolution layer is defined as:

$$y_j^{(l)} = b_j^{(l)} + \sum_{i} k_{i,j}^{(l)} * y_i^{(l-1)},$$

where $y_j^{(l)}$ is the output of layer $l$, $b_j^{(l)}$ is the bias value, $k_{i,j}^{(l)}$ denotes the filter mapping the $i$th feature value, and $y_i^{(l-1)}$ means the $(l-1)$th output layer. The learnable weights and bias of the first FC layer are of dimensions 4096 × 25,088 and 4096 × 1. A dropout layer is added between the FC layers, with a dropout rate of 50%. For FC layer 7, the total number of learnable parameters is 16,781,312, with learnable weights of 4096 × 4096. For the last FC layer, the total is 4,097,000, with learnable weights of 1000 × 4096. Hence, when the activation is applied, it returns a feature map vector of dimension 1 × 1 × 1000; for the first two fully connected layers, the feature map vector dimension is 1 × 1 × 4096.
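As a quick sanity check, the following short Python snippet (ours, for illustration only) reproduces the parameter counts quoted above directly from the layer shapes:

```python
# Sanity check of the VGG-19 learnable-parameter counts quoted above.
# Each conv layer: kernel_h * kernel_w * in_channels * filters + filters (biases).
conv1_1 = 3 * 3 * 3 * 64 + 64        # first convolution layer  -> 1792
conv1_2 = 3 * 3 * 64 * 64 + 64       # second convolution layer -> 36928
# Each FC layer: out_features * in_features + out_features (biases).
fc6 = 4096 * 25088 + 4096            # FC6 -> 102764544
fc7 = 4096 * 4096 + 4096             # FC7 -> 16781312
fc8 = 1000 * 4096 + 1000             # FC8 (1000 ImageNet classes) -> 4097000
print(conv1_1, conv1_2, fc6, fc7, fc8)
```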
Inception V3: This is an advanced pre-trained CNN model. It consists of 316 layers and 350 connections. It has 94 convolution layers of different filter sizes, and the size of the first input layer is 299 × 299 × 3. A brief description of this model is given in Table A2 (Appendix Section), where it is shown that a scaling layer is added after the input layer. On the first convolution layer, activation is performed, and a weight matrix of dimension 149 × 149 × 32 is obtained, where 32 denotes the number of filters. Later, the batch normalization and ReLU activation layers are added. Mathematically, the ReLU layer is defined as:

$$f(x) = \max(0, x).$$

Between the convolution layers, a pooling layer is also added to keep the active neurons. In the first max-pooling layer, the filter size is 2 × 2. Mathematically, the max-pooling is defined as:

$$y_{m,n} = \max_{0 \le p < f_1,\; 0 \le q < f_2} \, x_{(m s + p),\,(n s + q)},$$

where $s$ denotes the stride, and $f_1$, $f_2$ are the defined filter sizes for the feature set maps, such as 2 × 2 or 3 × 3. Moreover, a few other layers are also added in this architecture, such as addition and concatenation layers. In the end, an average pool layer is added. The activation is performed, and in the output, a resultant weight matrix is obtained as a feature map of dimension 1 × 1 × 2048. The last layer is FC; its learnable weight matrix is 1000 × 2048, and the ensuing feature matrix is 1 × 1 × 1000. Mathematically, the FC layer is defined as follows:

$$y = W x + b,$$

where $W$ is the learnable weight matrix and $b$ the bias vector.

Feature Extraction using TL: In the feature extraction step, we employ TL, by which we retrain both CNN models (VGG19 and Inception V3) on the selected datasets. For training, we adopt a 60:40 split along with labeled data. Furthermore, we perform preprocessing, in which we resize the images according to the input layer of each model. Later, we select the input convolutional and output layers for feature mapping. For VGG19, we choose the first convolutional layer as the input layer and FC7 as the output. After that, the CNN activation is performed, and we obtain the training and testing vectors. On the feature layer FC7, a resultant feature vector of dimension 1 × 4096 is obtained and utilized in the next process. The modified architecture of VGG19 is shown in Figure 2. For Inception V3, we select the first convolutional layer as input and the average pool layer as the feature map. Similar to VGG19, we perform TL, retrain this model on the selected datasets, and apply the CNN activation on the average pool layer. On this layer, we obtain a feature vector of dimension 1 × 2048. Both training and testing vectors proceed to the next features fusion process. The modified architecture of Inception V3 is shown in Figure 3, where it can be seen that the last three layers are removed before the model is retrained on the selected datasets for this work.
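To make this extraction step concrete, the following is a minimal sketch in Python with PyTorch/torchvision (for illustration only; the paper's experiments were run in MATLAB). The fine-tuning (TL) pass on the selected dataset is assumed to have already been performed, and sample.jpg is a placeholder file name.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# VGG19: truncate the classifier after FC7 (+ its ReLU) to expose 4096-d features.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:5])

# Inception V3: replace the final FC layer with identity to expose the
# 2048-d average-pool features.
inc = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1).eval()
inc.fc = nn.Identity()

def preprocess(path, size):
    # Resize according to each model's input layer (224 for VGG19, 299 for Inception V3).
    tf = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    return tf(Image.open(path).convert("RGB")).unsqueeze(0)

with torch.no_grad():
    f_vgg = vgg(preprocess("sample.jpg", 224))  # shape: [1, 4096] (FC7 activation)
    f_inc = inc(preprocess("sample.jpg", 299))  # shape: [1, 2048] (average pool)
```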

Features Fusion
The fusion of multiple features into one matrix is an active research area in pattern recognition. The primary purpose of features fusion is to obtain a stronger feature vector for classification. Recent research shows that the fusion process improves the overall accuracy, but, on the other hand, its main disadvantage is a high computational time. However, our first priority is to improve the classification accuracy. For this purpose, we implement a new Parallel Maximum Covariance (PMC) approach for features fusion. In this approach, we first need to equalize the lengths of both extracted feature vectors; later, we find the maximum covariance for fusion into a single matrix.
Consider two deep learning feature matrices $X$ and $Y$ of dimensions $N \times p$ and $N \times q$, where $N$ denotes the number of images, $p$ indicates the VGG19 feature vector length ($N \times 4096$), and $q$ the Inception V3 feature vector length ($N \times 2048$). To make the lengths of the vectors equal, we first find the maximum-length vector and perform average-value padding, where the average feature is calculated from the higher-length vector. Let $u$ be an arbitrary unit column vector presenting a pattern in the $X$ field, and $v$ a unit column vector representing a pattern in the $Y$ field. The projections of the two fields onto these vectors are defined as follows:

$$a = X u, \qquad b = Y v.$$

For the optimal solutions $u$ and $v$, we maximize their covariance as follows:

$$\max_{u,\, v} \ \operatorname{cov}(a, b) = \max_{u,\, v} \ u^{T} C_{XY}\, v,$$

where $C_{XY}$ is the covariance matrix between $X$ and $Y$, whose $(i, j)$th entry is the covariance between features $x_i$ and $y_j$. Hence, the feature pair $u$ and $v$ of maximum covariance is saved in the final fused vector. However, it is possible that a few of the feature pairs are redundant. This process is continued until all pairs have been compared with each other. In the end, a fused vector is obtained, denoted by $F$, of dimension $N \times f$, where $f$ denotes the feature length, which varies according to the selected features. In this work, the fused feature length is $N \times 3294$ for the Caltech-101 dataset, $N \times 2981$ for the Birds dataset, and $N \times 3089$ for the Butterflies dataset.
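Because the fusion rule as written admits more than one reading, the NumPy sketch below shows only one plausible interpretation, under two labeled assumptions: the shorter Inception V3 matrix is padded with the average feature value computed from the longer VGG19 matrix, and pairs with near-zero covariance are treated as redundant and dropped.

```python
import numpy as np

def pmc_fuse(X, Y, eps=1e-8):
    """One plausible reading of the PMC fusion step (a sketch, not the
    authors' exact algorithm). X: (N, 4096) VGG19 features; Y: (N, 2048)
    Inception V3 features. Returns an (N, f) fused matrix with f <= 4096."""
    n, p = X.shape
    _, q = Y.shape
    if q < p:
        # Equalize lengths: pad the shorter matrix with the average feature
        # value computed from the higher-length matrix (assumption).
        Y = np.hstack([Y, np.full((n, p - q), X.mean())])
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cov = (Xc * Yc).sum(axis=0) / (n - 1)   # covariance of each feature pair
    keep = np.abs(cov) > eps                # drop redundant (near-zero-cov) pairs
    # From each surviving pair, keep the member with the larger variance.
    fused = np.where(X.var(axis=0) >= Y.var(axis=0), X, Y)
    return fused[:, keep]
```

Under this reading, the fused length varies with the data, consistent with the dataset-dependent dimensions reported above.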

Feature Selection
Feature selection is an exciting research topic in machine learning (ML) nowadays and shows significant improvement in classification accuracy. In this work, we propose a new technique for feature selection, namely, Multi Logistic Regression controlled Entropy-Variances (MRcEV). It exploits a partial-derivative-based activation function to remove the irrelevant features, and the remaining robust features are passed to the entropy-variances function. Through the latter, a new vector is obtained, which only contains positive values. Finally, this vector is presented to the ESD fitness function, and the validity of the proposed technique is determined. Mathematically, the formulation is given as follows. For a given dataset, the fused data are represented as $\Delta = \{F^{(i)}, y^{(i)}\}$ with $n$ sample images, where $F^{(i)}$ denotes the fused feature vector, which is utilized as the input, with $F^{(i)} \in \mathbb{R}^{f}$, and $y^{(i)}$ indicates the corresponding label, defined as $y^{(i)} \in \mathbb{R}$. The probability of $F^{(i)}$ belonging to class $c$ is then computed as follows:

$$P\left(y = c \mid F^{(i)}\right) = \frac{\exp\left(w_c^{T} F^{(i)}\right)}{\sum_{k=1}^{C} \exp\left(w_k^{T} F^{(i)}\right)}.$$

The parameters of logistic regression, $w = (w_1, w_2, \ldots, w_C)$, are obtained by minimizing the negative likelihood of the features. If the features are independent, then the multinomial negative log-likelihood is computed as follows:

$$L(w) = -\sum_{i=1}^{n} \log P\left(y^{(i)} \mid F^{(i)}; w\right).$$

To get a sparse model, a regularization parameter is added to the negative log-likelihood. The modified MLR criterion for the active features is defined as follows:

$$L_r(w) = L(w) + r \sum_{k=1}^{C} \lVert w_k \rVert_{1},$$

where $r$ is the regularization parameter. At the minimum value of $L_r$, the partial derivative with respect to $w_k$ is bounded as follows:

$$\left| \frac{\partial L(w)}{\partial w_k} \right| \le r.$$

This expression shows that if the partial derivative of $L$ with respect to $w_k$ is less than $r$, then that feature value is set to zero and removed from the final vector. Later, an entropy-variances-based function is implemented to obtain a more robust vector. Mathematically, this function is formulated as:

$$EV(F_s) = H(F_s) \times \sigma^{2}(F_s),$$

where $H$ is an entropy function, $\sigma^{2}$ denotes the variance of the selected vector, and $EV(F_s)$ represents the final entropy-variances function. The selected features are passed to this function to get a clear difference among all the features with respect to the classification classes. The proposed selection technique picks roughly 50% to 60% of the most robust features from the fused feature vector. The selected features are finally verified through the ESD classifier [39], an ensemble learning classifier in which the subspace discriminant method is used. The proposed system's predicted results are shown in Figures 4-6.
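As a rough, non-authoritative illustration of this pipeline, the scikit-learn/NumPy sketch below substitutes an off-the-shelf L1-regularized multinomial logistic regression for the partial-derivative pruning (the threshold r maps onto the regularization strength C), uses an assumed form of the entropy-variance score, and approximates the ESD fitness function with bagged LDA learners over random feature subspaces.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

def mrcev_select(F, y, keep_ratio=0.55, C=0.1):
    """Sketch of MRcEV-style selection (assumed reading, not the exact method).
    F: (n, f) fused features; y: (n,) labels. Returns the reduced matrix."""
    # Step 1: L1-regularized multinomial logistic regression; features whose
    # coefficients shrink to zero (partial derivative below the threshold r,
    # here controlled via C) are removed.
    mlr = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=2000)
    mlr.fit(F, y)
    active = np.abs(mlr.coef_).max(axis=0) > 0
    F_act = F[:, active]
    # Step 2: assumed entropy-variance score H(feature) * Var(feature).
    p = np.clip(F_act - F_act.min(axis=0), 1e-12, None)
    p = p / p.sum(axis=0)
    entropy = -(p * np.log(p)).sum(axis=0)
    score = entropy * F_act.var(axis=0)
    k = max(1, int(keep_ratio * F_act.shape[1]))  # keep ~50-60% of features
    return F_act[:, np.argsort(score)[-k:]]

# ESD-style fitness check: bagged LDA learners on random feature subspaces
# (scikit-learn >= 1.2 uses `estimator=`; older versions use `base_estimator=`).
esd = BaggingClassifier(estimator=LinearDiscriminantAnalysis(),
                        n_estimators=30, max_features=0.5)
```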

Results
This section presents the simulation results with detailed numerical analysis and visual plots. As stated above, in this work, we utilize four publicly available datasets for the evaluation of the proposed framework: Caltech-101, the Birds database, the Butterflies database, and CIFAR-100 [40]. A brief description of the selected datasets is given in Table 1, where we have highlighted the total number of images, their specific classes (categories), and the number of images that each class comprises. Understandably, Caltech-101 and CIFAR-100 are relatively more challenging for object classification. For validation, the 60:40 approach is employed along with ten-fold cross-validation. We used various classifiers for the experimental process, such as ensemble learning, SVM, KNN, and linear discriminant classifiers. The performance of each classifier is validated using three essential measures: accuracy, false negative rate (FNR), and computational time. All the simulations are conducted in MATLAB2019a installed on a machine with a 2.4 GHz Core i7 processor, 16 GB of RAM, a 128 GB SSD, and a Radeon R7 graphics card.
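For readers who want to reproduce this protocol outside MATLAB, the following self-contained scikit-learn sketch mirrors it under stated assumptions: synthetic stand-ins replace the real selected features, and the ESD classifier is again approximated by bagged LDA learners over random subspaces.

```python
import time
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

# Synthetic stand-ins for the selected features and labels (assumption),
# so the snippet runs on its own.
rng = np.random.default_rng(0)
F_sel = rng.normal(size=(300, 50))
y = rng.integers(0, 5, size=300)

# ESD-style classifier: bagged LDA over random feature subspaces.
esd = BaggingClassifier(estimator=LinearDiscriminantAnalysis(),
                        n_estimators=30, max_features=0.5)

# 60:40 training/testing split, plus ten-fold cross-validation.
X_tr, X_te, y_tr, y_te = train_test_split(F_sel, y, train_size=0.6,
                                          stratify=y, random_state=0)
t0 = time.time()
esd.fit(X_tr, y_tr)
acc = esd.score(X_te, y_te)
cv_acc = cross_val_score(esd, F_sel, y, cv=10).mean()
print(f"holdout acc = {acc:.3f}, 10-fold CV acc = {cv_acc:.3f}, "
      f"time = {time.time() - t0:.1f} s")
```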

Caltech-101 Dataset Results
The results achieved on the Caltech-101 dataset are presented in three different ways: in the first method, both the VGG19- and InceptionV3-based deep features are fused using a serial-based method, and the classification is performed without feature selection. In the second method, the fusion of deep features is conducted using the proposed fusion approach, as presented in Section 4.2. In the third method, feature selection is performed on the proposed fused vector, followed by classification. The results are shown in Table 2, where it is evident that the ESD classifier yields the best results against the rest for each method. However, it may be noticed that a massive difference exists between the accuracies achieved using M1 and the other methods. For example, consider the case of the ESD classifier, where the achieved accuracy rises from 79% to 90.8% upon using the proposed fusion method, and further jumps to 95.5% once the proposed selection method is applied. Additionally, observe that the computational time drops by around 74% between M1 and the P-Selection method, making the latter superior to the other two methods. The accuracy of the P-Selection method may also be verified through Figure 7. The effectiveness of the proposed P-Fusion and P-Selection methods with other classifiers is also evident in Table 2. Observe that the best accuracies are provided by the P-Selection method irrespective of the classifier, while P-Fusion stands second, both in terms of accuracy and computational time. Overall, the proposed selection method shows significant performance with the ESD classifier on the Caltech-101 dataset. Table 2. Proposed classification results using the Caltech-101 dataset. M1 represents simple serial-based fusion and classification, P-Fusion represents the proposed fusion approach, and P-Selection represents the proposed selection method results. ESD denotes ensemble subspace discriminant, LDA linear discriminant analysis, LSVM linear support vector machine, QSVM quadratic SVM, and Co-KNN cosine K-nearest neighbor.

Birds Dataset Results
The classification results using the Birds dataset are presented in this section. As before, three methods are applied for the evaluation, and all the trends observed previously hold true in this case as well. Table 3 summarizes these results and verifies that the ESD classifier yields the best results for all three methods when compared with the various classifiers. Irrespective of the classifier used, it may also be verified that the proposed fusion method outperforms M1 both in terms of the achieved accuracies and computational time, while the proposed selection method surpasses even the fusion method in both metrics. Its accuracy is also confirmed by Figure 8. Due to the simplicity of the dataset, the accuracies achieved by the three methods are relatively comparable, unlike in the case of Caltech-101, where the proposed methods outperformed M1 by a considerable margin. The computational time, however, gives the proposed methods a substantial edge over the equivalent techniques. Table 3. Proposed classification results using the Birds dataset. M1 represents simple serial-based fusion and classification, P-Fusion represents the proposed fusion approach, and P-Selection represents the proposed selection method results.

Butterflies Dataset Results
The results for the Butterflies dataset are given in Table 4. It may be observed that the ESD classifier gives better outcomes for all three feature methods. For M1, the ESD classifier achieves an accuracy of 95.1%, which is improved to 95.6% after using the P-Fusion method. The computational time of M1 is 46.05 (sec), but after P-Fusion, the time is reduced to 31.95 (sec). In comparison, the P-Selection method achieves an accuracy of 98%, which is better than those of M1 and P-Fusion. Moreover, the computational time of this method is 19.53 (sec), which is also the minimum. The performance of the ESD classifier for the P-Selection method may also be verified through Figure 9. The performance of the ESD classifier is also compared with a few other well-known techniques, such as SVM, KNN, and LDA, as given in Table 4. From the results, it can be clearly seen that all the classifiers provide better accuracy with the P-Selection method. Moreover, it is also concluded that W-KNN performs best in terms of computational time. Table 4. Proposed classification results using the Butterflies dataset. M1 represents simple serial-based fusion and classification, P-Fusion represents the proposed fusion approach, and P-Selection represents the proposed selection method results.

CIFAR-100 Dataset Results
This dataset consists of 100 object classes, such as bus, chair, table, train, and bed, and each class consists of 600 samples, making this dataset more challenging. There are 50,000 images available for training, and 10,000 images for testing. In this work, we utilize this dataset for the evaluation of the proposed technique. The results are given in Tables 5 and 6. In Table 5, the proposed training results are provided, which show a maximum accuracy of 69.76% and an error rate of 30.24%. For the simple fusion method (M1), the noted accuracy is 51.34%, and the computation time is 608 (min). After employing the proposed fusion, the execution takes 524 (min), with an improved accuracy of 63.97%. The proposed P-Selection method further improves the accuracy, reaching 69.76%, while the execution time is also minimized to 374 (min). The testing results are given in Table 6; the maximum achieved accuracy of the testing process is 68.80%, using the P-Selection method and the ESD classifier. The accuracy is not impressive, but in view of the dataset's complexity, it is acceptable. The accuracy of the ESD classifier using the P-Selection method can be further verified through Figure 10 (confusion matrix). Table 5. Proposed training results on the CIFAR-100 dataset. Figure 10. Confusion matrix of the CIFAR-100 dataset for the proposed P-Selection method.

Analysis and Comparison with Existing Techniques
A comprehensive analysis and comparison with existing techniques are presented in this section to examine the authenticity of the proposed method's results. The proposed fusion and robust feature selection methods give a significant performance of 95.5%, 100%, 98%, and 68.80%, respectively, with the ESD classifier on the selected datasets. The results can be seen in Tables 2-4 and 6. However, it is essential to examine the accuracy of ESD against each classifier based on a detailed statistical analysis. For the Caltech-101 dataset, we run the proposed algorithm 500 times for each method and obtain two sets of accuracies: average (76.3%, 87.9%, and 92.7%) and maximum (79%, 90.8%, and 95.5%). These accuracies are also plotted in Figure 11a, which shows that only a minor change occurs in the accuracy over 500 iterations. For the Birds database, two sets of accuracies are also obtained: minimum (97.2%, 98.9%, and 99.4%) and maximum (99%, 99.5%, and 100%). These values are plotted in Figure 11b, where it can be observed that the variation in M1 is a bit higher compared to P-Fusion and P-Selection. In the end, the statistical analysis is conducted for the Butterflies dataset, as shown in Figure 11c; this figure shows a slight change in the accuracy of each method. Figure 11. Statistical analysis of the ESD classifier using all three methods, where (a) represents the M1 method, (b) the P-Fusion method, and (c) the P-Selection method.
We performed classification using other deep neural networks, such as VGG16, AlexNet, ResNet50, and ResNet101, to compare against the proposed scheme's classification performance. The results are computed from the last two layers: VGG16 (FC7 and FC8), AlexNet (FC7 and FC8), and ResNet (average pool and FC layer). The features extracted from these layers are fused using the proposed approach, and the selection technique is then performed. For the classification of these neural networks, we used their original classifier, named Softmax. Results are given in Tables 7 and 8 below for the Caltech-101 and CIFAR-100 datasets. In these tables, we notice that the P-Fusion and P-Selection techniques perform well using the proposed scheme. A brief comparison with existing techniques is also presented in Table 9, for which we computed the results on different training/testing ratios and obtained a variety of results. Based on the results, it is shown that an increase in the training ratio minimizes the error rate.

Conclusions
A new multi-layer deep feature fusion and selection-based method for object classification is presented in this work. The major contribution of this work lies in the fusion of deep learning models and the subsequent selection of the robust features for final classification. Three core steps are involved in the proposed system: feature extraction using transfer learning, fusion of the features of two different deep learning models (VGG19 and Inception V3) using PMC, and selection of the robust features using the Multi Logistic Regression controlled Entropy-Variances (MRcEV) method. An ESD classifier is used to validate the performance of MRcEV. We utilize four datasets for the experimental process and demonstrate improved accuracy. From the results, we conclude that the proposed method is useful for large as well as small datasets. The fusion of two different deep learning features has a clear impact on classification accuracy. Additionally, the selection of robust features affects both computational time and classification accuracy. The main limitation of the proposed method is its dependence on feature quality: with low-quality images, it is not possible to extract strong features. In the future, this problem will be addressed through a contrast-stretching-based deep learning architecture. Moreover, for the improvement of the experimental process, the Caltech-256 dataset will be considered.
Author Contributions: M.R. and M.A.K. developed this idea, and they were responsible for the first draft. M.A. was responsible for mathematical formulation. S.H.W. supervised this work. S.R.N. gave technical support for this work. T.S. and A.R. were responsible for the final proofreading. All authors have read and agreed to the published version of the manuscript.
Funding: There was no funding involved in this work.

Conflicts of Interest:
The authors declare no conflict of interest.