Machines
  • Article
  • Open Access

6 May 2022

Object Detection via Gradient-Based Mask R-CNN Using Machine Learning Algorithms

1 Department of Information Engineering, I-Shou University, Kaohsiung City 84001, Taiwan
2 College of Information and Communications Technology, Bulacan State University, Bulacan 3000, Philippines
3 Department of Electrical Engineering, I-Shou University, Kaohsiung City 84001, Taiwan
* Author to whom correspondence should be addressed.

Abstract

Object detection has received a lot of research attention in recent years because of its close association with video analysis and image interpretation. Detecting objects in images and videos is a fundamental task and is considered one of the most difficult problems in computer vision. Many machine learning and deep learning models have been proposed to solve it, and in current practice a detection algorithm must run end to end in as little time as possible. This paper proposes a method called GradCAM-MLRCNN that combines Gradient-weighted Class Activation Mapping++ (Grad-CAM++) for localization and Mask Regional Convolutional Neural Network (Mask R-CNN) for object detection, along with machine learning algorithms. In our proposed method, the network is trained on images together with masks that show where the objects are in each image. As in most localization networks, a bounding box is regressed around the region of interest, and, as in any classification task, the multi-class log loss is minimized during training. The model improves calculation time, speed, and efficiency, recognizing objects in images accurately; we compare state-of-the-art machine learning algorithms, namely decision tree, Gaussian algorithm, k-means clustering, k-nearest neighbor, and logistic regression. Among these, logistic regression performed best, with an accuracy rate of 98.4%, recall rate of 99.6%, and precision rate of 97.3% with ResNet 152 and VGG 19. Furthermore, we verified the goodness of fit of our proposed model using the chi-square statistical method and demonstrated that our solution achieves high precision while maintaining a fair recall level.

1. Introduction

Due to its vast range of applications and recent technological developments, object detection has attracted a lot of attention in recent years. It has been used in robotic vision, security monitoring, drone scene analysis, autonomous driving, and transit surveillance. The rapid advancement of object detection systems can be linked to numerous factors, including the improvement of deep convolutional neural networks and GPU computing capability. Deep learning methods are now widely used in computer vision, including object detection in both generic and domain-specific contexts.
Object detection is the computer vision task of finding semantic objects of a given class (such as people, buildings, or cars) in digital images and videos. It deals with computer vision and the images to be processed. Object detection is employed in a wide range of industries, including security, military, transportation, health, and the life sciences. Object detection benchmarks such as Caltech, KITTI, ImageNet [1], PASCAL VOC, MS COCO, and Open Images V5 [2] have also been utilized in the past.
We primarily focus on the problem of object localization and detection, which has a wide range of applications in daily life [3]. Among the well-known network interpretation methods, Grad-CAM++ [4] is a particularly good algorithm for object localization, and it holds up well in recent sanity-check studies of interpretation methods. Hence, our research work uses Grad-CAM++ for localization. The main purpose of object detection is to recognize objects in an image: not only to return the class confidence for each object, but also to predict its bounding box. Among prior efforts in object detection, the Regional Convolutional Neural Network (RCNN) [5] stands out as the most notable, combining selective search [6] and bounding box regression to achieve high detection performance. Because the core of this research work builds on RCNN rather than You Only Look Once (YOLO), we provide an alternative object detection methodology, the proposed GradCAM-MLRCNN, which lowers the complexity of RCNN and increases the system's overall performance.
This research work is organized as follows. Section 2 discusses related work. Section 3 explains the research gap and our contributions. Section 4 describes our proposed method (GradCAM-MLRCNN). Section 5 presents our experiments, comparing state-of-the-art machine learning algorithms (decision tree, Gaussian algorithm, k-means clustering, k-nearest neighbor, and logistic regression) across several pre-trained models. Finally, Section 6 gives the conclusions and future work.

3. Research Gap and Contributions

Deep learning approaches have already achieved state-of-the-art results on traditional object detection benchmarks. Mask R-CNN won the COCO object detection competition in 2016 by outperforming the other detection models. However, Mask R-CNN results are difficult to compare directly due to the complex nature of the problem, the enormous number of annotated samples, and the wide range of object scales, and the lack of better visualization and accuracy leaves a research gap. To address these issues, combining the two methodologies, Grad-CAM++ for localizing objects and Mask R-CNN for detecting objects, together with machine learning algorithms, provides enhanced image presentation and better accuracy for predicted objects in an image, filling this research gap.
The key contribution of this paper is the use of Grad-CAM++ combined with the Mask R-CNN framework to recognize multi-scale objects on the COCO dataset, with pre-trained models such as VGG 16, VGG 19, ResNet 101, and ResNet 152 as backbone networks, along with state-of-the-art machine learning algorithms: logistic regression, decision tree, Gaussian classifier, k-means clustering, and k-nearest neighbor (KNN). While boosting detection accuracy, our proposed method (GradCAM-MLRCNN) efficiently eliminates detector box redundancy. The efficiency of the proposed method is demonstrated by comparing the pre-trained models across the machine learning algorithms.
We use Grad-CAM++ for object localization by producing a heat map, and Mask R-CNN for object detection, together with machine learning algorithms for better class accuracy. Even when Mask R-CNN fails to detect an object in an image, the combination of Grad-CAM++ and a machine learning algorithm can detect the object accurately. That is an important feature of the proposed method.

4. Proposed Method

In this section, we provide a brief description of the mathematical derivation for the proposed method, which uses partial derivatives of the last convolutional layer's feature maps with respect to a specific class score as weights for object localization. We then analyze object detection by Mask R-CNN through ResNet 152.
The Grad-CAM++ model, originally used to locate an object and generate a visual explanation for a network without affecting its structure, is the starting point of the proposed methodology. Gradient-weighted Class Activation Mapping++ (Grad-CAM++) employs the gradients of any target concept to build a coarse localization map that highlights the places in the image most relevant for predicting objects. The approach presented in this research is primarily influenced by two commonly used algorithms, CAM [7] and Grad-CAM [8]. Both CAM and Grad-CAM work on the premise that the final score $Y^c$ for a given class $c$ may be expressed as a linear combination of the class's global average pooled last convolutional layer feature maps $A^k$, as in Equations (1) and (2).
$$Y^c = \sum_k w_k^c \frac{1}{Z} \sum_i \sum_j A_{ij}^k \quad (1)$$
$$L_{ij}^c = \sum_k w_k^c A_{ij}^k \quad (2)$$
where $A^k$ is the $k$-th feature map activation, $Y^c$ is the neural network output for class $c$ before the softmax, and $Z$ is the number of pixels in the feature map.
$L_{ij}^c$ directly correlates with the importance of a particular spatial location $(i, j)$ for a particular class $c$, and thus functions as a visual explanation of the class predicted by the network. The weights $w_k^c$ of Grad-CAM++ are formulated by Equation (3):
$$w_k^c = \sum_i \sum_j \alpha_{ij}^{kc} \cdot \mathrm{ReLU}\left(\frac{\partial Y^c}{\partial A_{ij}^k}\right) \quad (3)$$
Here, $\alpha_{ij}^{kc}$ represents a partial linearization of the deep network downstream from $A^k$, which evaluates the 'importance' of feature map $k$ for class $c$.
ReLU is the rectified linear unit transfer function. In this formulation, $w_k^c$ indicates the relevance of a given activation map $A^k$. As established in prior research on pixel-space visualization, such as deconvolution [22] and guided backpropagation [23], positive gradients are significant in constructing saliency maps for a given convolutional layer: a positive gradient for an activation map $A^k$ at location $(i, j)$ indicates that increasing the intensity of pixel $(i, j)$ has a positive impact on the class score $Y^c$. For a given class $c$ and activation map $k$, we now formalize a method for determining the gradient weights $\alpha_{ij}^{kc}$. Let $Y^c$ be the score of a certain class $c$. Substituting Equation (3) into Equation (1), we obtain Equation (4):
$$Y^c = \sum_k \left[ \sum_i \sum_j \left\{ \sum_a \sum_b \alpha_{ab}^{kc} \cdot \mathrm{ReLU}\left(\frac{\partial Y^c}{\partial A_{ab}^k}\right) \right\} A_{ij}^k \right] \quad (4)$$
Here, $(i, j)$ and $(a, b)$ are iterators over the same activation map $A^k$. Without loss of generality, we drop the ReLU, as it only functions as a threshold allowing the gradients to flow back. Taking the partial derivative with respect to $A_{ij}^k$ on both sides:
$$\frac{\partial Y^c}{\partial A_{ij}^k} = \sum_a \sum_b \alpha_{ab}^{kc} \cdot \frac{\partial Y^c}{\partial A_{ab}^k} + \sum_a \sum_b A_{ab}^k \left\{ \alpha_{ij}^{kc} \cdot \frac{\partial^2 Y^c}{(\partial A_{ij}^k)^2} \right\} \quad (5)$$
Solving for $\alpha_{ij}^{kc}$ gives Equation (6):
$$\alpha_{ij}^{kc} = \frac{\dfrac{\partial^2 Y^c}{(\partial A_{ij}^k)^2}}{2 \dfrac{\partial^2 Y^c}{(\partial A_{ij}^k)^2} + \sum_a \sum_b A_{ab}^k \dfrac{\partial^3 Y^c}{(\partial A_{ij}^k)^3}} \quad (6)$$
Substituting Equation (6) into Equation (3) yields Equation (7):
$$w_k^c = \sum_i \sum_j \left[ \frac{\dfrac{\partial^2 Y^c}{(\partial A_{ij}^k)^2}}{2 \dfrac{\partial^2 Y^c}{(\partial A_{ij}^k)^2} + \sum_a \sum_b A_{ab}^k \dfrac{\partial^3 Y^c}{(\partial A_{ij}^k)^3}} \right] \cdot \mathrm{ReLU}\left(\frac{\partial Y^c}{\partial A_{ij}^k}\right) \quad (7)$$
Comparing Equations (3) and (7) with Grad-CAM shows that Grad-CAM is the special case of Grad-CAM++ in which $\alpha_{ij}^{kc} = \frac{1}{Z}$.
Let $M^k$ be the global average pooled output:
$$M^k = \frac{1}{Z} \sum_i \sum_j A_{ij}^k \quad (8)$$
Then, according to Grad-CAM, the class score is computed as:
$$Y^c = \sum_k w_k^c \cdot M^k \quad (9)$$
where $w_k^c$ indicates the relevance of a given activation map $A^k$. Taking gradients on both sides with respect to $M^k$:
$$\frac{\partial Y^c}{\partial M^k} = \frac{\partial Y^c / \partial A_{ij}^k}{\partial M^k / \partial A_{ij}^k} \quad (10)$$
Taking the partial derivative of Equation (8):
$$\frac{\partial M^k}{\partial A_{ij}^k} = \frac{1}{Z} \quad (11)$$
Substituting into Equation (10):
$$\frac{\partial Y^c}{\partial M^k} = \frac{\partial Y^c}{\partial A_{ij}^k} \cdot Z \quad (12)$$
From Equation (9) it follows that:
$$\frac{\partial Y^c}{\partial M^k} = w_k^c \quad (13)$$
Hence,
$$w_k^c = Z \cdot \frac{\partial Y^c}{\partial A_{ij}^k} \quad (14)$$
$$\frac{\partial Y^c}{\partial A_{ij}^k} = \begin{cases} 1 & \text{if } A_{ij}^k = 1 \\ 0 & \text{otherwise} \end{cases} \quad (15)$$
That is, $A_{ij}^k = 1$ if an object is present in the visual pattern and 0 otherwise.
Hence, the final output is computed as a weighted sum over all $A^k$. Because the ReLU activation function is used, negative values are set to zero. Therefore, if objects are present in an image, they will be masked and predicted by our proposed method, even when those objects are small in scale. The heat map is then calculated by Grad-CAM++ as the weighted combination of the feature map activations $A^k$ with weights $w_k^c$.
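The derivation above can be sketched numerically. The following is a minimal NumPy sketch (not the authors' implementation) of the Grad-CAM++ weights of Equations (3) and (6) and the resulting heat map, under the common assumption that the second- and third-order derivatives are approximated by powers of the first-order gradient (as for an exponential score function); `grads` and `acts` are hypothetical arrays holding $\partial Y^c / \partial A^k$ and $A^k$.

```python
import numpy as np

def grad_cam_pp_weights(grads, acts):
    """Grad-CAM++ channel weights w_k^c from first-order gradients
    dY^c/dA (grads) and activations A (acts), both shaped (K, H, W).
    Higher-order derivatives are approximated by powers of the
    first-order gradient (assumption: exponential score function)."""
    grads_2 = grads ** 2                      # ~ second derivative
    grads_3 = grads ** 3                      # ~ third derivative
    # Equation (6): alpha = d2 / (2*d2 + sum_ab A_ab * d3)
    denom = 2.0 * grads_2 + np.sum(acts, axis=(1, 2), keepdims=True) * grads_3
    denom = np.where(denom != 0.0, denom, 1e-8)   # guard against division by zero
    alpha = grads_2 / denom
    # Equation (3): w_k^c = sum_ij alpha_ij^kc * ReLU(dY^c/dA_ij^k)
    return np.sum(alpha * np.maximum(grads, 0.0), axis=(1, 2))

def grad_cam_pp_heatmap(grads, acts):
    """Heat map: ReLU of the weighted combination of activation maps."""
    w = grad_cam_pp_weights(grads, acts)          # shape (K,)
    cam = np.tensordot(w, acts, axes=([0], [0]))  # shape (H, W)
    return np.maximum(cam, 0.0)
```

The returned heat map can then be upsampled to the input resolution and overlaid on the image.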

4.1. Mask R-CNN

Mask R-CNN is a powerful, simple, and flexible paradigm for instance segmentation (i.e., assigning distinct labels to different classes). Mask R-CNN has the advantage of providing both a bounding box and semantic segmentation, allowing a multi-stage semantic segmentation technique using the same architecture [24]. In this paper, we adapted and assessed Mask R-CNN for object detection; it is made up of two main phases extended from Faster R-CNN. The first phase extracts features using ResNet 152 and then utilizes the region proposal network (RPN) to generate potential bounding boxes [25]. The extracted features are shared with the second phase, which categorizes the objects and generates class labels, bounding box offsets, and binary mask images for each instance in each Region of Interest (RoI). The proposed framework for recognizing and locating objects at the pixel level is depicted in Figure 1. To extract features, the ResNet 152 model is first applied to the input frame. The proposed approach uses ResNet 152 as a convolutional backbone [26] for feature extraction, bounding box classification, and regression. The collected feature maps are then submitted to a region proposal network (RPN), which constructs multiple bounding boxes based on the objectness of the feature maps (i.e., the presence or absence of an object in the candidate regions). Positive RoIs containing objects, including shadows, are submitted to the RoI Align phase to ensure that each RoI has the appropriate spatial alignment.
Figure 1. Architecture of the proposed method (GradCAM-MLRCNN).

4.2. Object Detection Based on RPN

The region proposal network (RPN) [25] applies a light binary classifier to multiple sets of predetermined anchors (bounding boxes) over the entire feature map produced by the CNN, then calculates the objectness score, which determines the foreground object in each candidate bounding box. RPN is initialized by moving a sliding window across the extracted feature maps. A collection of anchors with predetermined scales and aspect ratios is centered on each sliding window position. These candidate anchors are examined to see whether an object is present. If a candidate bounding box covers a foreground object, its objectness score, computed via intersection-over-union (IoU), will be high; the IoU is the intersection area between the predicted RoI and its ground-truth RoI divided by the union area of the two regions. If the IoU is greater than 0.7, the candidate is considered a positive Region of Interest (RoI). When the IoU is less than 0.3, it is considered a negative bounding box, i.e., background that does not cover any foreground object.
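The IoU computation and threshold-based anchor labelling described above can be sketched as follows; this is an illustrative implementation, not the authors' code, assuming boxes are given as hypothetical `(x1, y1, x2, y2)` corner tuples.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_box, pos_thresh=0.7, neg_thresh=0.3):
    """RPN-style anchor labelling: positive RoI above 0.7 IoU, negative
    (background) below 0.3, and ignored in between."""
    score = iou(anchor, gt_box)
    if score > pos_thresh:
        return "positive"
    if score < neg_thresh:
        return "negative"
    return "ignore"
```

Anchors falling between the two thresholds contribute to neither the positive nor the negative set.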
RPN uses the non-maximum suppression (NMS) [25] technique to reject redundant and overlapping RoI proposals based on their IoU scores: low-scoring bounding boxes are deleted, and the high-scoring (positive) bounding boxes progress to the classification stage. Each candidate bounding box is then transformed into a low-dimensional vector before being fed to the fully connected layers and the fully convolutional network (FCN). The first layer is the regression layer, which generates 4N outputs representing the coordinates of N anchor boxes. The second layer is the classification layer, which generates 2N probability scores determining whether a foreground object is present or absent at each proposal. As a result, the RPN regressor refines each surviving bounding box by shifting and resizing it toward the closest accurate object boundaries. The generated positive RoIs are sent to the following stage, which includes bounding box regression and foreground object label classification. The proposed framework's outputs fall into three categories: the object class, the heat map, and the background class for the bounding box.
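The greedy NMS step can be sketched as below; this is a minimal illustrative version, not the authors' implementation, assuming corner-format boxes and a hypothetical IoU threshold of 0.5.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop every remaining box whose IoU with it exceeds iou_thresh.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

In practice the suppression is applied per class, so boxes of different classes never suppress each other.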

4.3. Loss Function

Mask R-CNN applies a multi-task loss function during learning to evaluate the model and ensure that it fits unseen data. This loss is computed as a weighted sum of several losses during training at every phase of the model on each RoI proposal [27], as shown in Equation (16):
$$Loss = L_{Class} + L_{Boundingbox} + L_{Mask} \quad (16)$$
where $L_{Class}$ (the classification loss) shows the convergence of the predictions to the true class; it combines the classification losses incurred during the training of the RPN and the Mask R-CNN heads. $L_{Boundingbox}$ (the bounding box loss) shows how well the model localizes objects; it combines the bounding box localization losses incurred during the training of the RPN and the Mask R-CNN heads. The $L_{Class}$ and $L_{Boundingbox}$ losses are computed by Equations (17) and (18):
$$L_{Class}(p, u) = -\log p_u \quad (17)$$
where $p_u$ is the predicted probability of the ground-truth class $u$ for each positive bounding box.
$$L_{Boundingbox}(t^u, v) = \sum_{i \in \{x, y, w, h\}} L_1^{smooth}(t_i^u - v_i) \quad (18)$$
$$\text{where } L_1^{smooth}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
Here $t^u$ is the predicted bounding box for class $u$ and $v$ is the ground-truth bounding box, with $i$ ranging over the box coordinates $(x, y, w, h)$.
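The smooth-L1 penalty of Equation (18) can be sketched directly; this is a minimal illustration assuming scalar box offsets, not the authors' implementation.

```python
def smooth_l1(x):
    """Smooth-L1 penalty from Equation (18): quadratic for |x| < 1,
    linear (|x| - 0.5) otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1 else ax - 0.5

def bbox_loss(pred, target):
    """L_Boundingbox: sum of smooth_l1 over the (x, y, w, h) offsets."""
    return sum(smooth_l1(p - v) for p, v in zip(pred, target))
```

The quadratic region near zero makes the loss less sensitive to small regression errors, while the linear tail limits the influence of outliers.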
Our proposed method combines Grad-CAM++ for object localization and Mask Regional Convolutional Neural Network (Mask R-CNN) for object detection, together with machine learning algorithms such as decision tree, Gaussian algorithm, k-means clustering, and k-nearest neighbor, to improve the classifier's accuracy by utilizing a majority voting concept. That is, Mask R-CNN's predicted label and at least one of the machine learning algorithms' predicted labels for an object in an image should match the ground truth label. Even when Mask R-CNN does not predict the class label for an object in an image, the machine learning algorithms combined with Grad-CAM++ help to predict it. That is an important feature of the proposed method.
Hence, we use the above loss function together with the sigmoid function in Equation (19). For a given set of features (inputs) $x$, the target variable (output) $y$ can take only discrete values in a classification problem.
The sigmoid function used in logistic regression is:
$$\frac{1}{1 + e^{-\text{value}}} \quad (19)$$
where $\text{value} = b_0 + b_1 x$.
The logistic regression model, Equation (20), is defined by:
$$y = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}} \quad (20)$$
where
$y$ is the final predicted output,
$b_0$ is the bias, and
$b_1$ is the coefficient of input $x$.
The coefficient $b_1$ is learned from the training data.
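Equations (19) and (20) can be checked with a short sketch; `b0` and `b1` here are hypothetical learned parameters, and the code simply shows that Equation (20) is algebraically the sigmoid of $b_0 + b_1 x$.

```python
import math

def sigmoid(value):
    """Logistic function 1 / (1 + e^{-value}), Equation (19)."""
    return 1.0 / (1.0 + math.exp(-value))

def predict(x, b0, b1):
    """Logistic-regression prediction y = e^{b0+b1*x} / (1 + e^{b0+b1*x}),
    Equation (20), which equals sigmoid(b0 + b1*x)."""
    return sigmoid(b0 + b1 * x)
```

Thresholding the output (e.g., at 0.5) turns the continuous probability into a discrete class label.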

4.4. Summary

  • The important feature of this research work is that it is utilizing the heat map obtained from Grad-CAM++ to detect the object along with Mask R-CNN, which worked with machine learning algorithms to boost the accuracy of object detection using our proposed GradCAM-MLRCNN method.
  • Another feature of this research work is that it can detect an object efficiently when the object lies within the heat map region produced by Grad-CAM++.
  • Our proposed method efficiently reduces the redundancy of detector boxes and handles multi-scale targets under complex background images.
  • Furthermore, this research work studies the behavior of the Mask R-CNN when it is combined with various machine learning algorithms, such as decision tree, Gaussian algorithm, k-means clustering, and k-nearest neighbor.
After the heat map of the image is produced, it is given to Mask R-CNN to classify the objects. Once the class of an object is detected, logistic regression is trained on the same set of images using the regression equation, and predictions are then made on the testing images. The correctly predicted labels are stored in a new file by checking the classes from both Mask R-CNN and logistic regression, and the final decision is made by comparing the predicted labels with the ground truth labels.
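The label-fusion step described above can be sketched as follows; `fuse_predictions` is a hypothetical helper illustrating the voting bookkeeping during evaluation, not the authors' code.

```python
def fuse_predictions(mask_rcnn_labels, lr_labels, ground_truth):
    """Evaluation-time label fusion: an object counts as correctly
    predicted when the Mask R-CNN class or the logistic-regression
    class matches the ground-truth label; otherwise it is marked None."""
    fused = []
    for m_label, lr_label, gt in zip(mask_rcnn_labels, lr_labels, ground_truth):
        if m_label == gt or lr_label == gt:
            fused.append(gt)      # at least one predictor agrees with ground truth
        else:
            fused.append(None)    # neither predictor found the true class
    return fused
```

This captures the idea that an object missed by Mask R-CNN can still be recovered when the machine learning classifier predicts it correctly.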

5. Results and Discussion

This section contains the description of the main findings of our study, which are the results of comparing some machine learning algorithms (decision tree, Gaussian algorithm, k-means clustering, k-nearest neighbor, and logistic regression) with pre-trained models, such as VGG 16, VGG 19, ResNet 101, and ResNet 152.
This research work was carried out using Python 3.7. We used several pre-trained models as feature extractors, including VGG 16, VGG 19, ResNet 101, and ResNet 152, to employ our proposed GradCAM-MLRCNN method on the MS-COCO dataset. Before training, we resized the images to 224 × 224 pixels. To ease the training of the Mask R-CNN, we set the learning rate to 0.00001. In the proposed method, the feature maps are updated by Grad-CAM++ over a few epochs, after which Mask R-CNN starts training to predict objects in an image, gradually producing a bounding box along with segmentation. It is important to note that Grad-CAM++ is mainly used to distinguish the class label of an object in the image by producing a heat map; Mask R-CNN then refines this by applying a bounding box and segmentation using the NMS technique, which is based upon the IoU. The proposed method was demonstrated using the pre-trained models (VGG 16, VGG 19, ResNet 101, and ResNet 152) as backbone networks for each machine learning algorithm: logistic regression, decision tree, Gaussian classifier, k-means clustering, and k-nearest neighbor (KNN). Logistic regression performed best among the models, with an accuracy rate of 98.4%, recall rate of 99.6%, and precision rate of 97.3% with ResNet 152 and VGG 19. Figure 2 shows the original image and its corresponding class activation map (CAM).
Figure 2. (a) Input image; (b) output of class activation map (CAM).
Figure 3 shows the results of various pre-trained models, such as VGG 16, VGG 19, ResNet 101, and ResNet 152 with the corresponding results of Grad-CAM, Grad-CAM++, bounding box prediction, and proposed method (GradCAM-MLRCNN), which contains the bounding box with mask prediction with respect to logistics regression: (a) various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN. Figure 4 shows the original image and its corresponding class activation map (CAM).
Figure 3. (a) Various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN.
Figure 4. (a) Input image; (b) output of class activation map (CAM).
Figure 5 shows the results of various pre-trained models, such as VGG 16, VGG 19, ResNet 101, and ResNet 152 with the corresponding results of Grad-CAM, Grad-CAM++, bounding box prediction, and proposed method (GradCAM-MLRCNN) that contains the bounding box with mask prediction for the Gaussian algorithm: (a) various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN. Figure 6 shows the original image and its corresponding class activation map (CAM).
Figure 5. (a) Various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN.
Figure 6. (a) Input image; (b) output of class activation map (CAM).
Figure 7 demonstrates the results of various pre-trained models, such as VGG 16, VGG 19, ResNet 101, and ResNet 152 with the corresponding results of Grad-CAM, Grad-CAM++, bounding box prediction, and proposed method (GradCAM-MLRCNN), which contains the bounding box with mask prediction with respect to k-nearest neighbor: (a) various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN. Figure 8 shows the original image and its corresponding class activation map (CAM).
Figure 7. (a) Various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN.
Figure 8. (a) Input image; (b) output of class activation map (CAM).
Figure 9 shows the results of various pre-trained models, such as VGG 16, VGG 19, ResNet 101, and ResNet 152 with the corresponding results of Grad-CAM, Grad-CAM++, bounding box prediction, and proposed method (GradCAM-MLRCNN) that contains the bounding box with mask prediction concerning the decision tree: (a) various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN. Figure 10 shows the original image and its corresponding class activation map (CAM).
Figure 9. (a) Various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN.
Figure 10. (a) Input image; (b) output of class activation map (CAM).
Figure 11 shows the results of various pre-trained models, such as VGG 16, VGG 19, ResNet 101, and ResNet 152 with the corresponding results of Grad-CAM, Grad-CAM++, bounding box prediction, and proposed method (GradCAM-MLRCNN), which is the bounding box with mask prediction for k-means clustering: (a) various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN.
Figure 11. (a) Various pre-trained models; (b) results of Grad-CAM; (c) results of Grad-CAM++; (d) bounding box prediction on Grad-CAM++; and (e) GradCAM-MLRCNN.
As demonstrated in Figure 3, Figure 5, Figure 7, Figure 9 and Figure 11, the framework generated the bounding boxes with segmented masks for each class in an image. To avoid overfitting the model and to improve its performance, transfer learning from the pre-trained models (VGG 16, VGG 19, ResNet 101, and ResNet 152) on the MS-COCO dataset was used. Initially, we used ResNet 152 as the feature extractor. This research work used Mask R-CNN and fine-tuned the parameters to match the pre-trained model to Grad-CAM++. The input frames were scaled to 224 × 224 and have three channels (R, G, and B). The model is optimized using the stochastic gradient descent (SGD) technique with an initial learning rate of 0.00001 to discover the best weights that reduce the error between the expected and desired output. Momentum was used to damp the swings in the weight updates across successive iterations. The proposed model was run for 50 epochs to train and validate the data, and the loss was computed for every epoch during learning.
The confusion matrix on the testing dataset is provided in Table 7 to analyze the efficacy of the proposed method to prove the goodness of fit by chi-square statistical test. As displayed in the confusion matrix, the values in the first row, from left to right, reflect the True-Positive (TP) percent of properly identifying shadow and False-Positive (FP) percent of misclassifying shadow as object. The second row shows the False-Negative percentage (FN) of misclassifying an object as a shadow and the True-Negative percentage (TN) of accurately detecting an object. Three additional measures, namely precision, recall, and accuracy, are determined based on the confusion matrix results to quantitatively evaluate the proposed framework. The findings are tabulated using state-of-the-art machine learning algorithms, such as logistic regression, Gaussian algorithm, k-nearest neighbor, decision tree, and k-means clustering among all of the pre-trained models. Finally, the proposed method proved that logistic regression performed well among other machine learning algorithms with an accuracy rate of 98.4%, recall rate of 99.6%, and precision rate of 97.3% by ResNet 152 and VGG 19.

5.1. Evaluation Metrics-1

In this section, performance metrics such as accuracy, recall, and precision are discussed, analyzed, and compared across the pre-trained models (VGG 16, VGG 19, ResNet 101, and ResNet 152) for each machine learning algorithm: logistic regression, Gaussian algorithm, k-nearest neighbor, decision tree, and k-means clustering. These performance metrics are expressed in Equations (21)–(23):
$$\mathrm{Accuracy}\,(\%) = \frac{TP + TN}{TP + FN + TN + FP} \times 100 \quad (21)$$
$$\mathrm{Recall}\,(\%) = \frac{TP}{TP + FN} \times 100 \quad (22)$$
$$\mathrm{Precision}\,(\%) = \frac{TP}{TP + FP} \times 100 \quad (23)$$
where TP, TN, FN, and FP are True-Positive, True-Negative, False-Negative, and False-Positive, respectively.
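Equations (21)–(23) can be computed from confusion-matrix counts as sketched below; the counts in the test are hypothetical, chosen only to illustrate the calculation, not taken from Table 7.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, recall, and precision in percent from confusion-matrix
    counts, following Equations (21)-(23)."""
    accuracy = (tp + tn) / (tp + fn + tn + fp) * 100
    recall = tp / (tp + fn) * 100
    precision = tp / (tp + fp) * 100
    return accuracy, recall, precision
```

Note that recall penalizes missed objects (FN), while precision penalizes false alarms (FP), which is why the two can trade off against each other.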
In Table 2, the accuracy, recall, and precision are displayed for the logistic regression classifier across the pre-trained models VGG 16, VGG 19, ResNet 101, and ResNet 152. Compared with the other pre-trained models, VGG 19 and ResNet 152 perform better with logistic regression.
Table 2. Performance using logistic regression.
Table 3 shows the performance metrics comparing the pre-trained models VGG 16, VGG 19, ResNet 101, and ResNet 152 for the Gaussian algorithm. With the Gaussian algorithm, VGG 19 performs better in accuracy, recall, and precision; ResNet 152 also achieves better recall and precision than the others.
Table 3. Performance using Gaussian algorithm.
Table 4 presents the performance metrics comparing the pre-trained models VGG 16, VGG 19, ResNet 101, and ResNet 152 for k-nearest neighbor. As observed, ResNet 101 has better accuracy, ResNet 152 and VGG 19 achieve better recall, and ResNet 101 demonstrates better precision.
Table 4. Performance using k-nearest neighbor.
In Table 5, we can observe that the decision tree has better accuracy with ResNet 101, better recall with VGG 16, and better precision with ResNet 101 among the pre-trained models.
Table 5. Performance using decision tree.
In Table 6, the performance metrics were analyzed by comparing the pre-trained models for k-means clustering. VGG 19 and ResNet 152 have better accuracy, VGG 19 has better recall, and ResNet 152 has better precision than the other models.
Table 6. Performance using k-means clustering.

5.2. Evaluation Metrics-2: Goodness of Fit Using (Chi-Square Test)

The performance of the models was evaluated based on goodness of fit using the chi-square test. With 20% of the images as the testing set, the validation results of the proposed method are determined as follows:
The chi-square distributions are the family of distributions that take positive values only and are skewed to the right. The degrees of freedom (df) specify the particular chi-square distribution. The chi-square test was evaluated using the chi-square table with n rows and m columns, giving (n − 1)(m − 1) degrees of freedom (df). The p-value is then calculated as the area under the density curve of the chi-square distribution. Equation (24) is used to compute the chi-square value with the help of Table 7:
$$\chi^2 = \sum \frac{(\text{observed values} - \text{expected values})^2}{\text{expected values}} \quad (24)$$
Table 7. Confusion matrix of ResNet 152 and VGG 19 model.
H0. 
There is no significant relationship between the correctly predicted and wrongly predicted objects for accuracy improvement.
H1. 
There is a significant relationship between the correctly predicted and wrongly predicted objects for accuracy improvement.
The chi-square value calculated using Equation (24) is chi-square = 555.86, with p = 0.00001. Since the p-value is below the 0.05 significance level, there is a significant relationship between the correctly predicted and wrongly predicted objects. Hence, the null hypothesis (H0) was rejected, and the alternative hypothesis (H1) was accepted.
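Equation (24) can be sketched for a 2 × 2 confusion table as follows; this is an illustrative computation with expected counts derived under the null hypothesis of independence from the row and column totals, not the exact bookkeeping behind the reported value of 555.86.

```python
def chi_square_2x2(table):
    """Chi-square statistic for a 2x2 table [[a, b], [c, d]] per
    Equation (24), with expected counts E_ij = row_total * col_total / n
    under the independence (null) hypothesis; df = (2-1)*(2-1) = 1."""
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = 0.0
    for i, row_tot in enumerate((a + b, c + d)):
        for j, col_tot in enumerate((a + c, b + d)):
            expected = row_tot * col_tot / n
            observed = table[i][j]
            stat += (observed - expected) ** 2 / expected
    return stat
```

The resulting statistic is compared against the chi-square critical value for 1 degree of freedom (3.841 at the 0.05 level) to accept or reject the null hypothesis.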
Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16 show the visual representation of the comparative analysis of the performance metrics among pre-trained models concerning various machine learning algorithms, such as logistic regression, Gaussian algorithm, k-nearest neighbor, decision tree, and k-means clustering.
Figure 12. Comparative analysis of the performance metrics among various models using logistic regression.
Figure 13. Comparative analysis of the performance metrics among various models using Gaussian algorithm.
Figure 14. Comparative analysis of the performance metrics among various models using k-nearest neighbor (KNN).
Figure 15. Comparative analysis of the performance metrics among various models using decision tree.
Figure 16. Comparative analysis of the performance metrics among various models using k-means clustering.
From Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16, it can be seen how the pre-trained models compare under each machine learning algorithm in the proposed method. Overall, logistic regression with ResNet 152 and VGG 19 performs better than the other machine learning algorithms paired with the other pre-trained models.

6. Conclusions

This research work proposed GradCAM-MLRCNN to localize and detect objects accurately using pre-trained models (VGG 16, VGG 19, ResNet 101, and ResNet 152) as backbone networks together with machine learning algorithms, such as decision tree, Gaussian algorithm, k-means clustering, k-nearest neighbor, and logistic regression. It combines two methodologies: Grad-CAM++, which localizes objects by producing a heat map, and Mask R-CNN, which detects objects. The main contribution of this method is that even when Mask R-CNN fails to predict the objects in an image, the machine learning algorithms combined with Grad-CAM++ help to predict them accurately.
The experimental results showed that logistic regression with ResNet 152 and VGG 19 performed well, with an accuracy rate of 98.4%, recall rate of 99.6%, and precision rate of 97.3%. This work can be extended by including decision-based rules using deep networks in fields such as video applications, reinforcement learning, and natural language processing.

Author Contributions

Conceptualization, A.I.X.; methodology, C.V. and J.J.M.; validation, J.J.M. and C.V.; formal analysis, C.V. and J.J.M.; writing—original draft preparation, A.I.X.; writing—review and editing, J.J.M., A.I.X., C.V. and J.-H.J.; visualization, C.V.; supervision, J.-H.J. and J.-G.H.; project administration, J.-H.J. and J.-G.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded through a scholarship from the Ministry of Science and Technology (MOST), Taiwan, and I-Shou University, Kaohsiung City, Taiwan.

Institutional Review Board Statement

Not Applicable.

Data Availability Statement

https://cocodataset.org/#download, (Accessed on 01022021).

Acknowledgments

The authors wish to thank I-Shou University, Taiwan. This work was supported by the Ministry of Science and Technology (MOST), Taiwan under grant MOST 110-2221-E-214-019. The researchers would also like to thank the ALMIGHTY GOD for His guidance from the start until the completion of this study.

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
