1. Introduction
In computer vision, object detection is a fundamental technique that combines classification and recognition. In recent years, it has been applied in many areas, including autonomous driving, robotic grasping, and face recognition. Various factors can disrupt the detection process, such as unfavorable viewing angles, occlusion, and uneven lighting. Traditional object recognition methods rely on manually designed features, such as the histogram of oriented gradients (HOG) [1], the scale-invariant feature transform (SIFT) [2], and the deformable part-based model (DPM) [3].
In recent years, emerging deep learning techniques have also been applied to object recognition. Krizhevsky et al. [4] first proposed a large-scale deep neural network called AlexNet and demonstrated its classification performance on the ImageNet dataset, after which many new deep neural networks were proposed for object recognition. Deep neural networks for object detection can be divided into one-stage and two-stage detectors according to their structure [5]. The former generate and locate objects directly in the network from the input image; examples include YOLO [6] and SSD [7]. By contrast, the latter first extract convolutional neural network (CNN) features from the input image and then predict the classification and position of each object; representative algorithms include the R-CNN series [8,9,10].
The YOLO algorithm, proposed by Redmon et al. [6], is a CNN that predicts multiple box positions and categories simultaneously. Its network design extends the core idea of GoogLeNet. Although it performs end-to-end object detection and is less time consuming, it does so at some cost in accuracy.
The SSD algorithm, proposed by Liu et al. [7], is a single-stage deep neural network that can be applied to multi-class object detection. It uses small convolution filters to predict the category scores and box offsets of a set of default bounding boxes in the feature maps.
Girshick et al. [8] proposed the R-CNN model, which uses Selective Search to obtain candidate regions (approximately 2000 per image). Each candidate region is then normalized in size and used as the standard input to the CNN; AlexNet extracts the features of the candidate region, and finally multiple support vector machines (SVMs) classify the regions and the bounding boxes are fine-tuned.
In 2016, Ren et al. [11] proposed the Faster R-CNN algorithm, which introduces a region proposal network (RPN) to extract proposals. The RPN is a fully convolutional network that shares convolutional features with the detection network, so it can extract proposals at almost no extra cost. Its core idea is to generate region proposals directly with a CNN by sliding a small window over the last shared convolutional layer; the anchor mechanism and box regression then yield region proposals at multiple scales and aspect ratios, as the sketch below illustrates.
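To make the anchor mechanism concrete, the following is a minimal NumPy sketch (not taken from the cited papers; the base size, scales, and ratios are illustrative defaults) that enumerates the multi-scale, multi-aspect-ratio anchors generated at a single sliding-window position:

```python
import numpy as np

def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate anchors (x1, y1, x2, y2) centered at one sliding-window
    position, one per combination of aspect ratio and scale."""
    anchors = []
    for scale in scales:
        side = base_size * scale          # anchor area is side**2 at this scale
        for ratio in ratios:              # ratio = height / width
            w = side / np.sqrt(ratio)     # keep the area fixed while the
            h = side * np.sqrt(ratio)     # ratio reshapes the box
            anchors.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])
    return np.array(anchors)

# Nine anchors (3 scales x 3 ratios) per feature-map location, as in the RPN.
print(generate_anchors().round(1))
```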
Mask R-CNN [12] is an improvement on Faster R-CNN that focuses on instance segmentation. In addition to classification and positioning regression, the algorithm adds a parallel branch for instance segmentation and jointly trains the losses. The detailed structure of the algorithm is shown in Figure 1. The Mask R-CNN network has two main parts, the first of which is the RPN. After alignment using RoIAlign, the second part begins, which includes the segmentation mask prediction network. The main structure of the network uses VGG [13]. The RPN connects to the last convolutional layer of VGG and produces regions of interest (RoIs) as its output. Features are then extracted and pooled to a fixed size, and these pooled features serve as the branch inputs. For the network's positioning and classification branches, an architecture consisting of fully connected layers, convolutional layers, and deconvolutional layers is used. For the segmentation branch, the target object is accurately segmented through an architecture composed of multiple convolutional layers, deconvolution, and a segmentation mask. The Mask R-CNN-based object detection method therefore has three task branches, namely positioning, classification, and object segmentation, which classify, localize, and segment objects simultaneously.
Mask R-CNN has achieved very satisfactory results in instance-level object classification. However, obtaining the object masks required for training is laborious, for example when using the LabelMe annotation tool (http://labelme.csail.mit.edu/Release3.0/, accessed on 15 September 2022). We therefore propose a new GrabCut-based method that automatically marks and obtains image masks for training deep learning models. The proposed method consists of two stages: in the first stage, masks are generated with GrabCut's interactive image segmentation [14]; in the second stage, the masks output by GrabCut are used for detection.
Figure 1. The Mask R-CNN network (Ref. [15]).
The remainder of this paper is organized as follows: Section 2 briefly describes the automatic image mask generation method, followed by the experimental results, the discussion, and the conclusions.
2. Automatic Image Mask Generation Method
As shown in Figure 2, the proposed method consists of two parts. The first part implements GrabCut-based interactive image segmentation, which yields a pixel-level segmentation result: the mask of the image. In the second part, Mask R-CNN-based object detection is performed, in which the image mask, the original input image, and the image label information, such as the object type and background, are used for training. The outputs include the object segmentation results, the label information, and the average precision.
2.1. GrabCut-Based Mask Segmentation
To perform the segmentation task with GrabCut, the target area must first be manually framed, from which the algorithm automatically segments the likely target region; a small amount of user interaction then follows, specifying that certain pixels belong to the target. The cut-out result is a color image with the background removed. This cut-out target image must then be converted into a black-and-white (binary) image, which is more convenient to process as an image mask.
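As an illustration of this conversion step, the following is a minimal OpenCV sketch (the file names are hypothetical; the paper does not specify its implementation): every non-black pixel of the cut-out color image is mapped to white, yielding a binary mask.

```python
import cv2

# Hypothetical file: the color cut-out produced by GrabCut, in which the
# removed background is black.
cutout = cv2.imread("cutout.png")

# Convert to grayscale, then threshold: any pixel brighter than 0 becomes
# white (foreground, 255) and the black background stays 0.
gray = cv2.cvtColor(cutout, cv2.COLOR_BGR2GRAY)
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY)

cv2.imwrite("mask.png", mask)
```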
GrabCut is an improvement on the iterative Graph Cut algorithm [16] and is itself an iterative minimization algorithm. Each iteration re-estimates the parameters of the Gaussian mixture models (GMMs) so that the object and the background become easier to separate, which progressively improves the final segmentation result. According to Rother et al. [14], the GrabCut algorithm exploits both texture (color) information and boundary (contrast) information in the image. Consequently, it requires only a small amount of user interaction, i.e., simple frame selection and labeling, to obtain good segmentation results.
The GrabCut algorithm first requires the user to roughly select the foreground and background, from which a GMM is established for each of the foreground and background areas. The GMMs are initialized using the k-means algorithm, and the distances from each node to the foreground or background, as well as the distances between adjacent nodes, are computed. From this information, the segmentation energy weights are obtained, an s-t network graph is constructed over the unknown area, and the maximum-flow/minimum-cut algorithm is used to partition it. The segmentation process continuously updates the GMM parameters through iterations so that the algorithm converges. Because the GMM parameters are optimized during the iterations, the segmentation energy gradually decreases and finally converges to a minimum, at which point the image segmentation is realized.
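The paper does not name its GrabCut implementation; assuming OpenCV's cv2.grabCut, a minimal sketch of the box-initialized run looks as follows (the file name and rectangle coordinates are illustrative):

```python
import cv2
import numpy as np

img = cv2.imread("component.png")            # hypothetical input image
mask = np.zeros(img.shape[:2], np.uint8)     # per-pixel GrabCut labels

# Internal state for the foreground and background GMMs.
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

# User-drawn box around the target: (x, y, width, height), illustrative.
rect = (50, 50, 300, 200)

# Five iterations of GMM re-estimation and max-flow/min-cut segmentation.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Definite or probable foreground pixels form the binary mask.
binary = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                  255, 0).astype("uint8")
cv2.imwrite("grabcut_mask.png", binary)
```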
The specific process is as follows: when the GrabCut algorithm is run in PyCharm, an interactive interface pops up, and the image is processed in this interface by following the instructions. First, the target area is manually box-selected in the image, and the algorithm automatically segments the likely target region according to the selected box. If the segmentation is poor, i.e., the target area is not fully segmented or background is wrongly segmented into the target area, a subsequent interactive operation can be entered: the target area or background is marked with a simple stroke, and the segmentation algorithm is executed again to achieve semi-automatic segmentation of the target area. Following the minimum-energy method, the algorithm segments the pixels that approximate the target area, realizing interactive GrabCut. Owing to the added manual intervention, this is more accurate than fully automatic segmentation. Figure 3 shows a representative image that marks five types of electronic components and the operations required for segmentation; the required mask is obtained by simply framing and labeling the target area, and the final mask result is also shown in Figure 3.
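These corrective strokes map directly onto GrabCut's mask-initialized mode. Continuing the previous sketch (img, mask, and the GMM models are reused; the stroke coordinates are hypothetical stand-ins for points collected through a mouse callback), the user marks overwrite the pixel labels and the algorithm is re-run:

```python
# Illustrative user strokes: (x, y) pixels marked as definite foreground
# or definite background, e.g., gathered via cv2.setMouseCallback.
stroke_fg = [(180, 150), (185, 150), (190, 150)]
stroke_bg = [(60, 60), (65, 60), (70, 60)]

for x, y in stroke_fg:
    mask[y, x] = cv2.GC_FGD
for x, y in stroke_bg:
    mask[y, x] = cv2.GC_BGD

# Re-run GrabCut initialized from the edited mask instead of a rectangle.
cv2.grabCut(img, mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)

binary = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                  255, 0).astype("uint8")
```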
2.2. Object Detection Based on the Mask R-CNN Method
The proposed Mask R-CNN-based object detection method has three branches that perform different tasks, namely, the bounding-box positioning branch, the bounding-box classification branch, and the segmentation branch. The positioning and classification branches obtain their results directly through fully connected layers, while the segmentation branch mainly comprises successive convolutions, deconvolution, and a segmentation mask. More specifically, feature maps are first obtained from the labeled training dataset through the FPN network and fed into the RPN to obtain region proposals. These are input into the RoIAlign module to extract regions of interest, which are then passed to the two branches: the segmentation branch and the regression-classification network. The former produces the target mask and segmentation results, while the latter produces the classification and positioning of the box region. To train our proposed detection method, we use a platform consisting of Python and TensorFlow-GPU. The network is initialized with weights pre-trained on the Microsoft COCO dataset [17], and the detection method is then fine-tuned on our dataset.
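The paper does not publish its training code; as a plausible sketch, the widely used Matterport Mask R-CNN implementation (TensorFlow/Keras) is assumed below. The configuration values, the weights path, and the dataset_train/dataset_val objects (whose masks would come from the GrabCut stage) are hypothetical placeholders:

```python
from mrcnn.config import Config
from mrcnn import model as modellib

class ComponentConfig(Config):
    """Hypothetical configuration for a five-class component dataset."""
    NAME = "components"
    NUM_CLASSES = 1 + 5          # background + five object classes
    IMAGES_PER_GPU = 2
    STEPS_PER_EPOCH = 100

config = ComponentConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")

# Initialize from COCO pre-trained weights, skipping the output layers
# whose shapes depend on the number of classes.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# Fine-tune the head layers; dataset_train and dataset_val are assumed
# to be mrcnn Dataset objects built from the GrabCut-generated masks.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=30, layers="heads")
```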
5. Conclusions
Although current deep learning methods represented by Mask R-CNN have achieved high pixel-level segmentation accuracy, they rely on masks as training input. At present, these masks are made manually; when the object boundary is very complex and the dataset is especially large, this consumes considerable time and effort. We therefore proposed a mask-making method based on GrabCut that can quickly obtain the masks needed for object detection.
Experiments on BigBIRD (Big Berkeley Instance Recognition Dataset) verified the effectiveness of the proposed method, which achieved an mAP of over 95% for segmentation. While maintaining the positioning and segmentation performance of Mask R-CNN, the method ensures that the required masks can be obtained simply and efficiently. We also extended our experiments to the COCO dataset and to electronic component solder joint defect detection to further demonstrate the effectiveness of the proposed method.
The proposed method can also be applied to other object recognition tasks and is easily generalized to other fields that require image annotation. Although its efficiency is improved compared with manual annotation, it still requires some labeling and image conversion operations; we will therefore focus on these issues in future work to achieve truly automatic mask acquisition.