1. Introduction
Real-time acquisition and monitoring of the growth status of tomato young fruits provides early information on fruit quality, and it is of great significance for the timely removal of abnormal fruits with deformities, diseases, insect damage, etc., ensuring the normal growth of healthy fruits and improving fruit quality and yield [
1]. The automatic monitoring of the growth status of tomato young fruits is an important part of the tomato production process, so it is necessary to develop agricultural robots and other technical means to complete this work. At present, most robots used to complete advanced agricultural tasks, such as automatic flower thinning, fruit thinning and picking, rely on object detection techniques for fruit positioning, and the image-based object detection algorithm is the key factor affecting the recognition performance of such robots [
2,
3,
4]. Therefore, high-performance object detection algorithms play a fundamental role in improving the performance of robots, and they can provide theoretical guidance for the object recognition of fruit-thinning robots.
Traditional fruit detection methods mainly extract shallow feature information of the image, such as color, shape and texture, and scholars have conducted extensive research on them [
5,
6,
7]. Zhao et al. extracted the Haar-like features of tomato grayscale images, used the AdaBoost classifier to identify the fruits and eliminated false positives in the classification results through a color analysis method based on the Average Pixel Value (APV) [
8]. This combination of AdaBoost classification and color analysis correctly detected 96% of mature tomatoes, but the false negative rate was approximately 10%, 3.5% of tomatoes were not detected and the results were seriously affected by the background. In order to reduce the interference caused by a complex background and lighting, Zhao et al. extracted two new feature images, the a*-component image from the L*a*b* color space and the I-component image from the luminance, in-phase, quadrature-phase (YIQ) color space, fused the two feature images at the pixel level using the wavelet transform and then segmented the fused image to obtain the tomato fruit recognition results [
9]. In order to simplify the calculation and improve recognition efficiency, Wu et al. first extracted different texture features and color component information from each image block, analyzed the relevant features and their weights with the Iterative RELIEF (I-RELIEF) algorithm, then used a Relevance Vector Machine (RVM) classifier weighted by the selected features to divide the image blocks into different categories and finally obtained the fruit recognition results [
10]. The above studies improved the recognition accuracy of mature fruits. However, tomato young fruits are small, their color is similar to that of the stems and leaves, and they are affected by stem and leaf occlusion and changes in ambient light. The methods mentioned above are not sensitive to small tomato young fruits against a near color background, so it is difficult for them to achieve a stable recognition effect. Yamamoto et al. used a regression tree classifier to build a decision tree, extracted tomato fruit pixels through pixel segmentation and detected single mature and immature tomato fruits in the image with an accuracy of 88% [
11]. This method improves the recognition of green tomato fruits to a certain extent, but it places higher requirements on image quality and adapts poorly to noise. In summary, traditional fruit detection methods are mainly based on shallow features such as color and shape outline, and they do not adapt well to changeable fruit morphology or occlusion by tomato stems and leaves [
12]. Therefore, it is difficult for traditional methods to meet the high-precision and real-time requirements of recognizing tomato young fruits against a near color background.
In recent years, with the rapid development of deep learning techniques and the continuous improvement of computing power, Convolutional Neural Networks (CNNs) have shown great advantages in the field of object detection [
13,
14]. In the field of agriculture, compared with traditional fruit detection methods, CNNs perform better in tasks such as image classification [
15,
16], object detection [
17,
18,
19,
20] and object segmentation [
21,
22]. Chen et al. proposed a dual-path feature extraction network to extract the semantic feature information of small tomato objects, used the K-means++ clustering method to calculate the scales of the bounding boxes and achieved a test accuracy of up to 94.29% [
23]. The studies mentioned above have made targeted improvements to their models in terms of different scales, environmental interference and background removal, and achieved good results. Wang et al. proposed an apple young fruit detection method based on a Region-Based Fully Convolutional Network (R-FCN), which can effectively identify occluded, blurred and shadowed young fruits [
24]. However, fruit overlap caused by the dense distribution of clustered fruits leads to considerable false and missed detections, which makes the generalization ability of such models insufficient and their recognition accuracy relatively low [
25,
26]. At the same time, due to the complex background of the field environment and the irregular growth status and positions of tomato young fruits, detection is more severely affected by occlusion from tomato stems and leaves.
In response to the above problems, this paper proposes a method for detecting tomato young fruits against a near color background based on an improved Faster R-CNN with an attention mechanism. First, the pre-trained weights of the feature extraction network ResNet50 are loaded and fine-tuned. In order to address the difficulty of extracting features under stem and leaf occlusion, the Convolutional Block Attention Module (CBAM) attention mechanism [
27] is used to process the feature map to strengthen the regional characteristics of the fruits and increase the richness of the feature map. In addition, in order to enhance the model’s adaptability to tomato young fruits of different scales, Feature Pyramid Networks (FPN) [
28], which has a low computational cost, is used to fuse high-level semantic features with low-level detailed features. Then, according to the growth characteristics of the young fruit clusters, the Soft Non-Maximum Suppression (Soft-NMS) method is used to reduce the missed detection rate of overlapping fruits. Finally, the Region of Interest Align (RoI Align) feature mapping method is used to optimize the positioning of the bounding boxes, and the detection model of tomato young fruits is constructed.
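To make the role of Soft-NMS concrete, the following is a minimal NumPy sketch of the Gaussian-decay variant. It illustrates the general technique only and is not the implementation used in this work; the decay function, `sigma` and the score threshold are illustrative assumptions. Where hard NMS would discard any box whose overlap with the current best box exceeds a threshold (and thus miss one fruit of an overlapping pair), Soft-NMS merely decays its score.

```python
import numpy as np

def iou(a, b):
    # boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: instead of discarding every box whose IoU with the
    current best box exceeds a hard threshold, decay its score by
    exp(-IoU^2 / sigma), so heavily overlapped but genuine fruits survive."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        keep.append(best)
        remaining.remove(best)
        for i in remaining:
            scores[i] *= np.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return keep
```

For two strongly overlapping fruit boxes, hard NMS with a typical IoU threshold would keep only one, whereas the decayed score of the second box usually stays above the threshold and both detections are retained.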
The rest of the paper is organized as follows: the Materials and Methods section introduces the source and structure of the data set used in this research and discusses the improvements to the structure and algorithms of Faster R-CNN, as well as model training and testing, in detail. The Results and Discussion section presents the tests used to evaluate the performance of the model and analyzes the test results. The Conclusion summarizes the work of this paper.
2. Materials and Methods
2.1. Data Sources
RGB images contain only the three channels of red, green and blue information, so data redundancy is small and the cost of image acquisition is low; this work therefore uses RGB images of tomato young fruits. From April 2021 to May 2021, a total of 2235 images of tomato young fruits were collected in the agricultural digital greenhouse of Northwest A&F University to construct the data set. In the image acquisition process, taking into account the differences in imaging results caused by different weather and acquisition times, and in order to ensure the diversity and effectiveness of the data set, images were collected from tomatoes transplanted for 30 days, with fruit sizes of approximately 28 mm–55 mm, under different weather conditions (sunny, cloudy) and in different time periods (morning, noon, evening). The digital image acquisition device was an MI 9 smartphone, and the image size is 3000 × 3000 pixels. Because the clusters of densely distributed tomato young fruits have different growth statuses and positions, and the fruits are occluded by stems and leaves, the fruits were photographed from different angles and directions to increase the diversity and complexity of the samples.
Table 1 shows sample images taken under different conditions.
Image labeling is an important part of the object detection process. The LabelImg labeling software is used to annotate the position information of the tomatoes in each image. The annotations follow the format of the PASCAL VOC 2007 data set (a data set containing multiple types of labeled object images and annotation files), and the corresponding XML files are generated automatically. Taking any image in the data set as an example,
Figure 1a,b show the position of the manual annotation box and the corresponding XML description file, respectively.
In order to train and test the object detection model, the annotated data set is randomly divided into 3 independent subsets: a training set, a validation set and a test set. 80% of the images are assigned to the training set, 10% to the validation set and the remaining 10% to the test set, ensuring that each subset contains fruit images of different forms. The training set is used to train the network, whose structure is learned automatically by adjusting the weights and biases. The validation set is used to make a preliminary evaluation of the model and to visually demonstrate the effect of model training through the recognition accuracy. The test set is used to evaluate the generalization ability of the model.
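The random 80%/10%/10% partition described above can be sketched as follows. This is an illustrative Python snippet, not the authors' pipeline; the fixed seed is an assumption introduced only to make the example split reproducible.

```python
import random

def split_dataset(filenames, seed=42):
    """Randomly shuffle the annotated images and split them
    80% / 10% / 10% into training, validation and test sets."""
    files = list(filenames)
    random.Random(seed).shuffle(files)     # fixed seed for a reproducible split
    n_train = int(0.8 * len(files))
    n_val = int(0.1 * len(files))
    train = files[:n_train]
    val = files[n_train:n_train + n_val]
    test = files[n_train + n_val:]
    return train, val, test
```

For the 2235 collected images, this yields 1788 training, 223 validation and 224 test images (rounding leaves the remainder in the test set).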
2.2. Feature Extraction Network
The PASCAL VOC 2007 data set used to train and test the original Faster R-CNN contains 21 different types of objects, and the feature differences among the types are obvious, so the VGG16 network used in the model achieves good results in the feature extraction process. However, some problems remain when detecting images of tomato young fruits collected in a field environment. First, the color of young tomato fruits is green, which is similar to the color of the tomato stems and leaves, and the occlusion of fruits by stems and leaves differs among images taken at different angles. Second, the images acquired in the field environment all contain a large amount of irrelevant, complex background, along with environmental interference factors such as lighting. In addition, young fruits in the field are mostly clustered and densely distributed, with different growth statuses and uneven distribution. The traditional feature extraction network VGG16 extracts features of insufficient richness for young tomato fruits with the above characteristics, and it is difficult for it to achieve satisfactory results.
In order to solve the problems mentioned above in the task of detecting young tomato fruits and to improve the richness of the features the network extracts from them, this paper adopts ResNet50, with its residual structure, as the feature extraction network.
Figure 2 shows the 2 different basic structures of ResNet50. The first layer structure of each residual body in the ResNet50 network is shown in Figure 2a: the input feature is processed by the main branch, which expands the depth of the input feature matrix to twice that of the input, while the shortcut branch uses a 1 × 1 convolution to increase the dimension and adds its result to the output of the main branch. The subsequent layer structure of each residual body is shown in Figure 2b: the main branch performs feature extraction, while the shortcut performs no processing and is added directly to the output of the main branch, so the network learns the residual between the 2 branches. Due to the small sample size of the acquired images, it is difficult to retrain the model from scratch and achieve good results. In order to avoid over-fitting, the transfer learning method is used to fine-tune the network according to the characteristics of the data set in this paper.
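The two residual structures described above can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions (every convolution is reduced to a 1 × 1 per-pixel matrix multiply and batch normalization is omitted); it is not the ResNet50 implementation used in this work.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w):
    # a 1x1 convolution is a per-pixel matrix multiply: (H, W, Cin) @ (Cin, Cout)
    return x @ w

def bottleneck(x, w_main, w_shortcut=None):
    """One residual layer: output = ReLU(F(x) + shortcut(x)).

    w_main holds the weights of the main branch (all reduced to 1x1
    convolutions here for brevity). w_shortcut is the 1x1 projection
    used in the first layer of each residual body, where the dimensions
    change (Figure 2a); when it is None, the identity shortcut of the
    subsequent layers (Figure 2b) is used instead.
    """
    out = x
    for w in w_main[:-1]:
        out = relu(conv1x1(out, w))
    out = conv1x1(out, w_main[-1])          # no ReLU before the addition
    identity = x if w_shortcut is None else conv1x1(x, w_shortcut)
    return relu(out + identity)             # the residual addition
```

The shortcut lets the network learn only the difference between the two branches, which eases optimization in deep networks.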
Because the recognition accuracy for tomato young fruit images obtained under different conditions is not high and the relevant fine-grained features are difficult to capture, enhancing the network's effective attention to fine-grained features is the key to solving this problem. An attention mechanism assigns larger weights to high-contribution information while suppressing irrelevant information through weight distribution, which is an effective way to improve the performance of feature extraction networks [
29,
30]. Based on ResNet50, this paper uses the CBAM attention module to further optimize the acquired features. As shown in Figure 3, the shape of the input feature matrix is W × H × C. After max pooling and average pooling, 2 groups of 1 × 1 × C feature matrices are obtained and passed through a shared Multilayer Perceptron (MLP); the 2 output feature matrices are then added to obtain the Channel Attention (CA) weight information of the different channels, as calculated in Equation (1). After the CA is multiplied by the input feature matrix, the feature matrix integrated with channel attention is obtained, shown as Feature X' in Figure 3. This feature matrix is then subjected to channel-wise max pooling and average pooling to obtain 2 feature maps of size W × H × 1, which are concatenated in the depth direction; a convolution operation is then performed to obtain the Spatial Attention (SA) map that integrates the spatial weight information, as calculated in Equation (2). In Equation (2), f7×7 indicates that the convolution kernel size is 7 × 7. Finally, the SA is multiplied by Feature X' to obtain the feature map Refined Feature X'', which combines channel and spatial attention information.
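The channel and spatial attention steps described above can be sketched as follows. This is a minimal NumPy illustration of the CBAM computation, not the implementation used in this work; the MLP weight shapes and the explicit convolution loop are simplifying assumptions made for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Channel Attention (cf. Equation (1)): a shared two-layer MLP is applied
    to the max-pooled and average-pooled 1 x 1 x C descriptors, and the two
    outputs are summed and passed through a sigmoid. x has shape (H, W, C)."""
    max_desc = x.max(axis=(0, 1))                    # global max pooling, (C,)
    avg_desc = x.mean(axis=(0, 1))                   # global average pooling, (C,)
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2     # shared MLP, ReLU hidden layer
    return sigmoid(mlp(max_desc) + mlp(avg_desc))    # CA weights, (C,)

def spatial_attention(x, kernel):
    """Spatial Attention (cf. Equation (2)): channel-wise max and average maps
    are concatenated in depth and convolved with a 7 x 7 kernel (7, 7, 2)."""
    h, w, _ = x.shape
    maps = np.stack([x.max(axis=2), x.mean(axis=2)], axis=2)   # (H, W, 2)
    padded = np.pad(maps, ((3, 3), (3, 3), (0, 0)))            # 'same' padding
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 7, j:j + 7] * kernel)
    return sigmoid(out)                                        # SA weights, (H, W)

def cbam(x, w1, w2, kernel):
    x_prime = x * channel_attention(x, w1, w2)                 # Feature X'
    sa = spatial_attention(x_prime, kernel)
    return x_prime * sa[..., None]                             # Refined Feature X''
```

Because both attention maps lie in (0, 1), the module reweights rather than replaces the input features, emphasizing fruit regions while suppressing background responses.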
2.3. Multi-Scale Feature Fusion
Since there may be multiple tomatoes of different sizes in an image, different features are needed to distinguish objects of different scales. However, as the features extracted by the CNN become more abstract, the size of the feature maps gradually shrinks, resulting in the loss of detailed information. In order to obtain robust high-level semantic and low-level detailed information at the same time, and to improve the model's adaptation to features of different scales, this paper constructs, from the feature maps of different scales generated by the feature extraction network and the CBAM attention mechanism, a pyramid structure with strong semantic features at all scales. Through a bottom-up feature extraction and top-down feature fusion process, the semantic and detailed information in the features is optimized to form a feature pyramid structure.
The detailed structure of the feature pyramid constructed in this paper is shown in Figure 4. The feature extraction network ResNet50 contains 4 residual bodies: C2, C3, C4 and C5. After convolving the original image, the output of the last residual structure of each residual body is selected as a feature map, so a total of 4 feature maps are obtained. It should be noted that the size of the output feature map of each residual body is half that of the previous residual body, while the depth of the feature matrix is doubled. The obtained feature maps of different scales are each optimized by the CBAM module and then input into the FPN structure. In order to ensure the normal fusion of the subsequent feature maps, the channels of the feature maps input to the FPN are first adjusted through 1 × 1 convolutions so that the feature matrices of different scales have the same depth. Then, the top-down process in the FPN up-samples the abstract semantic features by a factor of 2 and fuses them with the corresponding horizontally connected feature map of the next layer. Since this method only adds horizontal connections to the initial network, it generates very little additional computational cost. Finally, a 3 × 3 convolution operation is applied to each layer of fused feature maps to reduce the aliasing effect caused by up-sampling. After the above operations, the features are optimized and fused in a top-down manner, so that the feature maps at all scales have rich semantic information. The feature maps P2–P5 output by the FPN structure in Figure 4 correspond to C2–C5 in ResNet50, and P6 is a higher-level abstract feature obtained by max pooling on the basis of P5.
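The top-down fusion just described can be sketched as follows. This is a minimal NumPy illustration under stated assumptions (nearest-neighbour up-sampling, 1 × 1 lateral projections expressed as matrix multiplies, and the 3 × 3 anti-aliasing convolution omitted); function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour up-sampling by a factor of 2 in both spatial dimensions
    return x.repeat(2, axis=0).repeat(2, axis=1)

def maxpool2x(x):
    # 2x2 max pooling with stride 2 (used to derive P6 from P5)
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def fpn_top_down(features, lateral_ws):
    """Top-down FPN fusion. `features` = [C2, C3, C4, C5], ordered from the
    highest-resolution map to the lowest, each of shape (H, W, C_i);
    `lateral_ws` are 1x1-conv weights projecting every C_i to a common depth."""
    laterals = [f @ w for f, w in zip(features, lateral_ws)]
    merged = [laterals[-1]]                       # P5 starts from C5
    for lat in reversed(laterals[:-1]):           # fuse C4, then C3, then C2
        merged.append(lat + upsample2x(merged[-1]))
    p2, p3, p4, p5 = reversed(merged)
    p6 = maxpool2x(p5)                            # extra coarse level for the RPN
    return p2, p3, p4, p5, p6
```

Because each level only adds an up-sampling and an element-wise addition, the extra cost over the backbone is small, matching the low computational overhead noted above.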
2.4. The Network Architecture of Improved Faster R-CNN with Attention Mechanism
The overall structure of the improved Faster R-CNN is shown in Figure 5. It is mainly composed of 4 parts: the optimized backbone, the region proposal network (RPN), bounding box regression and classification. Since all of the above modules are implemented with CNNs, they can all run on the GPU, so the detection speed is fast and the comprehensive performance of the model is good. First, the input image is standardized and scaled to a fixed size. ResNet50 with CBAM is used to extract image features, and the transfer learning strategy is used to fine-tune the network parameters. Then, through the FPN, the 4 fused feature maps P2, P3, P4 and P5 are obtained, which are shared by the RPN and the subsequent bounding box regression and classification; the more abstract P6 is only used for RPN training. The RPN is used to generate region proposals: it determines whether the anchors are foreground or background through a softmax function and then uses bounding box regression to correct the anchors and obtain accurate bounding boxes. In the original Faster R-CNN, the position of the bounding box is obtained from model regression while the pooled feature map is required to have a fixed size; therefore, the region of interest pooling (RoI Pooling) operation involves 2 quantization processes: one quantizes the boundary of the bounding box into integer coordinate values, and the other divides the quantized boundary area into k × k units and quantizes the boundaries of each unit. There is a certain deviation between the RoI after these operations and the initial RoI, which affects the detection accuracy of small targets. In order to further improve the positioning accuracy of the bounding box, the RoI Align feature mapping method is used to improve the extraction accuracy of the RoI. Finally, the proposals are sent to the subsequent fully connected layers for classification and bounding box regression.
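The difference between RoI Pooling and RoI Align described above can be sketched as follows. This is a minimal single-channel NumPy illustration of RoI Align's bilinear sampling, not the implementation used in this work; the output size and the number of sampling points per bin are illustrative assumptions.

```python
import numpy as np

def bilinear(feat, y, x):
    # sample a single-channel feature map at a continuous point (y, x)
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, roi, out_size=2, samples=2):
    """RoI Align: the floating-point RoI boundary (y1, x1, y2, x2) is kept
    as-is, each of the out_size x out_size bins is sampled at samples^2
    regularly spaced points by bilinear interpolation and the samples are
    averaged, so neither quantization step of RoI Pooling is performed."""
    y1, x1, y2, x2 = roi                     # continuous coordinates, no rounding
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for by in range(out_size):
        for bx in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    py = y1 + (by + (sy + 0.5) / samples) * bin_h
                    px = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, py, px))
            out[by, bx] = np.mean(vals)
    return out
```

Avoiding the two rounding steps removes the sub-pixel misalignment that RoI Pooling introduces, which matters most for small targets such as young fruits.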