A Mature-Tomato Detection Algorithm Using Machine Learning and Color Analysis

An algorithm was proposed for automatic tomato detection in regular color images to reduce the influence of illumination and occlusion. In this method, the Histograms of Oriented Gradients (HOG) descriptor was used to train a Support Vector Machine (SVM) classifier. A coarse-to-fine scanning method was developed to detect tomatoes, followed by a proposed False Color Removal (FCR) method to remove the false-positive detections. Non-Maximum Suppression (NMS) was used to merge the overlapped results. Compared with other methods, the proposed algorithm showed substantial improvement in tomato detection. The results of tomato detection in the test images showed that the recall, precision, and F1 score of the proposed method were 90.00%, 94.41%, and 92.15%, respectively.


Introduction
Intelligent agriculture has attracted more and more attention around the world. Fruit harvesting robots are being rapidly developed due to their enormous potential efficiency. The first critical step for harvesting robots is detecting fruits autonomously. However, it is very difficult to develop a vision system that is as intelligent as humans for fruit detection. There are many reasons, such as uneven illumination, nonstructural fields, occlusion, and other unpredictable factors [1].
Intensive efforts have been made in vision system research for harvesting robots. Bulanon et al. [2] proposed a color-based segmentation method for apple recognition using the luminance and red color difference in the YCbCr model. Mao et al. [3] used the Drg-Drb color index to segment apples from their surroundings. The L*a*b* color space was employed to extract ripe tomatoes [4]. These methods use only color features for fruit detection and heavily rely on the effectiveness of the color space used. However, it is difficult to select the best color model for color image segmentation in real cases [5]. Furthermore, relying only on color features causes the loss of much of the other visual information in the image, which was proven to be very efficient for object recognition [6].
Kurtulmus et al. [7] proposed a green citrus detection method for use in natural outdoor conditions by combining Circular Gabor Texture features and Eigen Fruit. They reported a 75.3% accuracy. This method uses several fixed thresholds for detection. A method using feature image fusion was utilized for tomato

Histograms of Oriented Gradients Feature Extraction
Dalal and Triggs first proposed using HOG [18] as features for pedestrian detection. Due to its efficiency in pedestrian detection, the HOG feature has been widely used since then. The HOG can capture the shape information of an object and is invariant to geometric and photometric transformations. It can also deal with slight occlusion. However, to our knowledge, there is little research on fruit detection using HOG. Thus, in this work, HOG features are used in tomato detection.
HOG is a descriptor that encodes the shape of an object. It divides an image into a number of cells. For each cell, a 1-D histogram of the gradient directions (edge orientations) of the pixels in the cell is calculated. All the histogram entries are combined to form the representation of the image. For better illumination invariance, a local contrast-normalization step is employed: a measure of the local histogram energy is accumulated over a block, and all the cells of the block are normalized with the result. Figure 1 shows an example of HOG features of a tomato.
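The cell histogram and block normalization described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes central-difference gradients on interior pixels only, hard assignment of each pixel to a single bin (no interpolation), and a block consisting of a single cell. The function names `cell_histogram` and `l2_normalize` are hypothetical.

```python
import math

def cell_histogram(cell, n_bins=10):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `cell` is a list of rows of grayscale intensities. Only interior
    pixels are used so that central differences are always defined.
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Fold the direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

def l2_normalize(block, eps=1e-6):
    """L2-normalize the concatenated histograms of one block."""
    norm = math.sqrt(sum(v * v for v in block) + eps ** 2)
    return [v / norm for v in block]

# A 6 x 6 cell containing a vertical edge: every gradient is horizontal,
# so all of the histogram energy falls into the first orientation bin.
cell = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
hist = cell_histogram(cell)
descriptor = l2_normalize(hist)
```

Because the block is divided by its own energy, multiplying all intensities in the cell by a constant leaves `descriptor` essentially unchanged, which is the source of the illumination invariance mentioned above.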

Linear SVM
The principle of linear SVM [19] is to find a hyperplane that maximizes the distance from the support vectors to the hyperplane. In Figure 2, the equation w · x + b = 0 denotes the separating hyperplane. The two positive samples (red) and one negative sample (blue) that lie on the margins are called support vectors. The support vectors determine the separating hyperplane. In some cases, there are outliers that cannot be separated linearly. In these cases, a slack variable ξ_i is introduced to deal with the outlier data while accepting a reasonable error. The decision function f(x) = sign(w · x + b) is solved using Equation (1):

$$ \min_{\alpha} \; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \;\; i = 1, \dots, N \qquad (1) $$

where α_i and α_j are the Lagrange multipliers and x_i and y_i are the feature vector and label of sample i. C is the penalty parameter, and N is the total number of samples.
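As an illustration of the decision function and of the role of the slack variable, the following sketch evaluates a fixed hyperplane on a few two-dimensional samples. The weights, bias, and samples are hypothetical values chosen for the example, and a decision value of exactly zero is assigned to the positive class by assumption. The slack is computed with the usual soft-margin expression max(0, 1 − y(w · x + b)).

```python
def decision_value(w, x, b):
    """Signed score w . x + b of a sample relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, x, b):
    """Decision function f(x) = sign(w . x + b), with sign(0) taken as +1."""
    return 1 if decision_value(w, x, b) >= 0 else -1

def slack(w, x, b, y):
    """Slack needed for a sample with label y to satisfy the margin."""
    return max(0.0, 1.0 - y * decision_value(w, x, b))

# Hypothetical hyperplane x1 - x2 = 0 and four labeled samples.
w, b = [1.0, -1.0], 0.0
samples = [
    ([3.0, 1.0], 1),    # outside the margin, slack 0
    ([1.0, 3.0], -1),   # outside the margin, slack 0
    ([1.5, 1.0], 1),    # correct side but inside the margin, slack 0.5
    ([2.0, 1.0], -1),   # outlier on the wrong side, slack 2
]
slacks = [slack(w, x, b, y) for x, y in samples]
labels = [predict(w, x, b) for x, _ in samples]
```

A slack between 0 and 1 marks a sample that is still classified correctly but violates the margin, whereas a slack above 1 marks a misclassified outlier.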

The Non-Maximum Suppression for Merging Results
The NMS is a method for reducing repetitive detections or for merging the nearby detections around one object. It has been widely used [21,22] and proved efficient in object detection. It relies on the classification probability from the classifier and the overlap area among the bounding boxes to merge the results. After detection using the proposed method, there may be multiple detections pointing to the same tomato. Thus, we adopt NMS as a post-processing step to address this problem.

Image Acquisition and Preprocessing
To develop and evaluate the proposed algorithm, images of tomatoes in a greenhouse were acquired in late December 2017 and April 2019 in Vegetable High-tech Demonstration Park, Shouguang, China. A total of 247 images were captured using a color digital camera (Sony DSC-W170) with a resolution of 3648 × 2056 pixels. The photographs were taken at distances of 500-1000 mm, which is in accordance with the best operation distance for the harvesting robot. As shown in Figure 3, the growing circumstances of the tomatoes vary and include separated tomatoes; multiple overlapped tomatoes; and tomatoes occluded by leaves, stems, or other non-tomato objects. To speed up the image processing, all of the images were resized to 360 × 202 pixels using a bicubic interpolation algorithm. The dataset has been made publicly available [23].

An illumination enhancement method was used to decrease the effect of uneven illumination. The image was first converted from RGB space to Hue-Saturation-Intensity (HSI) space. The I layer was then split off, and a natural logarithm function was applied to each pixel. Next, the Contrast Limited Adaptive Histogram Equalization (CLAHE) method [24] was applied to the transformed I component. Finally, the H, S, and processed I layers were combined to obtain the final enhanced image. This procedure was performed on all the images as a preprocessing step before training the classifier. An example of image enhancement is shown in Figure 4.
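The logarithmic step of the illumination enhancement can be sketched as follows. The exact form of the transform is not given above, so the sketch assumes log(1 + I) rescaled back to the range 0-255. The RGB-to-HSI conversion and the CLAHE step are omitted, and the function name `log_enhance_intensity` is hypothetical.

```python
import math

def log_enhance_intensity(intensity, max_value=255.0):
    """Apply a natural-log transform to every pixel of an intensity layer.

    The output is rescaled so that 0 maps to 0 and `max_value` maps to
    `max_value` (an assumption; the scaling is not specified in the text).
    """
    scale = max_value / math.log(1.0 + max_value)
    return [[scale * math.log(1.0 + v) for v in row] for row in intensity]

# Dark pixels are lifted much more strongly than bright ones.
layer = [[0.0, 50.0, 255.0]]
enhanced = log_enhance_intensity(layer)
```

The transform compresses the bright end of the range and stretches the dark end, which reduces the contrast between sunlit and shaded regions before histogram equalization is applied.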

The Dataset
A total of 247 images were used for the experiment. To train the SVM classifier, 100 images were randomly selected from the captured images, 72 images were used for the validation set, and the remaining 75 images were used for the test set. From the training images, 207 tomato samples and 621 background samples were manually cropped to construct a training set. The training samples were augmented with random rotations of 0°-360°, which doubled the size of the training set (1656 samples in all). All of the cropped samples were resized to 64 × 64 pixels to unify the size. The tomato samples contained a margin of about 5 pixels on all sides. The background samples were randomly cropped to contain leaves, stems, strings, and other objects, and all the samples were separately labeled: 1 for the tomatoes and −1 for the backgrounds. Some examples from the datasets are shown in Figure 5.

The main steps of the developed detection algorithm are as follows:
(1) Extracting the HOG features of the training samples
(2) Training an SVM classifier using the extracted features and corresponding labels
(3) Extracting the Region-of-Interest (ROI) on the test image using a pretrained Naive Bayes classifier
(4) Sliding a sub-window on the ROI of the image with different resolutions using an image pyramid
(5) Extracting the HOG features of each sub-window
(6) Recognizing tomatoes with the pretrained classifier
(7) Performing FCR to remove any false positive detections
(8) Merging the detection results using the NMS method

Image Scanning Method
After training the SVM classifier using the training set, a coarse-to-fine detection framework is used to detect tomatoes. The pseudo code and detailed detection process are described in Algorithm 1.
All the pixels are classified as belonging to tomatoes or the background using a Naïve Bayes classifier (NB) trained on color features. Since mature tomatoes are red, three color transformations are performed to distinguish the fruits from background: R − G, R − B, and R/(R + G + B). After classification, a binary image is obtained, in which white pixels represent the potential tomatoes and black pixels represent the potential background.

Algorithm 1 (continued from the per-pixel labeling loop):
    …
            Label the pixel as black
        end
    end
    Apply morphological processing to obtain the ROI
    for all sub-windows in the ROI pyramid do
        Apply the SVM classifier f_SVM(·)
        if f_SVM(·) is +1 then
            Apply FCR to the sub-window
            if pixel-ratio > threshold then
                The window is recognized as a tomato
            else
                The window is recognized as background
            end
        else
            The window is recognized as background

Next, a morphological processing is applied to the binary image, and the Region-of-Interest (ROI) is extracted. A sliding window is applied to the ROI and slides with a fixed step. At each step, the sub-window is input to the pretrained SVM classifier to be classified as a tomato or not a tomato. If the sub-window is classified as a tomato, then FCR is used to implement further classification. After the sliding window slides all over the ROI, the image is downscaled by a fixed scaling factor, followed by the same sliding process until a defined minimum size is reached. The sliding window size is 64 × 64 based on the size of the tomatoes in the images. The sliding step and minimum size of the scaled image are set to 16 and 113 × 64, respectively. The image scaling factor is 1.1, which downscales the image by 10% at each step. A sketch map of the sliding window and image pyramid is shown in Figure 8.
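The scanning geometry given above (64 × 64 window, step of 16, scaling factor of 1.1, minimum size of 113 × 64) can be sketched as follows. For simplicity, the sketch scans the whole 360 × 202 image rather than the extracted ROI, and it assumes that each pyramid level is computed from the original size and truncated to whole pixels. Both are assumptions, and the function names are hypothetical.

```python
def pyramid_sizes(width, height, scale=1.1, min_size=(113, 64)):
    """Image sizes of the pyramid, from full resolution down to `min_size`."""
    sizes = []
    level = 0
    while True:
        w = int(width / scale ** level)
        h = int(height / scale ** level)
        if w < min_size[0] or h < min_size[1]:
            break
        sizes.append((w, h))
        level += 1
    return sizes

def window_origins(width, height, window=64, step=16):
    """Top-left corners of all sub-windows that fit inside the image."""
    return [(x, y)
            for y in range(0, height - window + 1, step)
            for x in range(0, width - window + 1, step)]

levels = pyramid_sizes(360, 202)
# Number of SVM evaluations needed if no ROI were used.
total_windows = sum(len(window_origins(w, h)) for w, h in levels)
```

Since the window size is fixed, each coarser level lets the same 64 × 64 window cover a larger tomato in the original image, and restricting the scan to the ROI reduces the number of windows per level accordingly.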

False Color Removal
All sub-windows of the image can be classified using the SVM classifier. However, there are some false positive detections after the classification, and a false positive elimination method is needed to reduce them. Color features play an important role in fruit detection, especially when the fruits have a different color from the background. A False Color Removal (FCR) method is proposed for false detection elimination. The sub-window image is binarized using a color feature which is derived as shown below, and then, the ratio of the number of white pixels to the number of all pixels in the sub-window is calculated. If the ratio exceeds a threshold of 0.3, the sub-window is classified as a tomato. Otherwise, it is classified as the background.

Cost function minimization [19] was applied as follows to obtain the color feature for binarization. A total of 897 samples including tomatoes and the background were chosen as the training set. The R, G, and B components of the RGB color model were extracted, and the mean value of each component over all the pixels of each sample was calculated to represent the sample. The tomato samples were labeled as 1, and the background samples were labeled as −1. Motivated by Cortes [19], a separating plane, given by Equation (2), is needed to separate tomatoes and background in the R-G-B coordinates:

$$ w \cdot x + b = 0 \qquad (2) $$

where x is the feature vector (R, G, B), and w and b are the weight vector and bias of the separating plane, respectively. The plane is derived by minimizing the cost function L in Equation (3):

$$ L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{M}\xi_i \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \dots, M \qquad (3) $$

where x_i and y_i are the feature vector (R_i, G_i, B_i) and label of sample i, respectively. M is the number of samples, ξ_i is the slack variable of sample i, which is used to deal with the outliers, and C is the penalty parameter.
The color feature derived for sub-window binarization is 0.16 × R − 0.093 × G − 0.037 × B − 11.032, and the threshold is 0.
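The FCR decision can be sketched directly from the color feature and the two thresholds stated above (0 for binarization and 0.3 for the white-pixel ratio). The representation of a sub-window as a list of rows of (R, G, B) tuples and the function names are assumptions made for this example.

```python
def is_tomato_pixel(r, g, b):
    """Binarize one pixel with the derived color feature (threshold 0)."""
    return 0.16 * r - 0.093 * g - 0.037 * b - 11.032 > 0

def false_color_removal(window, ratio_threshold=0.3):
    """Return True if the sub-window is kept as a tomato.

    `window` is a list of rows, each row a list of (R, G, B) tuples.
    """
    pixels = [p for row in window for p in row]
    white = sum(1 for r, g, b in pixels if is_tomato_pixel(r, g, b))
    return white / len(pixels) > ratio_threshold

red, green = (200, 30, 30), (40, 160, 40)
# Half of the pixels are red: the ratio 0.5 exceeds 0.3, so the window is kept.
mostly_red = [[red] * 4, [red] * 4, [green] * 4, [green] * 4]
# One red pixel out of 16: the ratio 0.0625 is below 0.3, so the window is rejected.
mostly_green = [[red] + [green] * 3] + [[green] * 4 for _ in range(3)]
kept = false_color_removal(mostly_red)
rejected = not false_color_removal(mostly_green)
```

In this way, a leaf region whose outline happens to resemble a tomato is rejected even when the SVM classifier accepts its shape.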

Experimental Setup
In this study, all experiments with the developed algorithm were performed in Python 3.5 on an Intel® Core™ i5-4590 CPU @ 3.30 GHz. Several experiments were conducted to validate the performance of the developed method. The datasets used in the experiments are listed in Table 1. Some examples of the results in each step are shown in Sections 4.2-4.5. Three indexes were used to evaluate the performance of the proposed algorithm and recently developed algorithms: recall, precision, and F1 score, which are defined by Equations (4)

Results of Different HOG Features
HOG features with different cell sizes, block sizes, and number of orientation bins were tested on the validation sample set. The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) [25] were used to evaluate their performance. In this section, the default HOG has the following characteristics: 8 × 16 pixel blocks of four 4 × 8 pixel cells; a linear gradient voting into 10 orientation bins in 0°-180°; and an L2 block normalization.
The number of orientation bins was tested with 3, 4, 6, 9, and 10. As shown in Figure 9c, the HOG feature with 10 orientation bins achieved the best performance.
In this study, the HOG feature with 4 × 8 pixel cells, 2 × 2 cell blocks, and 10 orientation bins was used for experiments.
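The length of the resulting feature vector for a 64 × 64 sample can be estimated with a short calculation. The block stride is not stated above, so the sketch assumes that blocks advance by one cell at a time (overlapping blocks, as in the original HOG formulation). The function name `hog_length` is hypothetical.

```python
def hog_length(window=(64, 64), cell=(4, 8), block=(2, 2), bins=10):
    """Number of HOG features for one window, with a block stride of one cell."""
    cells_x = window[0] // cell[0]
    cells_y = window[1] // cell[1]
    blocks_x = cells_x - block[0] + 1
    blocks_y = cells_y - block[1] + 1
    return blocks_x * blocks_y * block[0] * block[1] * bins

n_features = hog_length()
```

Under this assumption, the window holds 16 × 8 cells and 15 × 7 overlapping blocks, and each block contributes 4 cells × 10 bins, which gives 4200 features per sample.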

Results of the Image Scanning Method
An example of the coarse-to-fine framework proposed for tomato detection is shown in Figure 10. After classification by the NB classifier and the morphological operation, the binary image in Figure 10b was obtained, and the extracted ROI is shown in Figure 10c. After ROI extraction, the area to be scanned was reduced by more than 50%, so the detection can be expected to run about twice as fast. The ROI still contained the detection targets. Finally, the detection was performed on only the ROI, and the final results are shown in Figure 10d.

Results of the SVM Classifier
The SVM uses a linear kernel with a penalty parameter C = 1. It is implemented using the open-source scikit-learn package [26]. Examples from before and after applying the SVM classifier are shown in Figure 11. Both tomatoes were correctly detected, and each is marked with the inscribed circle of its bounding box.

Results of False Color Removal (FCR)
After detection using the SVM classifier, the tomatoes can be found along with some false positives (i.e., the backgrounds). Thus, the proposed FCR method is then applied to reduce the false positives. Examples from before and after applying FCR are shown in Figure 12. A generated false positive is shown in Figure 12a since the shape of the region is similar to a circle. It was successfully removed after performing FCR, as shown in Figure 12b.

Results of the Non-Maximum Suppression
After detection using the SVM classifier and the FCR in each sub-window, there are many sub-windows classified as tomatoes, and some of them correspond to the same one. Therefore, the NMS method is introduced.
The performance of the NMS mainly depends on the choice of the overlap and confidence thresholds. The overlap-confidence threshold combinations were tuned on the validation detection set. The impacts of the thresholds on recall, precision, and F1 score are shown in Figure 13. The thresholds of 0.3 (overlap) and 0.7 (confidence), which gave the best F1 score, were selected as the optimal thresholds and used in this paper.
An example of using the NMS is shown in Figure 14. The bounding box with the highest prediction probability was kept as the final prediction, and the other boxes whose overlap with it exceeded the threshold of 0.3 were discarded.
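A greedy form of NMS with the selected thresholds can be sketched as follows. The overlap measure is not specified above, so intersection over union (IoU) is assumed, and detections with a probability below the confidence threshold are assumed to be discarded before merging. The boxes and scores are hypothetical values.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(detections, overlap_thr=0.3, conf_thr=0.7):
    """Keep the highest-scoring box of every group of overlapping detections.

    `detections` is a list of (box, score) pairs.
    """
    candidates = sorted((d for d in detections if d[1] >= conf_thr),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in candidates:
        # A box survives only if it does not overlap any box already kept.
        if all(iou(box, k[0]) <= overlap_thr for k in kept):
            kept.append((box, score))
    return kept

detections = [
    ((10, 10, 74, 74), 0.95),     # best box on the first tomato
    ((18, 14, 82, 78), 0.85),     # overlaps the first box, suppressed
    ((150, 40, 214, 104), 0.90),  # separate tomato, kept
    ((30, 120, 94, 184), 0.55),   # below the confidence threshold, dropped
]
kept = nms(detections)
```

With these values, only the first and third boxes remain, one per tomato.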

Performance of the Developed Classifier for Cropped Samples
Manually cropped tomato samples were used in an experiment to evaluate the proposed method. Both the training and test sets were utilized, and the results are shown in Table 2. The recall and precision on the test set were 96.85% and 98.40%, respectively, which shows that the proposed method is effective for tomato detection.

Robustness of the Proposed Algorithm to Illumination
The performance of the proposed method was evaluated using 75 tomatoes in sunny conditions and 75 in shaded conditions. The results are shown in Table 3. For the sunny conditions, a 90.67% accuracy was achieved, while an 89.33% accuracy was obtained for the shaded conditions. The false positive rates were 5.56% and 5.63% for the sunny and shaded conditions, respectively. The results were comparable, which indicates that the proposed method is largely insensitive to illumination variation inside the greenhouse environment. This was mainly due to two factors: the illumination enhancement in the preprocessing step and the illumination normalization in the HOG-feature calculation process. Some examples of the results are shown in Figure 15.

Performance of the Proposed Method under Separated, Overlapped, and Occluded Conditions
The tomatoes under separated, overlapped, and occluded conditions were also tested. For the overlapped conditions, the tomatoes overlapped with each other in the image, while the occluded conditions referred to the tomatoes being blocked by leaves or stems. Notably, some of the overlapped or occluded areas exceeded 50%, which made the detection task much more challenging. The detection results under each condition are shown in Table 4, and 135 tomatoes were detected out of the total of 150 tomatoes. The overall correct identification rate was 90.00%.
All the tomatoes were correctly identified under the separated conditions as expected. For the overlapped and occluded conditions, the results became worse. Under overlapped conditions, the correct identification rate dropped to 91.14% because the overlap between tomatoes exceeded 50% in some cases. When the overlap area was under 50%, the proposed method could detect most of the tomatoes, but it failed if the area exceeded 50%. An example in Figure 16a illustrates this phenomenon. In Figure 16a (left), two overlapped tomatoes were both detected, while in Figure 16a (right), only two tomatoes were correctly detected; the top-right one, which was largely covered by another tomato, was missed.
A similar explanation accounts for the results of the occluded conditions. Here, 42 tomatoes were correctly detected out of the 50 tomatoes, and the missed ones were mostly due to heavy occlusion by leaves or stems. The correct identification rate was 84.00%. An example is shown in Figure 16b. In Figure 16b (left), two tomatoes that are blocked by leaves and stems were still correctly detected. However, in Figure 16b (right), only two tomatoes were detected, while the one on the left that was largely occluded by leaves and stems was not detected.
In addition, there were some false positives in the detection results, which was mainly due to the various tomato sizes. When several detections corresponded to the same tomato, only one was considered to be the true positive, and the others were all regarded as false positives. An example is shown in Figure 17.
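The three evaluation indexes can be recomputed from the detection counts reported in this section. The numbers of true positives (135) and missed tomatoes (15) follow from the text. The number of false positives is not stated explicitly, so the value of 8 used below is an assumption inferred from the reported precision of 94.41%. The function name `detection_metrics` is hypothetical.

```python
def detection_metrics(tp, fp, fn):
    """Recall, precision, and F1 score from detection counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# tp and fn come from the text (135 of 150 tomatoes detected);
# fp = 8 is an inferred value, not a reported one.
recall, precision, f1 = detection_metrics(tp=135, fp=8, fn=15)
percentages = [round(100 * v, 2) for v in (recall, precision, f1)]
```

Under this assumption, the computed values agree with the reported recall, precision, and F1 score of 90.00%, 94.41%, and 92.15%.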

Comparison with Other Methods
Two other recently proposed methods [7,14] were compared with the proposed method. The first method [7] uses a Circular Gabor Filter and Eigen Fruit as features, and the other method uses an AdaBoost classifier [14] with Haar-like features as input. Moreover, one of the popular deep learning frameworks, YOLO (You Only Look Once) [27], was also applied to evaluate its performance. Another experiment was set up using all of the same steps as the proposed method except for the false detection elimination step to test the effectiveness of the FCR. Table 5 shows the results. The proposed method achieved the second highest recall and had the second highest precision. This benefits from the better representation of the descriptor, the scanning framework, and the merging method.
The deep learning methods usually perform better than traditional methods in the face of big data. However, when the data is small or insufficient, they may be underfit due to the deep network structure, and the performance may be equal or even inferior to traditional methods. Table 5 shows the results.
The precision of the proposed method improved substantially after the FCR, which was largely due to the false positive elimination. To provide a more objective assessment, the F1 score [28] was calculated, which combined both the recall and precision together. Table 5 shows that the proposed method had the highest F1 score, which demonstrated that the method was effective and could be applied for the detection of mature tomatoes.

Table 5. A comparison of several tomato detection methods.

Conclusions and Future Work
An algorithm was proposed to overcome the difficulties that harvesting robots face in fruit detection. The method used color images captured by a regular color camera. Compared with single-feature detection methods, the proposed method used a combination of features for fruit detection, including shape, texture, and color information. This approach can reduce the influence of illumination and occlusion factors. HOG descriptors were adopted in this work. An SVM classifier was used to implement the classification task. In the scanning stage, a coarse-to-fine framework was applied, and then, an FCR method was used to eliminate the false positives. Lastly, NMS was used to obtain the final results.
Several experiments were conducted to evaluate the efficiency of the proposed method. A total of 510 samples were used to validate the classification efficiency of the SVM classifier. The recall was 96.85%, and the precision was 98.40%. The results showed that the classifier with only HOG features can distinguish tomatoes from backgrounds very well. In detection, the correct identification rate was 90.67% in sunny conditions and 89.33% in shaded conditions. These similar results showed that the proposed method could reduce the influence of various illumination levels in the greenhouse environment. The correct identification rate was 100% for separated tomatoes, 91.14% for overlapped tomatoes, and 84.00% for occluded tomatoes, and a reasonable false positive rate was maintained. Tomatoes were mainly missed when more than 50% of their area was blocked by other tomatoes or the background. If the blocked area was less than 50%, most of the tomatoes could be detected correctly. Compared with the other methods, the proposed method achieved the highest F1 score. As a reference, the average processing time of one image was about 0.95 s.
However, there are still some problems in the proposed method. The accuracy is not satisfactory for the overlapped and occluded tomatoes, especially when the blocked area exceeds 50%. Another limitation is that the experiment was carried out in the harvesting stage. Therefore, most of the tomatoes in the experiment were well ripened and fully red. The authors believe that the detection of tomatoes at other stages, including green and breaking red, is also needed for the harvesting robot. Our future research will focus on further improving the detection accuracy and on extending the method to other growth stages of tomatoes. Transfer learning [29,30] can also be applied with an extension of the datasets in the future.