Artificial Intelligence Approach for Tomato Detection and Mass Estimation in Precision Agriculture

Abstract: Application of computer vision and robotics in agriculture requires sufficient knowledge and understanding of the physical properties of the object of interest. Yield monitoring is an example where these properties affect the quantified estimation of yield mass. In this study, we propose an image-processing and artificial intelligence-based system using multi-class detection with instance-wise segmentation of fruits in an image that can further estimate dimensions and mass. We analyze a tomato image dataset with mass and dimension values collected using a calibrated vision system and accurate measuring devices. After successful detection and instance-wise segmentation, we extract the real-world dimensions of the fruit. Our characterization results exhibited a significantly high correlation between dimensions and mass, indicating that artificial intelligence algorithms can effectively capture this complex physical relation to estimate the final mass. We also compare different artificial intelligence algorithms to show that the computed mass agrees well with the actual mass. Detection and segmentation results show an average mask intersection over union of 96.05%, mean average precision of 92.28%, detection accuracy of 99.02%, and precision of 99.7%. The mean absolute percentage error for mass estimation was 7.09% for 77 test samples using a bagged ensemble tree regressor. This approach could be applied to other computer vision and robotic applications such as sizing and packaging systems and automated harvesting or to other measuring instruments.


Introduction
Artificial intelligence-based fruit monitoring and grading systems are being considered as potential replacements for traditional manual inspection in the agricultural and packaging industry [1][2][3][4][5][6][7][8]. This is mainly because food production must meet the rising demands of an ever-growing world population. Tomato (Solanum lycopersicum) is one of the most widely produced and consumed agricultural products, and approximately 182 million tons of tomatoes were produced in 2017 [9]. The main purposes of these systems include harvesting, sorting, and grading of fruits while performing calibrations for parameters such as color, size, shape, mass, and defects. Hence, the development of an accurate fruit detection and mass estimation system is crucial for developing a fully automated agricultural and packaging pipeline. The three major steps in this process are object detection, classification, and analysis (e.g., color, dimension, volume, or mass estimation).
Fruit detection systems have significantly advanced, considering the complexity of the natural environment and the unstructured features of fruits, in addition to other machine-vision challenges, such as occlusions and variations in illumination. Existing traditional approaches involve the use

Sustainability 2020, 12, x FOR PEER REVIEW 3 of 15
and prior knowledge regarding harvest crops would help farmers in estimating and controlling their yield. The remainder of this paper is organized as follows. Section 2 provides an overview of the overall system and describes the various steps involved. We demonstrate our experimental results with a discussion based on our results in Section 3. Finally, we draw conclusions on our work in Section 4.

Materials and Methods
In this section, we describe the overall framework of our tomato detection and mass estimation system. Figure 1 provides an overview of the training and testing scheme of our method. We started by collecting tomato images, dimensions, and mass values for training, validation, and testing of the following sub-modules of the tomato detection and mass estimation system. The training images were annotated and fed to train a DCNN for detection and mask generation. Meanwhile, the collected dimension and mass features were used to train the regression model for mass estimation. This mass estimation regression model was trained separately from the detection and mask generation DCNN.
Figure 1. Step-by-step illustration of our detection and instance-based segmentation system with dimension and mass estimation of tomato fruits.
During the test phase, we first obtained the mask of the input test image using a trained DCNN for detection and mask generation. Next, the dimensions of tomatoes were extracted from the mask using a two-dimensional reference object with known width or height in the geometry module. These dimensions were the input to a trained regression model to estimate the final mass of test tomato images. In the following subsections, we provide technical details for various steps involved in training our system for estimating the mass of tomato fruits from RGB images.


Data Collection and Annotation
We collected tomato image data with mass and dimension values in a smart greenhouse in Jeounju, Korea. A simple hand-held camera device was used to capture images. The camera was held perpendicular to the plane carrying the objects to be measured, including a known two-dimensional reference, such that all the objects appeared to be co-planar. Figure 2a shows the experimental setup for image acquisition. The images were captured under varying conditions of ripening stage (ripe and unripe), illumination, and fruit size, depending on when they were taken. The dimensions (fruit length and width) and the mass of each tomato were measured using an ABS digimatic caliper (Mitutoyo Corporation, Kawasaki, Japan) and a T-4002 (Symaxkorea, Anyang, Korea), respectively. We collected a total of 651 images and physical data values (dimension and mass) for 2521 samples. The image data were further split into training (73%), validation (15%), and test (12%) sets. Because the total number of available samples was limited, we chose this split to place as many samples as possible in the training set, which slightly increases the variance in the training data and hence helps avoid overfitting to some extent. For thorough evaluation, we used test set samples with known dimensions and mass values to compare against the predicted dimensions and mass. For training and optimization of the detection and mask generation modules, we provided a labeled training set annotated using the manual image annotation tool VGG Image Annotator [31]. Figure 2b shows a sample image from our dataset along with its annotation mask.

Detection and Mask Generation Module
This module aims to detect and generate instance-wise segmentation masks for tomato images. We used the state-of-the-art Mask-RCNN [25], which adds an extra branch for high-quality instance segmentation to its predecessor Faster-RCNN [32], for object detection. The instance segmentation task of each region of interest (RoI) runs parallel to the classification and bounding box regression pipeline. The mask branch includes a small fully convolutional network (FCN) [33] applied to each RoI, which can predict a binary segmentation mask for each class instance in a pixel-to-pixel manner. Figure 3 illustrates a complete framework for the Mask-RCNN with three stages: a feature extraction backbone network, a region proposal network (RPN) to generate anchors, and an FCN running parallel to fully connected networks that output instance-wise semantic masks and target detection with classification outputs. In this study, we used ResNet101 [34] with a feature pyramid network (FPN) [35] as the convolutional feature extractor backbone, which provides excellent gains in both speed and accuracy.

While ResNet101 extracts low-level features in shallow layers and high-level features in deep layers, FPN combines the semantically strong low-resolution features with semantically weak high-resolution features at all levels using lateral connections. The convolutional feature maps extracted from the backbone network are then used as input to the RPN. The anchors in the RPN span multiple pre-defined scales and aspect ratios to cover tomatoes of different shapes. The generated anchors were trained to perform classification using a Softmax loss function layer. A SmoothL1 loss [24], which is less sensitive to outliers, was used to calculate the loss between the proposed and predicted bounding box. Figure 4 shows example anchors generated using RPN. The positive anchors shown were examined by the classifier and the regressor during the training process.
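The outlier behaviour of the SmoothL1 loss mentioned above can be illustrated with a minimal sketch. It assumes the commonly used definition with a transition point at 1 (quadratic for small residuals, linear for large ones); the function names `smooth_l1` and `smooth_l1_box_loss` are hypothetical and are not taken from the implementation used in this study.

```python
def smooth_l1(x: float) -> float:
    """SmoothL1 penalty for a single regression residual x.

    Assumes the common form with a transition point at |x| = 1.
    """
    ax = abs(x)
    if ax < 1.0:
        return 0.5 * x * x  # quadratic near zero, like an L2 loss
    return ax - 0.5         # linear for large residuals, like an L1 loss


def smooth_l1_box_loss(predicted, target):
    """Sum the SmoothL1 penalty over the box coordinates (hypothetical helper)."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))


if __name__ == "__main__":
    # Illustrative offsets only; these are not values from the paper.
    print(smooth_l1_box_loss([0.1, 0.2, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0]))
```

Because the penalty grows only linearly for residuals larger than 1, a single badly localized box contributes less to the total than it would under a squared-error loss.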
The spatial structure of masks was extracted through the pixel-to-pixel correspondence provided by convolutions, which requires the small extracted RoI feature maps to be well aligned so that this correspondence is preserved. Mask-RCNN uses RoIAlign instead of the RoIPool used in Faster-RCNN to remove any forced quantization that introduces misalignments between the extracted features and RoIs. This is followed by a multi-branch prediction network consisting of an FCN for the generation of a binary mask for each class instance, a fully connected layer for classification, and an L1 regression layer to predict accurate bounding boxes. The total training optimization loss can be summarized as follows:

$$\mathcal{L} = \mathcal{L}_{RPN} + \mathcal{L}_{multi\_task}$$

where $\mathcal{L}_{RPN}$ is the sum of the Softmax classification loss for the generated anchors and the SmoothL1 bounding box regression loss in the RPN, as shown in [24]. The $\mathcal{L}_{multi\_task}$ loss function optimizes the classification, localization, and segmentation mask and can be represented as:

$$\mathcal{L}_{multi\_task} = \mathcal{L}_{cls} + \mathcal{L}_{bbox} + \mathcal{L}_{mask}$$

where $\mathcal{L}_{cls}$ and $\mathcal{L}_{bbox}$ are the classification and localization loss functions, respectively, similar to Faster-RCNN, and $\mathcal{L}_{mask}$ is the average binary cross-entropy loss for the nth mask with the region classified as ground truth class n. Thus, any competition between classes for mask generation can be avoided.
We used transfer learning to improve the generalization of our Mask-RCNN on our sparse tomato dataset. Transfer learning reuses the knowledge of a machine-learning model trained on one problem for a different but related problem. Its main advantages are better neural network performance with reduced training time and less training data. To train our detection and segmentation module, we initialized it with Mask-RCNN weights pre-trained on the Microsoft Common Objects in Context dataset [36] because of inadequate training samples and annotations. The framework for Mask-RCNN was implemented using the deep learning libraries TensorFlow and Keras. We used stochastic gradient descent with an initial learning rate of 0.001 and momentum of 0.9. The mini-batch size was set to 1 image on an NVIDIA V100 graphics processing unit with 64 GB of memory, and training took an hour for 47K iterations.
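To make the optimizer settings concrete, the following is a minimal sketch of one stochastic gradient descent step with momentum, using the learning rate of 0.001 and momentum of 0.9 reported above. It assumes the classical velocity form of the update; the exact rule applied by the deep learning framework may differ in detail, and the function name `sgd_momentum_step` and the toy objective are hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update (classical velocity form, assumed here)."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


if __name__ == "__main__":
    # Toy objective f(w) = w^2 with gradient 2w; it stands in for the network loss.
    w, v = [1.0], [0.0]
    for _ in range(2):
        g = [2.0 * w[0]]
        w, v = sgd_momentum_step(w, g, v)
    print(w, v)
```

The velocity term accumulates past gradients, so consecutive steps in the same direction become progressively larger than a plain gradient step with the same learning rate.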


Geometry Module
In this module, we extracted the dimensions of the tomato fruit from the image using a reference object with known dimensions [27]. We also extracted the edge contours of all objects in the generated mask from the detection and mask generation modules and defined a minimum bounding rectangle for each object contour. Furthermore, we determined the pixel-per-metric ratio, which is a measure of the number of pixels per given metric observed from the reference object with actual dimensions, and obtained the real-world dimensions for these minimum bounding rectangles. These dimensions represent the actual length and width of the tomato instances, which are further fed into the final mass estimation module. We found that this method is fast and accurate, irrespective of the shape and number of tomato instances.
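The pixel-per-metric conversion described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the implementation used in this study: masks are plain nested lists of 0/1 values, the bounding rectangle is axis-aligned rather than a rotated minimum rectangle, and all function names and numbers are hypothetical.

```python
def mask_extent_px(mask):
    """Width and height (in pixels) of the axis-aligned box around a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask contains no foreground pixels")
    return max(cols) - min(cols) + 1, max(rows) - min(rows) + 1


def pixels_per_metric(reference_width_px, reference_width_mm):
    """Number of pixels that correspond to one millimetre on the reference object."""
    if reference_width_px <= 0 or reference_width_mm <= 0:
        raise ValueError("reference sizes must be positive")
    return reference_width_px / reference_width_mm


def box_dimensions_mm(width_px, height_px, ppm):
    """Convert a bounding box from pixels to millimetres."""
    return width_px / ppm, height_px / ppm


if __name__ == "__main__":
    # Hypothetical numbers: a 50 mm wide reference that spans 200 px in the image.
    ppm = pixels_per_metric(200, 50)
    print(box_dimensions_mm(260, 232, ppm))
```

With these illustrative numbers the ratio is 4 pixels per millimetre, so a 260 x 232 pixel box corresponds to 65.0 mm x 58.0 mm. The conversion is only valid when the fruit and the reference lie in approximately the same plane, as in the acquisition setup described earlier.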


Mass Estimation Module
The characterization results of our data show a high correlation between the dimensions and mass of the tomato samples. This correlation is depicted in Figure 5, with the Pearson correlation coefficient r used to illustrate the strength and direction of this linear relationship. We used various regression models in our mass estimation module to identify this complex physical relationship and predict the mass of a tomato fruit given its dimensions. This module was trained separately from the detection and mask generation module using the collected mass and dimensional features. The final mass predictions were based only on the dimensions extracted from the geometry module.
For our mass estimation regression model, we performed experiments using both parametric and non-parametric machine-learning algorithms, namely support vector regression (SVR) [29,36], bagged ensemble trees [37], Gaussian process regression (GPR) [38][39][40], and regression neural networks [41]. In non-parametric SVR, a prediction model is constructed in a manner similar to that for support vector machines (SVMs), except that SVR uses kernel functions to minimize the regression error instead of the classification error. SVR is a useful and flexible model that helps the user tackle limitations involving the distributional properties of the underlying variables, the geometry of the data, and the common problem of model overfitting. In particular, we found that quadratic and Gaussian radial basis function (RBF) kernels provided the best results for our dataset. The ensemble tree is another non-parametric machine-learning algorithm that combines several base decision tree models, sometimes known as weak learners, to produce an optimal predictive model, or strong learner, without overfitting the data. The goal is to reduce the variance of the model by randomly creating several subsets of data from the training set.
In our experiments, we obtained optimal results using a bagged tree with 30 learners and a minimum leaf size of 8.
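The bagging idea can be sketched in plain Python as follows. This is a simplified illustration and not the MATLAB implementation used in this study: the tree grower, the bootstrap scheme, all function names, and the synthetic data in the demo are assumptions made for the example; only the settings of 30 learners and a minimum leaf size of 8 are taken from the text.

```python
import random
from statistics import mean


def _sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(X, y, min_leaf=8):
    """Grow a regression tree; a node is a leaf value or (feature, threshold, left, right)."""
    best = None
    n = len(y)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [i for i in range(n) if X[i][f] <= t]
            right = [i for i in range(n) if X[i][f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # respect the minimum leaf size
            sse = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or sse < best[0]:
                best = (sse, f, t, left, right)
    if best is None:
        return mean(y)  # no admissible split: predict the mean target
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], min_leaf),
            build_tree([X[i] for i in right], [y[i] for i in right], min_leaf))


def predict_tree(node, x):
    """Follow the splits until a leaf value is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


def fit_bagged_trees(X, y, n_learners=30, min_leaf=8, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(y)
    trees = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(build_tree([X[i] for i in idx], [y[i] for i in idx], min_leaf))
    return trees


def predict_bagged(trees, x):
    """Average the predictions of all trees."""
    return mean(predict_tree(tree, x) for tree in trees)


if __name__ == "__main__":
    # Synthetic (width, length) -> mass data, made up for this demo only.
    rng = random.Random(1)
    X = [[rng.uniform(40, 80), rng.uniform(35, 70)] for _ in range(120)]
    y = [0.0005 * w * w * l for w, l in X]
    trees = fit_bagged_trees(X, y)
    print(predict_bagged(trees, [60.0, 50.0]))
```

Each tree sees a different bootstrap sample, so the individual trees make different errors; averaging them reduces the variance of the ensemble relative to any single tree.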
GPR models are a non-parametric Bayesian approach to the regression problem. They are known to capture various relations between inputs and outputs by exploiting an infinite number of parameters and allowing the data to determine the level of complexity through Bayesian inference. Based on the evaluation of various error measures, we obtained better performance using an exponential function kernel in a GPR model for our dataset. Most of the models discussed above were implemented and compared using the Statistics and Machine-learning Toolbox in MATLAB R2019a.
We also implemented a regression artificial neural network (ANN), which is a parametric machine-learning method, and optimized its hyperparameters, such as the number of hidden layers and neurons per layer, using a genetic algorithm [42,43]. Usually, selecting an ANN architecture, i.e., the number of hidden layers and the number of neurons in each hidden layer, relies on trial and error, which can be time-consuming and tedious. To address this issue, a genetic algorithm was used to automatically devise an optimal ANN architecture with improved generalization ability. A genetic algorithm is capable of searching for the global optimum in a complex, multimodal, and non-differentiable search space to determine the optimal ANN architecture. For the neural network, the number of hidden layers ranged from 1 to 4, and the number of neurons per layer was 64n, where n ranges from 1 to 6. The network used ReLU activations with the Adam optimizer. The number of generations was set to 10, with 20 networks in each generation.
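The architecture search space described above can be sketched as a small genetic algorithm. The search space (1 to 4 hidden layers, 64n neurons with n from 1 to 6, 10 generations of 20 networks) follows the text; the selection, crossover, and mutation operators are assumptions made for illustration, since the source does not specify them, and `toy_fitness` is a hypothetical stand-in for the validation error that would normally come from training each network.

```python
import random


def random_architecture(rng):
    """Sample 1-4 hidden layers, each with 64n neurons for n in 1..6."""
    depth = rng.randint(1, 4)
    return [64 * rng.randint(1, 6) for _ in range(depth)]


def mutate(arch, rng):
    """Resample the width of one randomly chosen layer (assumed operator)."""
    child = list(arch)
    child[rng.randrange(len(child))] = 64 * rng.randint(1, 6)
    return child


def crossover(a, b, rng):
    """Take the depth of one parent and pick each layer from a parent that has it."""
    depth = rng.choice([len(a), len(b)])
    return [rng.choice([p[i] for p in (a, b) if i < len(p)]) for i in range(depth)]


def evolve(fitness, generations=10, population_size=20, seed=0):
    """Keep the best quarter of each generation and refill it with offspring."""
    rng = random.Random(seed)
    population = [random_architecture(rng) for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness)  # lower fitness is better
        parents = ranked[: population_size // 4]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return min(population, key=fitness)


def toy_fitness(arch):
    """Hypothetical placeholder; a real run would train the ANN and return its error."""
    return abs(sum(arch) - 512) + 10 * len(arch)


if __name__ == "__main__":
    print(evolve(toy_fitness))
```

In practice the fitness call dominates the cost, because each candidate network has to be trained and validated before it can be ranked.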

Results and Discussion
In this section, we present a discussion of the qualitative and quantitative evaluation results for each module of our proposed system. First, we evaluated Mask-RCNN for detection and segmentation on our test data and on some random samples collected from the tomato farm. Furthermore, using the same test data instances, we evaluated the geometry and mass estimation modules using regression and error analysis, respectively. It must be noted that during the whole evaluation process, no attempt was made to remove any outliers from the training or test dataset using any preprocessing technique.
Figure 6 shows the convergence of the various loss functions mentioned in Section 2.2. In our experiments, we used the validation data to identify a training time sufficient for the model to reach the state of convergence on our dataset without overfitting. Figure 7 shows the detection and instance segmentation results for our test data samples. We also show the detection and segmentation results for random samples collected from a tomato field, as shown in Figure 8. As can be observed from these figures, Mask-RCNN shows good results even under challenging conditions without exhibiting any systematic artifacts under a single instance or multi-instance output setting. The output masks show that the segmentation of the tomato fruit agrees well with the ground truth even around the edges; however, a slight delineation around the edges of our reference can be noted. This problem can be avoided by using a suitable fixed reference in addition to providing more annotated data samples.
In this study, we used the standard COCO mean average precision (mAP) metric at mask intersection over union (IoU) thresholds of 0.5-0.95 with a step size of 0.05 to quantitatively report the performance of Mask-RCNN on our test samples.
During our experiments, we found that the model using the ResNet-101 backbone performed the best, and an average mask IoU of 96.05% and mAP of 92.28% were obtained with a detection accuracy and precision of 99.02% and 99.7%, respectively. In Table 1, we report the ablations using our test data to compare the ResNet backbones for Mask-RCNN. In our case, higher mask IoU and mAP are crucial for effective dimension extraction in the geometry module, because even a slight error in the generated semantic mask would accumulate with the error in the geometry module and result in a significant mass estimation error. Compared with the previous approaches mentioned in Section 2.2, the performance of Mask-RCNN for detection is comparable to its counterparts, even in the presence of multiple machine-vision challenges, such as illumination, occlusion, and the presence of multiple fruit instances in an unstructured scene. Moreover, several of these approaches detect tomato instances with common features in their ripe or unripe state. However, as can be observed from Figure 8, Mask-RCNN improves on this by detecting multiple instances occurring at variable states in an unstructured environment.
This improved performance is further supported by an additional characteristic of Mask-RCNN: it can segment all instances of multiple classes. Traditional methods fail to segment multiple adherent tomato fruits because they erroneously treat them as a single collective target, making it difficult to assign each instance to its respective class. Moreover, since Mask-RCNN was not trained on such images of clustered tomato samples, this indicates that it did not overfit to the training data. It can thus be inferred that using Mask-RCNN for detection and mask generation helps overcome the limited robustness and generalization toward complex scenarios associated with traditional artificial intelligence algorithms for tomato fruit detection and segmentation.
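As a rough illustration of the mask IoU underlying these figures, the following Python sketch computes the IoU of two binary masks and lists the ten COCO-style thresholds (0.50 to 0.95 in steps of 0.05) at which a prediction would count as a match. The function names and the toy 4 x 4 masks are hypothetical and are not taken from our dataset; a full mAP computation additionally ranks detections by confidence and integrates precision over recall, which is omitted here.

```python
def mask_iou(pred, truth):
    """IoU of two same-shape binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 0.0


# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """Thresholds at which a prediction with this IoU counts as a match."""
    return [t for t in COCO_THRESHOLDS if iou >= t]


# Toy example (hypothetical masks): the prediction misses one fruit pixel.
truth = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
pred = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
]

iou = mask_iou(pred, truth)          # 11 shared pixels / 12 pixels in the union
passed = thresholds_passed(iou)      # every threshold up to and including 0.90
print(f"IoU = {iou:.4f}, matched at {len(passed)} of {len(COCO_THRESHOLDS)} thresholds")
```

The example shows why a single missed pixel on a small mask already fails the strictest threshold, which is consistent with the emphasis placed here on high mask IoU.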

Evaluation of the Geometry Module
Even though Mask-RCNN effectively detects occluded tomato fruits, fruit dimensions can only be extracted when the entire object is visible. We used regression analysis to evaluate the results for the extracted dimensions of the tomato fruit from the output segmentation mask in the geometry module. The estimated outcome showed excellent correlation, displaying a strong relationship between the measured and calculated dimensions for our test dataset. This correlation is characterized by R² = 0.90 for fruit width estimation and R² = 0.97 for fruit length estimation (Figure 9).
Several statistical error measures were also used to evaluate the relationship between the estimated and real fruit dimensions in millimeters (mm). In particular, we report the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE) [44], and mean absolute percentage error (MAPE) [45] for our results. Table 2 shows the error analysis results for the tomato dimension estimation using our test data samples.
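For reference, the four error measures can be sketched in a few lines of Python. This is a minimal illustration under the standard definitions of MAE, MSE, RMSE, and MAPE; the function name `regression_errors` and the sample widths are hypothetical and are not values from Table 2.

```python
import math


def regression_errors(actual, predicted):
    """Return MAE, MSE, RMSE, and MAPE (in percent) for paired values."""
    n = len(actual)
    residuals = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the actual value, so actual values must be non-zero.
    mape = 100.0 * sum(abs(r) / abs(a) for r, a in zip(residuals, actual)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


# Hypothetical measured and estimated fruit widths in mm.
measured_mm = [60.0, 55.0, 70.0, 65.0]
estimated_mm = [62.0, 54.0, 66.0, 65.0]

errors = regression_errors(measured_mm, estimated_mm)
for name, value in errors.items():
    print(f"{name}: {value:.3f}")
```

Because the MSE squares each residual, the single 4 mm miss in this toy sample dominates the MSE and RMSE while contributing only linearly to the MAE.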
The error in the estimated fruit dimensions was caused by segmentation error, more specifically by the segmentation error of the reference object. The absence of depth information introduces an additional error when estimating the three-dimensional fruit size and comparing it with a flat reference object in a two-dimensional space. However, we find that an RMSE of 2.9 mm and 3.4 mm for fruit width and fruit length, respectively, is sufficient to estimate fruit dimensions from a single RGB image. The MAE and the RMSE can be used together to identify the variation in the errors in a set of estimations. The RMSE is always greater than or equal to the MAE; the greater the difference between the RMSE and the MAE, the greater the variance of the individual errors in the sample. The MSE reflects a tradeoff between bias and variance: the smaller the MSE, the closer the estimates are to the line of best fit. Given the accumulation of various errors during the segmentation phase described above, the MSE reported in Table 2 is close to the best achievable with the current setup. Similarly, MAPE is another statistical measure of the accuracy of a prediction system. The higher MAPE value in Table 2 reflects the fact that MAPE is most informative when the data contain no extremes or outliers. Moreover, these figures can be further improved by introducing more annotated data for the reference object while training Mask-RCNN, or alternatively by calibrating with a fixed reference in the image acquisition system (e.g., the camera), which would avoid the loss of object depth information when using a single RGB image at the time of data collection. This essentially requires anchors with known physical dimensions in the camera setup to be used as the reference instead of objects in the scene, in addition to the other adjustments required for camera calibration.
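To make the effect of reference segmentation error concrete, the following Python sketch converts mask extents from pixels to millimeters using a reference object of known width. It is an assumption-laden illustration rather than our geometry module: it uses axis-aligned bounding boxes, a single scale factor, and hypothetical helper names (`box_mask`, `extent_px`, `fruit_size_mm`) and sizes.

```python
def box_mask(rows, cols, top, left, height, width):
    """Build a rows x cols binary mask containing one filled rectangle."""
    return [[1 if top <= r < top + height and left <= c < left + width else 0
             for c in range(cols)] for r in range(rows)]


def extent_px(mask):
    """Return (width_px, height_px) of the bounding box of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        raise ValueError("empty mask")
    return max(cols) - min(cols) + 1, max(rows) - min(rows) + 1


def fruit_size_mm(fruit_mask, ref_mask, ref_width_mm):
    """Scale the fruit extents by the mm-per-pixel ratio of the reference."""
    ref_width_px, _ = extent_px(ref_mask)
    mm_per_px = ref_width_mm / ref_width_px
    width_px, length_px = extent_px(fruit_mask)
    return width_px * mm_per_px, length_px * mm_per_px


# Hypothetical scene: a 40 mm wide reference segmented as 20 px (2 mm per pixel).
fruit = box_mask(40, 60, top=5, left=10, height=25, width=30)
reference = box_mask(40, 60, top=10, left=5, height=10, width=20)
width_mm, length_mm = fruit_size_mm(fruit, reference, ref_width_mm=40.0)

# Same scene, but the reference mask is one pixel too narrow.
reference_short = box_mask(40, 60, top=10, left=5, height=10, width=19)
biased_width_mm, _ = fruit_size_mm(fruit, reference_short, ref_width_mm=40.0)

print(f"width = {width_mm:.1f} mm, length = {length_mm:.1f} mm")
print(f"width with a 1 px reference error = {biased_width_mm:.1f} mm")
```

In this toy scene, losing one pixel of the reference inflates every fruit dimension by the same factor (20/19), which illustrates why errors in the reference mask affect all estimated dimensions at once.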

Evaluation of the Mass Estimation Module
The final mass estimation results, in grams (g), were evaluated using the extracted dimensions from the geometry module. Each model was trained on 2444 collected dimension and mass instances. For a fair comparison and evaluation of the various artificial intelligence algorithms (Section 2.4) on our test data, we report results for both the manually measured fruit dimension features (X_r) and the estimated dimension features from the geometry module (X_p). Table 3 lists the performance indicators of these algorithms, using error analysis measures, for the manually measured test data dimensions (X_r). Table 4 lists the corresponding results for the dimensions of the test samples extracted by the geometry module (X_p).
As shown in Table 3 and Figure 10a, once the relationship between the dimensions and mass is established, one can readily estimate the fruit mass, given a constant fruit density. The positive correlation indicates promising estimation prospects on real-time test data. In addition, from Table 4, the minimum MAE, obtained using the bagged ensemble tree on the estimated fruit dimensions (X_p), is 13.03 g, which can be considered acceptable given the outliers in the test data, the absence of a large amount of variation in the training data samples, and the error in the geometry module. Moreover, this suggests that the bagged ensemble tree model achieves low bias and low variance without overfitting to the training data, compared with the other models on this particular dataset. Furthermore, as can be understood from Figure 10a,b, with improved segmentation and size estimation performance, the final mass can be estimated within a more acceptable standard error range. This effect is displayed in Figure 11, where we have plotted the real and estimated values for all test samples.
Figure 11 shows the plot of estimated mass values using the X_p and X_r features in the bagged ensemble tree model. As illustrated in the figure, the mass estimated using the manually measured fruit dimension features (X_r) agrees more closely with the real mass than the mass estimated using the dimension features from the geometry module (X_p). We also notice that the mass calculated using the dimension features from the geometry module (X_p) mostly follows the real mass values with slight error gaps, even in the presence of large variation and outliers in the test data. This again illustrates that the bagged ensemble tree model does not overfit to the training data in the mass estimation module. Overall, a vision-based tomato mass estimation system would provide an effective alternative to manual real-time measurement of tomato mass, which can be tedious and time consuming. With further improvements, this would also avoid the need for weighing devices during mass sorting and grading on a packaging line.
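The idea behind a bagged ensemble of regression trees can be sketched in plain Python as follows. This is only an illustrative sketch, not the implementation used in our experiments: the tree depth, leaf size, number of trees, and all function names are assumptions, and the training data are synthetic (mass generated from a hypothetical formula in width and length), not our collected samples.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(X, y, depth=0, max_depth=4, min_leaf=3):
    """Fit a small regression tree; a leaf is a float, a node is a dict."""
    if depth >= max_depth or len(y) < 2 * min_leaf:
        return sum(y) / len(y)
    best = None  # (error, feature index, threshold)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, f, thr)
    if best is None:
        return sum(y) / len(y)
    _, f, thr = best
    li = [i for i, row in enumerate(X) if row[f] <= thr]
    ri = [i for i, row in enumerate(X) if row[f] > thr]
    return {
        "feature": f,
        "threshold": thr,
        "left": fit_tree([X[i] for i in li], [y[i] for i in li],
                         depth + 1, max_depth, min_leaf),
        "right": fit_tree([X[i] for i in ri], [y[i] for i in ri],
                          depth + 1, max_depth, min_leaf),
    }


def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


def fit_bagged(X, y, n_trees=15, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_tree([X[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict_bagged(trees, row):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, row) for t in trees) / len(trees)


# Synthetic data (assumption): width and length in mm, mass in g.
data_rng = random.Random(42)
X = [[data_rng.uniform(40.0, 90.0), data_rng.uniform(35.0, 80.0)] for _ in range(120)]
y = [0.0005 * w * w * l for w, l in X]

trees = fit_bagged(X, y, n_trees=15, seed=0)
small = predict_bagged(trees, [45.0, 40.0])
large = predict_bagged(trees, [85.0, 75.0])
print(f"small fruit: {small:.1f} g, large fruit: {large:.1f} g")
```

Averaging trees trained on different bootstrap samples reduces the variance of any single tree, which is the property that makes this kind of model less prone to overfitting on a small dataset.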

Conclusions and Future Work
In this study, we developed a novel vision-based system for tomato fruit detection with dimension and mass estimation. The results highlight the robustness and accuracy of the overall system and support its applicability in the development of industry- or agriculture-based sorting and grading systems. The detection and segmentation modules showed good performance in terms of accuracy and robustness, with a mean IoU of 96.05%, mAP of 92.28%, detection accuracy of 99.02%, and precision of 99.7%. The trained model is also effective for detecting and segmenting multiple instances of tomato fruit in complex environmental scenarios. The estimated dimensions from the geometry module show a promising correlation with the actual dimensions. This performance, with MAEs of 2.34 and 2.58 for fruit length and width, respectively, is sufficient for related tasks, such as estimation of fruit growth, surface area, mass, and other related physical properties. Furthermore, based on our results, with a MAPE of 7.09% on our test data, the final mass estimation module can be readily applied to any axisymmetric fruit for mass estimation.
However, there are some limitations to this system. Estimating the mass of occluded tomato fruit from a single RGB image is a challenging task and should be addressed. In addition, in our work, the density of the fruits was set as a constant, whereas a number of tomato varieties have internal fruit structures that may exhibit variable densities. As a potential solution to this problem, we can determine the relationship for each type by training and categorizing each variety individually and treating it as a sub-class. This strategy can also help effectively detect and estimate the mass of multiple fruit types. Moreover, the proposed approach is suitable for systems where the acquisition of data is calibrated in a manner in which a single camera is used at right angles to the object surface. This makes the overall system cheaper, but at the cost of lower accuracy in the estimation of fruit mass. Nevertheless, the proposed system is a promising starting point toward the development of automatic sorting, grading, and measuring technologies based on machine vision.
An autonomous vision-based fruit detection system for dimension and mass estimation will play a revolutionary role in various agricultural, robotic, and packaging industries by reducing the required number of measuring instruments and the amount of manual labor. While this research highlighted the strength of our method on a small single-class dataset, future research will focus on detecting and estimating the mass of multiple classes using a common pipeline. Furthermore, because of the lack of a proper publicly available dataset on fruit dimensions and mass, the focus will be on growing the size of the training data to introduce more variation for improved performance. Another avenue of work to further improve the performance of our approach is to acquire depth images with 3-D information for volume computation, which would aid the regression models in improving mass estimation.