Strawberry Yield Prediction Based on a Deep Neural Network Using High-Resolution Aerial Orthoimages

Strawberry growers in Florida suffer from a lack of efficient and accurate yield forecasts for strawberries, which would allow them to allocate optimal labor and equipment, as well as other resources for harvesting, transportation, and marketing. Accurate estimation of the number of strawberry flowers and their distribution in a strawberry field is, therefore, imperative for predicting the coming strawberry yield. Usually, the number of flowers and their distribution are estimated manually, which is time-consuming, labor-intensive, and subjective. In this paper, we develop an automatic strawberry flower detection system for yield prediction with minimal labor and time costs. The system used a small unmanned aerial vehicle (UAV) (DJI Technology Co., Ltd., Shenzhen, China) equipped with an RGB (red, green, blue) camera to capture near-ground images of two varieties (Sensation and Radiance) at two different heights (2 m and 3 m) and built orthoimages of a 402 m² strawberry field. The orthoimages were automatically processed using the Pix4D software and split into sequential pieces for deep learning detection. A faster region-based convolutional neural network (R-CNN), a state-of-the-art deep neural network model, was chosen for the detection and counting of the number of flowers, mature strawberries, and immature strawberries. The mean average precision (mAP) was 0.83 for all detected objects at the 2 m height and 0.72 for all detected objects at the 3 m height. We adopted this model to count strawberry flowers in November and December from 2 m aerial images and compared the results with a manual count. The average deep learning counting accuracy was 84.1%, with an average occlusion of 13.5%. Using this system could provide accurate counts of strawberry flowers, which can be used to forecast future yields and build distribution maps to help farmers observe the growth cycle of strawberry fields.

illumination conditions presented a significant challenge. Deep learning has recently entered the domain of agriculture for image processing and data analysis [29]. Dyrmann et al. used convolutional neural networks (CNNs) to recognize 22 crop and weed species, achieving a classification accuracy of 86.2% [30]. CNN-based systems have also been increasingly used for obstacle detection, which helps robots or vehicles to locate and track their position and work autonomously in a field [31]. The framework of deep-level region-based convolutional neural network (R-CNN) [32] combines region proposals, such as the selective search (SS) [33] and edge boxes [34] methods, with CNNs, which improved mean average precision (mAP) to 53.7% on PASCAL 2010. Christiansen [35] used an R-CNN to detect obstacles in agricultural fields and proved that the R-CNN was suitable for a real-time system, due to its high accuracy and low computation time. Recent work in deep neural networks has led to the development of a state-of-the-art object detector, termed Faster Region-based CNN (Faster R-CNN) [36], which has been compared to the R-CNN and Fast R-CNN methods [37]. It uses a region proposal network (a fully convolutional network paired with a classification deep convolutional network), instead of SS, to locate regional proposals, which improves training and testing speed while also increasing detection accuracy. Bargoti and Underwood [38] adapted this model for outdoor fruit detection, which could support yield map creation and robotic harvesting tasks. Its precision and recall performance varied from 0.825 to 0.933, depending on the circumstances and applications. A ground-robot system using the Faster R-CNN method to count plant stalks yielded a coefficient of determination of 0.88 between the deep learning detection results and manual count results [39]. Sa et al. 
[40] explored a multi-modal fusion method to combine RGB and near infrared (NIR) image information and used a Faster R-CNN model which had been pre-trained on ImageNet to detect seven kinds of fruits, including sweet pepper, rock melon, apple, avocado, mango, and orange.
The objective of this study was to develop an automatic near-ground strawberry flower detection system based on the Faster R-CNN detection method and a UAV platform. This system was able to both detect and locate flowers and strawberries in the field, as well as count their numbers. With the help of this system, the farmers could build flower, immature fruit, and mature fruit maps to quickly, precisely, and periodically predict yields.

Data Collection
A strawberry field (located at 29.404265° N, 82.141893° W) was prepared at the Plant Science Research and Education Unit (PSREU) of the University of Florida in Citra, Florida, USA during the 2017-2018 growing season. The Florida strawberry season normally begins in December and ends in the following April [41]. The strawberry experiment field was 67 m long and 6 m wide, with five rows of strawberry plants, each being 67 m long and 0.5 m wide. Three rows were of the 'Sensation' cultivar and the other two were of the 'Florida Radiance' cultivar.
The UAV used to capture images of strawberry field was a DJI Phantom 4 Pro (DJI Technology Co., Ltd., Shenzhen, China), and its specifications are shown in Table 1. The drone works fully automatically, as long as the target area and flight parameters are pre-set at the ground control station. The Phantom 4 Pro has a camera with a one-inch, 20 megapixel sensor, capable of shooting 4 K/60 frames per second (fps) video. The camera used a mechanical shutter to eliminate rolling shutter distortion, which can occur when taking images of fast-moving subjects or when flying at high speeds. The global navigation satellite system (GNSS) uses satellites to provide autonomous geo-spatial positioning and the inertial navigation system (INS) continuously calculates the position, orientation, and velocity of the platform. These systems enabled the UAV to fly stably and to record the GPS position information at the time each image was taken, which is necessary for the digital surface model (DSM) and building orthoimages. The small size and low cost of this drone made it easy to carry and use, which was suitable for this study. The flight images were taken every two weeks, around 10:30 a.m.-12:30 p.m., from March to the beginning of June in 2018, for training and testing the deep neural network. An image of the calibrated reflectance panel (CRP), as shown in Figure 1, was taken before each flight for further radiometric correction in the Pix4D software (Pix4D, S.A., Lausanne, Switzerland) [42]. The CRP had been tested to determine its reflectance across the visible light captured by the camera. The flight image resolution was set to 3000 × 4000 pixels (JPG format). Three additional image sets were taken during the following growing season (in November and December 2018) with the same acquisition timing and resolution, which were used not only for training and testing the deep neural network, but also for comparison with the manual counts to check the accuracy of the model. 
In order to accommodate different weather conditions in the deep learning detection model, images were collected on both cloudy and sunny days. The specific imaging information and weather conditions are shown in Table 2.
Two different heights were explored for image acquisition, due to the small size of the flowers and fruit: one at 2 m and the other at 3 m, as shown in Figure 2. The drone could take higher-resolution images at 2 m, but only covered two rows in each image. On the other hand, at 3 m, three rows could be covered, but the plants had lower resolution in the images. In order to meet the 70% frontal overlap and 60% side overlap requirements for building orthoimages [43,44], the drone took an average of 185 images at 3 m, which took approximately 25 min for the whole field. At 2 m, the drone took approximately 40 min for the whole field, with an average of 479 images. All the flights were performed automatically by the DJI Ground Station Pro (DJI Technology Co., Ltd., Shenzhen, China) iPad application, which is designed to conduct automated flight missions and manage the flight data of DJI drones [45]. The three-meter height images were taken in March 2018 and the two-meter height images were taken from April to early June 2018. Another three sets of two-meter images were acquired in November and December 2018, during the following season, so we could evaluate the developed model on image sets acquired in a different season.
Figure 2. Two heights were chosen for the image acquisition: (a) at a three-meter height, three rows were acquired in a single image; and (b) at a two-meter height, two rows were acquired in a single image.

Orthoimage Construction
In order to identify the exact numbers of flowers and strawberries in every square meter, orthoimages were needed. An orthoimage is a raster image created by stitching aerial photos which have been geometrically corrected for perspective, so that the photos do not have any distortion. An orthoimage can be used to measure true distances, because it is adjusted for topographic relief, lens distortion, and camera tilt, and can present the Earth's surface accurately. With the help of orthoimages, we can locate the position of every flower and strawberry and build their distribution maps. The accuracy of the orthoimages was mainly based on the quality of the aerial images, which could be significantly affected by the camera resolution, focal length, and flight height. The ground sample distance (GSD) was the distance between pixel centers in the image measured on the ground, which can be used to measure the quality of aerial images and orthoimages, calculated with the formula:

GSD = (H × λ) / c

where GSD is the ground sample distance, H is the flying height, c is the camera focal length, and λ is the camera sensor pixel size. For the 3 m height images, the GSD was 2.4 mm and, for the 2 m height images, the GSD was 1.6 mm. The process of building orthoimages is shown in Figure 3.

Remote Sens. 2019, 11, x FOR PEER REVIEW 6 of 21
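For reference, the GSD relation can be computed directly from the flight and camera parameters; the focal length and pixel size in the example below are illustrative placeholders, not the exact Phantom 4 Pro parameters:

```python
def ground_sample_distance(height_m, focal_length_mm, pixel_size_um):
    """GSD = H * λ / c : ground distance covered by one pixel, in mm."""
    h_mm = height_m * 1000.0           # flying height H, converted to mm
    pixel_mm = pixel_size_um / 1000.0  # sensor pixel size λ, converted to mm
    return h_mm * pixel_mm / focal_length_mm

# illustrative values only: 3 m flight height, 8.8 mm focal length, 2.4 µm pixel pitch
print(round(ground_sample_distance(3, 8.8, 2.4), 3))  # → 0.818
```

A smaller GSD (lower flight height, longer focal length, or finer pixel pitch) means finer ground detail per pixel, which is why the 2 m flights produced sharper orthoimages than the 3 m flights.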

Figure 3. Steps for generating orthoimages. An orthoimage is a raster image created by stitching aerial photos, and the process is a series of methods to make the orthoimage represent the same area and distance as in the real world.
Firstly, aerial triangulation was processed to determine all image orientations and surface projections, with the help of the GPS and position orientation system (POS) information provided by the drone, using a pyramid matching strategy and bundle adjustment to match key points on each level of images [46]. Bundle adjustment [47] treats the measured area as a whole block and uses a least-squares method to meet the corresponding space intersection conditions, which can be explained by the following formula:

min Σ(i=1..n) Σ(j=1..m) v_ij ‖Q(a_j, b_i) − x_ij‖²

Assuming the image feature error follows a Gaussian distribution, the number of 3D points is n and the number of images is m; x_ij is the actual projected coordinate of point i on image j. If point i was visible on image j, v_ij = 1; otherwise, v_ij = 0. Q(a_j, b_i) is the predicted projection coordinate of point i on image j. The formula minimizes the projected error of the 3D points onto the images and obtains more precise image orientations and 3D points.
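The objective that bundle adjustment minimizes can be sketched in a few lines (only the cost function, not the non-linear least-squares solver that optimizes it); the array shapes below are assumptions for illustration:

```python
import numpy as np

def reprojection_error(predicted, observed, visible):
    """Sum over visible (i, j) pairs of ||Q(a_j, b_i) - x_ij||^2.

    predicted, observed: arrays of shape (n_points, n_images, 2)
    visible: 0/1 array of shape (n_points, n_images) — the v_ij indicator
    """
    diff = predicted - observed
    return float(np.sum(visible[..., None] * diff ** 2))

# one 3D point, two images: off by 1 px in x and y in image 0, hidden in image 1
predicted = np.array([[[1.0, 1.0], [9.0, 9.0]]])
observed = np.array([[[0.0, 0.0], [0.0, 0.0]]])
visible = np.array([[1, 0]])
print(reprojection_error(predicted, observed, visible))  # → 2.0
```

The v_ij mask matters: points hidden in a given image contribute nothing to the cost, so only actual observations constrain the camera orientations and 3D point positions.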
After the point-cloud model was generated for the irregular distribution cloud data, mesh networks were needed to store and read the surface information of objects. An irregular mesh network, such as a triangulated irregular network (TIN) [48], was used to join discrete points into triangles that covered the entire area without overlapping with each other. Thus, it established a spatial relationship between discrete points. Using the Markov random field method [49], each mesh network matched the best suitable image as its model texture, based on the spatial positions and corresponding visible relationships.
The relationship of the x, y image co-ordinate to the real-world co-ordinate was calculated for orthorectification; that is, to remove the effects of image distortion caused by the sensor and viewing perspective. Similarly, a mathematical relationship between the ground co-ordinates, represented by the mesh model and the real-world co-ordinate, was computed and used to determine the proper position of each pixel from the source image to the orthoimage. The orthoimage's distance and area are uniform in relationship to real-world measurements.

Data Pre-Processing
In order to identify objects in images using Faster R-CNN, the locations and classes of the objects need to be determined first. Faster R-CNN requires bounding box annotation for object localization and detection. Annotations of three different objects were collected using rectangular bounding boxes, as shown in Figure 4: flowers with white color, strawberries with red color, and immature strawberries with green or yellow color. All labels were created manually using the labelImg software developed by the Computer Science and Artificial Intelligence Laboratory (MIT, MA, USA).

All the labeled images came from the orthoimages. The orthoimages were split into small rectangular images (480 × 380 pixels) to train the Faster R-CNN model faster. Every small image had its own sequence name, so it would be easy to restore the orthoimages from the small images after detection.
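The tile-splitting step can be sketched as follows; the tile naming scheme and array shapes are illustrative assumptions, not the exact pipeline used in this study:

```python
import numpy as np

def split_into_tiles(image, tile_h=380, tile_w=480):
    """Split an image array (H, W, 3) into tiles keyed by sequential names, so
    the full orthoimage can be re-assembled after detection. Edge tiles may be
    smaller when the image dimensions are not exact multiples of the tile size."""
    tiles = {}
    for r, y in enumerate(range(0, image.shape[0], tile_h)):
        for c, x in enumerate(range(0, image.shape[1], tile_w)):
            tiles[f"tile_{r:03d}_{c:03d}"] = image[y:y + tile_h, x:x + tile_w]
    return tiles

ortho = np.zeros((760, 960, 3), dtype=np.uint8)  # toy 2-tile-by-2-tile orthoimage
tiles = split_into_tiles(ortho)
print(len(tiles))  # → 4
```

Encoding the row and column indices in each tile name is what makes restitching after detection a simple lookup rather than a matching problem.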
The images from March to early June were labeled for training the Faster R-CNN model; the total number was 12,526. Of these, 4568 were from the three-meter height image set and 7958 of them were from the two-meter height image set. Ten objects of interest were chosen for detection: flower at 2 m, flower at 3 m, Sensation strawberry at 2 m, Sensation strawberry at 3 m, Sensation immature at 2 m, Sensation immature at 3 m, Radiance strawberry at 2 m, Radiance strawberry at 3 m, Radiance immature at 2 m, and Radiance immature at 3 m. Five-fold cross-validation was used to train and test the model. In five-fold cross-validation, the original sample is randomly divided into five equal-size sub-samples. One of the five sub-samples was retained as the validation data for testing the model, and the remaining four sub-samples were used as training data. The cross-validation process was then repeated five times, with each of the five sub-samples being used exactly once as the validation data. Then, the results from the five iterations were averaged (or otherwise combined) to produce a single estimation. The advantage of this method was that all observations were used for both training and validation, and each observation was used for validation exactly once. The numbers of training and test images for each object are shown in Table 3. All the objects detected at the same height (for example, flower at 2 m, Sensation mature and immature fruit at 2 m, and Radiance mature and immature fruit at 2 m) shared the same image set.
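The five-fold procedure described above can be sketched as follows (a generic illustration, not the exact random split used in this study):

```python
import random

def five_fold_splits(items, seed=0):
    """Randomly divide items into five near-equal folds; each fold serves as
    the validation set exactly once while the other four are used for training."""
    items = items[:]
    random.Random(seed).shuffle(items)
    folds = [items[i::5] for i in range(5)]
    for k in range(5):
        train = [x for i, fold in enumerate(folds) if i != k for x in fold]
        yield train, folds[k]

splits = list(five_fold_splits(list(range(100))))
print(len(splits))                           # → 5
print(len(splits[0][0]), len(splits[0][1]))  # → 80 20
```

Because the folds are disjoint and exhaustive, every labeled image is used for validation exactly once and for training exactly four times, which is the property that makes the averaged result a reliable performance estimate.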

Model Training
We used the Faster R-CNN method based on the ResNet-50 architecture [50] in this study. As shown in Figure 5, instead of stacking layers directly to fit a desired underlying mapping, as in the VGG nets [51], ResNet-50 introduces a deep residual learning framework to fit a residual mapping, which helps to address the degradation problem [52]. The deep residual learning framework is composed of a stack of residual blocks, each of which consists of a small network plus a skip connection. If an identity mapping is optimal, it is easier to push the residual to zero through the skip connection than to fit an identity mapping with a stack of non-linear layers. Compared with VGG-16/19, ResNet-50 has lower error rates on ImageNet validation and lower complexity.
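The residual idea can be illustrated with a minimal NumPy sketch (a toy two-layer block, not the actual ResNet-50 implementation, whose blocks use convolutions and batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, w1, w2):
    """y = F(x; w1, w2) + x : the skip connection adds the input back, so the
    stacked layers only need to learn the residual F(x), not the full mapping."""
    f = relu(x @ w1) @ w2  # small two-layer network F(x)
    return relu(f + x)     # identity shortcut

# if the weights are zero, F(x) = 0 and the block reduces to the identity
# mapping (for non-negative inputs), which is exactly the easy-to-learn case
x = np.array([[1.0, 2.0]])
w_zero = np.zeros((2, 2))
print(residual_block(x, w_zero, w_zero))  # → [[1. 2.]]
```

This is the degradation argument in miniature: driving the residual weights toward zero recovers the identity, whereas a plain stack of non-linear layers would have to fit the identity explicitly.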
The whole structure of Faster R-CNN [36] is shown in Figure 6. It consists of convolutional layers (ResNet-50), a region proposal network (RPN), an ROI pooling layer, and a classifier. The convolutional layers were used for extracting image features to be shared by the RPN and the classifier. The feature maps were first operated on by a 3 × 3 convolutional layer. Then, region proposals were generated in the RPN by classifying the feature vectors for each region with the softmax function and locating the boundaries with bounding box regression. The proposals generated by the RPN had different shapes, which could not be operated on by the fully connected layers; therefore, ROI pooling collects the features and proposals from the former layers and passes the max-pooled values to the classifier. The classifier then determines whether the region belongs to an object class of interest. Compared with R-CNN [32] or Fast R-CNN [37], the RPN of Faster R-CNN shares convolutional features with the classification network, where the two networks are concatenated as one network which can be trained and tested through an end-to-end process. This architecture makes the running time for region proposal generation much shorter. The model was pre-trained on ImageNet and fine-tuned by initializing a new classification layer and updating all layers for both the region proposal and classification networks. This process is called transfer learning. The training iteration was 5000, with a basic learning rate of 0.01.

Figure 6. Faster R-CNN is a single, unified network for object detection. The region proposal network (RPN) shares convolutional features with the classification network, where the two networks are concatenated as one network that can be trained and tested through an end-to-end process, which makes the running time for region proposal generation much shorter.

Orthoimages Generation
We used the Pix4D software to generate high-resolution orthoimages. As the two-meter height image sets had more images, their point cloud models were more dense and complete. The mesh surface models were established based on the point cloud, as shown in Figure 7. The image shown in Figure 8 was the final orthoimage from the mesh surface models, a 2D projection of the 3D model.

The orthoimage and Faster R-CNN processes were performed on the image dataset with a desktop computer consisting of an NVIDIA TITAN X (Pascal) 12 GB graphics card (NVIDIA, Santa Clara, CA, USA) and an Intel Core(TM) i7-4790 CPU @ 4.00 GHz (Intel, Santa Clara, CA, USA). The algorithms were performed in TensorFlow on the Windows 7 operating system.

The upper three rows in the orthoimages contained the Sensation variety and the other two rows were Radiance. There was some distortion at the edges of the orthoimages, due to the lack of image overlap. However, this did not affect the counting of the number of flowers and strawberries.


Faster R-CNN Detection Results
Both quantitative and qualitative measurements were taken to evaluate the performance of Faster R-CNN detection in three experimental settings: (1) we trained the Faster R-CNN model on the image sets from March to early June 2018, and analyzed its performance; (2) we compared the Faster R-CNN detection results of 2 m images and 3 m images; and (3) we used the Faster R-CNN model trained on the image sets from March to early June to count flower numbers using the image sets from November to December 2018. The deep learning counting results were compared with the manual count to check the deep learning counting accuracy and calculate flower occlusion.

Quantitative Analysis of Faster R-CNN Detection Performance
The correctness of a detected object was evaluated by its intersection-over-union (IoU) overlap with the corresponding ground truth bounding box. The IoU overlap was defined as follows:

IoU = Area(GroundTruth ∩ Detected) / Area(GroundTruth ∪ Detected)

where Area(GroundTruth ∩ Detected) is the intersection area of the prediction and ground truth bounding boxes and Area(GroundTruth ∪ Detected) is the union area of the prediction and ground truth bounding boxes. A detection was considered to be a true positive (TP) if the IoU was greater than the threshold value. If a detected object did not match a ground truth bounding box, it was considered to be a false positive (FP). A false negative (FN) was recorded if a ground truth bounding box was missed. We chose 0.5 for the threshold value, which means that, if the IoU between the prediction and ground truth bounding boxes was greater than 0.5, the detection was considered to be a TP. This was the same as in the ImageNet challenge. Precision and recall were calculated according to the following equations:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

The single-class detection performance was measured with the average precision (AP), which is the area under the precision-recall curve [53]. The overall detection performance was measured with the mean average precision (mAP) score, which is the average AP value over all classes. The higher the mAP, the better the overall detection performance of the Faster R-CNN. The detection performance of the Faster R-CNN is shown in Table 4. As can be seen from Table 4, the detection performance on the 2 m image set increased significantly, compared to that on the 3 m set, for the detection of flowers, immature fruit, and mature fruit. The best detection results were for Radiance mature fruit at 2 m (94.5%), Sensation mature fruit at 2 m (86.4%), and flowers (from both varieties) at 2 m (87.9%). The worst results were for Radiance immature fruit at 3 m (70.2%), flowers (from both varieties) at 3 m (77.5%), and Sensation immature fruit at 3 m (74.2%).
The mature fruit detection performances were much better than those of the immature fruit, but the gap decreased at 2 m. The total mAP was 0.83 for all 2 m objects and 0.72 for all 3 m objects.
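The IoU criterion used above can be implemented directly from its definition; a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # intersection rectangle (empty if the boxes do not overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# a detection shifted by half a box width against its ground truth
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # → 0.3333...
```

Note that the shifted box in the example scores only 1/3, below the 0.5 threshold, so it would be counted as a false positive even though it partially overlaps the ground truth.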

Qualitative Analysis for Different Heights' Detection Results
The small (480 × 380 pixel) images were stitched back into the original orthoimages after Faster R-CNN detection. Figure 9 shows orthoimage detection examples at the two-meter and three-meter heights. There were some blurred parts, which were caused by the strong wind produced by the propellers of the drone. The strong wind could help the camera capture more flowers and fruits hidden under the leaves, but also caused the strawberry plants to shake slightly when the drone was flying close to the ground, which may have affected the quality of the aerial images for orthoimage building. The leaves were more susceptible to wind than the flowers and strawberries, so most of the blurred or distorted parts were in the leaves, rather than the flowers or fruits, which barely affected the detection results. This phenomenon was more common in the 2 m height orthoimages, as the drone flew closer to the ground. In both images, the model detected flowers precisely, even though some of them were partially covered by leaves.
However, in both images, the model confused some mature and immature strawberries with dead leaves. At 3 m height, there were more false detections for mature and immature fruit than at 2 m heights. has lower resolution than that at 2m height. The detection results for the flowers and mature fruit were more precise than those of the immature fruit. The model could discover flowers, even when part of the petal was hidden by leaves. However, the model confused immature fruit with green leaves and mature fruit with dead leaves in some areas.
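The tiling bookkeeping implied by this stitch-back step can be sketched as follows (a minimal illustration with hypothetical function names, not the authors' implementation):

```python
TILE_W, TILE_H = 480, 380  # size of the split images fed to Faster R-CNN

def tile_origins(ortho_w, ortho_h, tile_w=TILE_W, tile_h=TILE_H):
    """Top-left corners of the sequential tiles covering an orthoimage."""
    return [(x, y)
            for y in range(0, ortho_h, tile_h)
            for x in range(0, ortho_w, tile_w)]

def to_ortho_coords(box, origin):
    """Shift a tile-local box (x1, y1, x2, y2) back into orthoimage coordinates."""
    ox, oy = origin
    x1, y1, x2, y2 = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)
```

Detections made on each tile are shifted by that tile's origin, so the stitched-back bounding boxes line up with the original orthoimage.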
In order to check the effect of different heights on detection performance, we compared the detection results between the 2 m and 3 m split orthoimages. Figures 10 and 11 show some qualitative results for the 2 m and 3 m split orthoimages.
We can see that the 2 m images were clearer and more precise than the 3 m images at the same scale level. The 3 m images had more blur and distortion problems. Therefore, there were many more FP for mature and immature fruit in the 3 m images than in the 2 m ones.

Comparison of Deep Learning Count and Manual Count
We used the Faster R-CNN model, trained on the images from March to early June, to count the flower numbers in the images from November and December 2018. We manually counted the number of flowers in the field before flying the drone to capture images, so that the manual count could be compared with the deep learning count. For each bounding box prediction, the neural network also output a confidence (between 0 and 1), indicating how likely it was that the proposed box contained the correct object. A threshold was used to remove all predictions with a confidence below it. Increasing the confidence threshold keeps fewer predictions, so recall decreases but precision should increase; conversely, decreasing the threshold improves recall while potentially decreasing precision. Higher values of both precision and recall are preferable, but they are typically inversely related, so the score threshold should be properly adjusted, depending on the circumstances and applications. In this study, we set the confidence threshold to 0.85 for the deep learning counting, because we noted that most FP in flower detection had a relatively low confidence (below 0.85), whereas truly detected objects were all above 0.85.
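The thresholded counting step amounts to a simple filter over the model's outputs. A hedged sketch (the prediction record format here is hypothetical, not the actual Faster R-CNN output structure):

```python
CONF_THRESHOLD = 0.85  # most flower FPs scored below this value in this study

def count_class(predictions, label, threshold=CONF_THRESHOLD):
    """Count predictions of one class whose confidence meets the threshold."""
    return sum(1 for p in predictions
               if p["label"] == label and p["score"] >= threshold)

# Illustrative use: only the high-confidence flower box is counted.
preds = [{"label": "flower", "score": 0.92},
         {"label": "flower", "score": 0.61},
         {"label": "mature", "score": 0.95}]
```

Raising `threshold` trades recall for precision, which is exactly the trade-off described above.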
Although the wind generated by the drone propellers could expose some of the flowers hidden under the leaves, some flowers remained hidden and could not be captured in the flight images. Thus, the effect of occlusion needed to be considered when using the deep learning model to count the flowers. Occlusion and deep learning count accuracy were calculated accordingly; the counting accuracy is given by Equation (7):

Accuracy = Deep learning count / Manual count (7)

Table 5 shows the comparison between the manual count and the deep learning count of the number of flowers. The average accuracy of the deep learning count is 84.1%, and the average occlusion is 13.5%. We can see that the occlusion and the FN number increased as the number of flowers decreased. Generally, when flowers make up the majority in the field, the proportion of mature and immature fruit is relatively small; this stage is followed by a burst of fruit growth, during which most flowers turn into (mature and immature) fruit. Thus, a decrease in the number of flowers means that more flowers have become immature or mature fruit and that the leaves have grown larger to feed the fruit, which leads to an increase in occlusion, making the flowers harder to detect.
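Equation (7) reduces to a one-line computation (the companion occlusion formula is not reproduced in this excerpt, so this sketch covers accuracy only; the example counts are placeholders consistent with the reported 84.1% average):

```python
def counting_accuracy(deep_learning_count, manual_count):
    """Equation (7): Accuracy = deep learning count / manual count."""
    return deep_learning_count / manual_count
```

For instance, a deep learning count of 841 flowers against a manual count of 1000 gives an accuracy of 0.841.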

Comparison of Region-Based Object Detection Methods
Region-based CNN frameworks have been commonly used in the area of object detection. In this section, we compare the detection performance of Faster R-CNN with other region-based object detection methods, including R-CNN [32] and Fast R-CNN [37]. These detection models are based on the same architecture (ResNet-50) and were trained on the same strawberry dataset. We used selective search (SS) to extract 2000 region proposals for the R-CNN and Fast R-CNN models and an RPN to generate 400 proposals for the Faster R-CNN model [36]. The results, summarized in Table 6, show that Faster R-CNN achieved a detection rate of 8.872 frames per second (FPS), much faster than the R-CNN and Fast R-CNN methods. The Faster R-CNN method also had the highest mAP score (0.772) and the lowest training time (5.5 h). It is clear that the performance of the Faster R-CNN model exceeded those of the R-CNN and Fast R-CNN models.

Flower and Fruit Distribution Map Generation
It usually takes only a few weeks for flowers to become fruit for both Sensation and Radiance varieties. In order to observe the growth cycle of strawberry fields, we built distribution maps of flowers and immature fruit on 13 April and immature and mature fruit on 27 April, based on the numbers and locations calculated by Faster R-CNN. The numbers of flowers, mature fruit, and immature fruit (including both TP and FP) detected by Faster R-CNN on 13 April and 27 April are shown in Table 7. Flower and fruit distribution maps were created by ArcMap 10.3.1 (ESRI, Redlands, USA). Inverse distance weighting (IDW) was used as an interpolation method. The field was divided into 30 areas (each row was divided into six areas) and the numbers of flowers, mature fruit, and immature fruit of each area were counted and placed at the center point of each area. For the IDW, the variable search radius was 10 points and the power was 2. Distribution maps of flowers on 13 April and immature fruit on 27 April are shown in Figure 12.
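The IDW principle behind these maps can be sketched as follows (a minimal, hedged illustration with power 2, as used in the study; ArcMap's variable search radius of 10 points is not replicated, so all samples are weighted here):

```python
def idw(x, y, samples, power=2):
    """Inverse distance weighting: samples = [(sx, sy, value), ...]."""
    num = den = 0.0
    for sx, sy, value in samples:
        d2 = (x - sx) ** 2 + (y - sy) ** 2
        if d2 == 0.0:                # query point coincides with a sample
            return value
        weight = d2 ** (-power / 2)  # weight = 1 / distance**power
        num += weight * value
        den += weight
    return num / den
```

A point midway between two equally distant samples receives their simple average, while points near a sample are pulled toward that sample's value.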
By comparing these two maps, we can see that the distribution of flowers on 13 April had many similarities to the distribution of immature fruit on 27 April. Both had high production in the central part of the field and relatively low production at the east and west edges, which means most flowers became immature fruit after two weeks.
Both flowers and immature fruit on 13 April could become mature fruit on 27 April, so we compared them in Figure 13.
We can see that the mature fruit map of 27 April shows a similar trend to both the flower map of 13 April and the immature fruit map of 13 April. In the west part, the mature fruit map is more similar to the immature fruit map; on the other hand, the east part is more similar to the flower map. The central part of the mature fruit map looks more like a combination of the flower and immature fruit maps. Thus, both the flowers and immature fruit on 13 April contributed to the mature fruit distribution on 27 April.

Discussion
In order to quickly count and locate flowers and fruit in the strawberry field with the help of a normal consumer drone, we stitched the images captured by the drone together and transformed them into an orthoimage. An orthoimage is a raster image that has been geometrically corrected for topographic relief, lens distortion, and camera tilt, which accurately represents the Earth's surface and can be used to measure true distance [18,54]. The quality of orthoimages mainly depends on the quality and overlap of the aerial images. The frontal overlap is usually 70-80% and the side overlap is usually no less than 60%. For the same overlap conditions, the closer the aircraft is to the ground, the finer (smaller) the GSD of the images will be, which helps the detection system perform better. However, it also takes more time and consumes more battery power to take images at a lower altitude, which would reduce efficiency and drone life. The specific working altitude should be adjusted according to the environmental conditions and task requirements. Many studies [25,44,55,56] set the frontal overlap around 80% and the side overlap around 70%. Most of their drones flew above 50 m in height and had relatively coarse GSD values. In our experiments, the flight images were taken near the ground, so we set the frontal overlap to 70% and the side overlap to 60%, in order to increase flight efficiency while meeting the orthoimage building requirements.
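The altitude-resolution trade-off can be made concrete with the standard GSD relation (a hedged sketch; the camera parameters in the test values below are simple placeholders, not the sensor specifications of the drone actually used):

```python
def ground_sample_distance(sensor_w_mm, focal_mm, image_w_px, altitude_m):
    """GSD in mm/pixel: ground width covered by one pixel at a given altitude."""
    return sensor_w_mm * (altitude_m * 1000.0) / (focal_mm * image_w_px)

def shot_spacing(footprint_m, overlap):
    """Distance between consecutive image centers for a given overlap fraction."""
    return footprint_m * (1.0 - overlap)
```

Since GSD scales linearly with altitude, flying lower always yields a finer GSD, but a 70% frontal overlap spaces shots only 30% of a footprint apart, so low flights require many more images per field.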
Some distortions happened at the edges of the orthoimages, due to the lack of image overlap there. More flight routes will be used to cover the field edges in our next experiment. There were also some blurred or distorted parts in the plant areas of the orthoimages, caused by the strong wind produced by the drone as it flew across the field. These were more common in the 2 m orthoimages, as the drone flew closer to the ground. Most of the blurred or distorted parts happened in the leaf areas, which were more susceptible to wind than the flowers and fruits; the flowers and fruits were barely affected. As a bonus, the wind could actually help the camera capture more flowers and fruit hidden under the leaves, so more flowers and fruit were detected in the 2 m orthoimages.

Object detection is the task of finding different objects in an image and classifying them. R-CNN [32] was the first region-based object detection method. It selects multiple high-quality proposed regions using the selective search method [33] and labels the category and ground-truth bounding box of each proposed region. Then, a pre-trained CNN transforms each proposed region into the input dimensions required by the network and uses forward computation to output the feature vector extracted from the proposed region. Finally, the feature vectors are sent to linear support vector machines (SVMs) for object classification and then to a regressor to adjust the detection position. Fast R-CNN [37] inherited the framework of R-CNN but performs the CNN forward computation on the image as a whole and uses a region-of-interest pooling layer to obtain fixed-size feature maps. Faster R-CNN replaced the selective search method with a region proposal network (RPN), which reduces the number of proposed regions generated while ensuring precise object detection. We compared the performances of R-CNN, Fast R-CNN, and Faster R-CNN on our dataset.
The results showed that Faster R-CNN had the lowest training time, the highest mAP score, and the fastest detection rate. So far, Faster R-CNN is the best region-based object detection method for identifying different objects and their boundaries in images. In our detection system, we fine-tuned a Faster R-CNN detection network, initialized from a pre-trained ImageNet model, which gave state-of-the-art performance on the split orthoimage data. The average precisions varied from 0.76 to 0.91 for the 2 m images and from 0.61 to 0.83 for the 3 m images. Detection of flowers and mature fruit worked well, but immature fruit detection did not meet our expectations. The shapes and colors of immature fruit were sometimes very similar to those of dead leaves, which was the main reason for the poor results; more images are needed for future network training. Additionally, there were always some occlusion problems, where flowers and fruit hidden under the leaves could not be captured by the camera. This occlusion varied slightly over the different growth stages of the strawberries: when more flowers turned to fruit, the leaves tended to expand in order to deliver more nutrients to the fruit. The occlusion in our field was around 11.5% in November and 15.2% in December 2018. Further field experiments are needed to identify the different seasonal occlusions, so that we can establish an offset factor to reduce the counting errors of the deep learning detection.
We chose IDW for the interpolation of the distribution maps. IDW is a method of interpolation that estimates cell values by averaging the values of sample data points in the neighborhood of each processing cell. The closer a point is to the center of the cell being estimated, the more influence (or weight) it has in the averaging process. Kriging is an advanced geostatistical procedure that generates an estimated surface from a scattered set of points with z-values; however, it requires many more data points. A thorough investigation of the spatial behavior of the phenomenon represented by the z-values should be done before selecting the best interpolation method for generating the output surface. In many studies, Kriging interpolation has been reported to perform better than IDW. However, this is highly dependent on the variability in the data, distance between the data points, and number of data points available in the study area. We will try both methods with more data in the future, and better results may be obtained by comparing multiple interpolation results with actual counts in the field and acquired images.

Conclusions
In this paper, we presented a deep learning strawberry flower and fruit detection system, based on high resolution orthoimages reconstructed from drone images. The system could be used to build yield estimation maps, which could help farmers predict the weekly yields of strawberries and monitor the outcome of each area, in order to save their time and labor costs.
In developing this system, we used a small UAV to take near-ground RGB images for building orthoimages at 2 m and 3 m heights, where the GSD was 1.6 mm and 2.4 mm, respectively. After their generation, we split the original orthoimages into sequential pieces for Faster R-CNN detection, which was based on the ResNet-50 architecture and transfer learning from ImageNet, to detect 10 objects. The results were presented in both a quantitative and qualitative way. The best detection performance was for mature fruit of the Sensation variety at 2 m, with an AP of 0.91. Immature fruit of the Radiance variety at 3 m was the most difficult to detect (since the model tended to confuse it with green leaves), having the worst AP of 0.61. We also compared the number of flowers counted by the deep learning model with the manual count and found the average deep learning counting accuracy to be 84.1%, with an average occlusion of 13.5%. Thus, this method has proven that it can be used to count flower numbers effectively.
We also built distribution maps of flowers and immature fruit on 13 April and of immature and mature fruit on 27 April, based on the numbers and distributions calculated by Faster R-CNN. The results showed that the mature fruit map of 27 April had a clear connection with the flower and immature fruit maps of 13 April. The flower distribution map of 13 April and the immature fruit map of 27 April also showed a strong relationship, which proved that this system could help farmers monitor the growth of strawberry plants.