Automated Health Estimation of Capsicum annuum L. Crops by Means of Deep Learning and RGB Aerial Images

Recently, the use of small UAVs for monitoring agricultural land has been increasingly adopted by agricultural producers in order to improve crop yields. However, correctly interpreting the collected imagery data is still a challenging task. In this study, an automated pipeline for monitoring C. annuum crops based on a deep learning model is implemented. The system is capable of performing inferences on the health status of individual plants, and of determining their locations and shapes in a georeferenced orthomosaic. Accuracy achieved on the classification task was 94.5%. AP values among classes were in the range [63, 100] for plant location boxes, and in [40, 80] for foliar area predictions. The methodology requires only RGB images, so it can be replicated for the monitoring of other types of crops by employing only consumer-grade UAVs. A comparison with random forest and large-scale mean-shift segmentation methods, which use predetermined features, is presented. NDVI results obtained with multispectral equipment are also included.


Introduction
Chili pepper (Capsicum sp.) is one of the most widely known condiments worldwide. In particular, the Capsicum annuum L. species is of great economic importance in Mexico, since it has the largest distribution in the country [1]. During the nursery stage, and throughout the production process, the crop is affected by several types of microorganisms that cause diseases in seedlings. As a consequence, plant counts and fruit production volume are reduced, which represents a major problem for farmers. The epidemiological monitoring of crops makes it possible to know the health status of the total plant population, helping with the timely implementation of preventive or corrective agronomic practices, and thus allowing maximum yields to be obtained at the lowest cost. For the implementation of technified on-field monitoring, it is necessary to develop an effective inspection plan, including sampling patterns, determination of unit sizes, number of samples, and the establishment of a set of severity scales to evaluate standard area diagrams (SADs) [2]. Then, a database with the records of individual cases must be kept. To perform these tasks, trained personnel are required who can generate reliable results for disease severity estimates. Usually, a great deal of human labor is needed to perform the entire monitoring process as described above, which increases total production costs; therefore, effective automation methods for such labor have a major impact on the adequate use of production resources [3]. Technological help for continuously monitoring C. annuum crops at early plant life stages is necessary to optimize the management of detected diseases. There exist many techniques to approach the problem of disease detection in crops using remote sensing. Nowadays, equipping UAVs with different types of sensors to acquire crop images is a common practice.

Table 1. Deep learning agricultural applications: previous works applying deep learning in agricultural processes.

Detected Objects              Publications
pests                         [21–25]
weeds                         [26–30]
irrigation/drought levels     [31–36]
diseases                      [37–39]

To aid in the analysis of vegetation images, there are different computer vision tasks that can be performed using deep neural networks (NNs). One of them is image recognition, which assigns a label or class to a digital image, while object detection is the process of locating entities inside an image, usually by drawing a box to delimit each region of interest. Beyond object detection, there is semantic segmentation, in which each pixel of an image is labeled as belonging to one of a configurable set of classes. A more precise alternative is instance segmentation, in which the boundary of each object instance is drawn. This technique, unlike simple object detection, allows for the location and delimitation of objects that have irregular shapes. As the challenges increase, the complexity of the techniques also increases.
Amidst the different DL methodologies focused on disease detection, the region-based convolutional neural network (RCNN) stands out due to its capacity to extract image features. Since its emergence, it has been improved, giving rise to Fast RCNN, Faster RCNN, and Mask RCNN, appearing in [40], [41], and [42], respectively. Mask RCNN models have the property of not only being able to detect instances of objects; they are also able to delimit the area of occurrence of each instance, providing instance segmentation capabilities. These approaches have been used alone or in combination with other algorithms for disease detection. Table 2 shows the detection objectives and the accuracy achieved in works found in the literature. Canopy classification presents a challenging problem when orchard areas overlap. In such cases, traditional classification algorithms require individual plants to be labeled manually to validate the models. This drawback can be overcome using an NN to automatically extract the relevant features at the convolutional layers from just a set of examples given to the network during training. Other technologies that deal with the overlapping problem include ones based on airborne LIDAR systems, but even though they possess several advantages, such as functionality in both daytime and nighttime and the ability to be combined with other sensors, the expensive acquisition, costly data processing (in both time and computational resources), and low performance in some weather conditions [46] are major drawbacks for their implementation in real-life systems.
This paper presents a methodology based on the Mask RCNN deep learning ensemble to detect every plant cluster in the crop that originated at the same seeding point. The procedure localizes the plant objects and performs an instance segmentation of an image used as input, which represents a segmented portion of a large orthomosaic of the crop area under study. The present work goes beyond performing instance segmentation, as the technique described here also estimates the health state of each plant cluster based on visual phenotypic features, implicitly extracted by the Mask RCNN model. Unlike general-purpose object detectors that use regular CNNs, which aim to detect several types of unrelated objects under different background conditions, the model presented here is fine-tuned to detect plants and their discerning features present in a crop field environment. This level of specialization is intended not only to detect vegetation objects, but also to distinguish some of the features and visible traits that are correlated with the plants' health.
In the proposed model, only RGB aerial images were used. To verify the consistency of our results, we compare the method with results obtained from large-scale mean shift segmentation (LSMSS) composed with spatial KMeans [47,48] and random forest classification over local image filters [49], both of which are methods that use predefined features. Additionally, spectral reflectance signatures for the plant samples were collected to ensure that the defined classes can be characterized not only by the plants' morphology and phenotypic traits, but also by their reflectance spectra [50], as this latter property is the basis of many plant health indices [51], including the widely accepted normalized differential vegetation index (NDVI) [52]. The introduced methodology can be implemented as an automated pipeline to quantitatively determine the health state of C. annuum crops in a precise, georeferenced manner.

Study Area
The experiments for this research were conducted on a C. annuum crop field located in a rectangular region delimited by the coordinates (2610818N, 11429693W) and (2610758N, 11429543W) at 2205 m a.s.l., in the municipality of Morelos, Zacatecas, Mexico. This specific portion of the field was chosen in the interest of having a fair number of samples of plants with variable health conditions to define the comparative plant health classes. The study region has a Cwb climate according to the Köppen–Geiger categories [53], with an average annual temperature of 17 °C, minimum and maximum temperatures around 3 °C and 30 °C, respectively, and annual rainfall of 510 mm [54].

Data Collection and Preprocessing
Airborne images were captured using multirotor UAVs. The first, a Phantom III Standard ® (SZ DJI Technology Co., Ltd., Shenzhen, China), was equipped with an RGB camera with an image resolution of 12 Mpx. The RGB image dataset was taken at 15 m above the ground, generating pictures with a resolution of 5.1 cm/px. To compare the deep learning method proposed here with standard techniques of vegetation health assessment of crops from airborne images, a second multirotor equipped with a Sequoia Parrot ® (Parrot SA, Paris, France) multispectral camera was flown over the same area at the same height in order to generate an NDVI map. The multispectral camera captured reflectance levels at the near infrared (NIR) band, with center at 790 nm and 40 nm width; the red edge (REG) band, centered at 735 nm with 10 nm width; the red (RED) band, centered at 660 nm with 40 nm width; and the green (GRE) band, with center at 550 nm and 40 nm width. The Sequoia Parrot has a resolution of 1.2 Mpx for each of the individual spectral channels, which gave multispectral images with a resolution of 11 cm/px. The images were post-processed with the Pix4DMapper ® (Pix4D, Lucerne, Switzerland) software, which performed orthogonal rectification, pose estimation, and vignetting and radiometric corrections for each picture, and generated the RGB and multispectral orthomosaics. Figure 1 shows the post-processed RGB orthomosaic obtained with the unmanned aerial vehicle (UAV) survey of the study area, over the corresponding satellite image of the same region as background.
In addition to the UAV imagery collected, the health state of every plant cluster located along two plowing rows of the cropland was also registered, and their locations were recorded relative to a set of ground control points (GCPs) placed at 3 m intervals. Five plant health condition classes, labeled from HC1 up to HC5, were established according to observable combinations of visible traits associated with plant health status. The attributes considered were plant height (cm), foliar area (m²), and the percentage of canopy area presenting disease symptoms (leaf spots, chlorosis, curly and wilting leaves). The characterization mentioned above was based on the average SAD maps of individual leaves. The threshold values for every tracked trait are shown in Figure 2, where the ranges of the maximum plant height, maximum canopy area, and percentage of damaged leaves are depicted as radial bar graphs for each of the HC1, . . . , HC5 classes. For the task of collecting spectral signatures of plant samples, a custom portable spectrometer based on the C12880MA (Hamamatsu Photonics, Shizuoka, Japan) sensor was used. The C12880MA sensor is capable of detecting 288 spectral bands with centers at intervals of about 2 nm, in a wavelength range between 330 nm and 890 nm. The spectrometer was connected to a smartphone with GPS capabilities through the on-the-go (OTG) universal serial bus (USB) peripheral port [55]. An Arduino (Arduino.cc, Somerville, MA, USA) microcontroller was used to convert the inter-integrated circuit (I2C) [56] bus signal from the sensor to USB serialized signal levels. A user interface was developed in the Java language with the Android Development Studio ® (Google Inc., Mountain View, CA, USA) tools. In addition to recording and transmitting sensor signals, the application also took charge of attaching geotags to the spectral data, and of performing wavelength calibrations according to factory parameters [57].
Reflectance variations due to different illumination sources and intensities were also compensated by the program. The reference used for reflectance adjustments was a Micasense ® (AgEagle Sensor Systems Inc., Wichita, KS, USA) calibration panel, for which reflectance values in the range 300-900 nm at 2 nm intervals were provided by the manufacturer. The spectral signatures of 20 samples for every defined class were taken; the average spectrum for each class, smoothed with a third-degree polynomial, is presented in Figure 3. The intervals corresponding to the four channels of the multispectral camera are marked on the x-axis in order to highlight the reflectance features relevant to the health classes that can be captured by the multispectral camera from which the comparative NDVI map was built. Manual annotations of plant health classes and canopy area pixel masks were created for the plant clusters appearing in each of the training and validation images. Labels and instance masks were outlined according to georeferenced data collected on-field. An example of image annotation masks for one of the aerial images from the crop can be seen in Figure 4a,b. Note that pixels at the edges of very small plant twigs, whose colors were heavily mixed with the background soil color, were not considered to be part of the plant's canopy; otherwise, they would have induced noise in the reflectance features that differentiate health classes. A total of 60 images were annotated; 40 of them were used for training the model, and 20 were used for validation and to estimate the Mask RCNN's hyperparameters. The annotated objects added up to 2139 instances, for which the respective contours, delimited by polygonal boundaries, were registered, thus providing a representative set of object instances of each health level to properly train the Mask RCNN ensemble.
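The third-degree polynomial smoothing applied to the per-class average spectra can be sketched as below; the reflectance curve is a synthetic placeholder, since the measured signatures appear only in Figure 3, and a global least-squares cubic fit is the simplest reading of the procedure (a Savitzky-Golay filter with cubic local polynomials would be another common interpretation).

```python
import numpy as np

# Wavelength grid similar to the C12880MA sensor: ~2 nm steps, 330-890 nm.
wavelengths = np.arange(330.0, 890.0, 2.0)

# Synthetic stand-in for a class-average reflectance signature.
reflectance = 0.3 + 0.2 * np.sin(wavelengths / 120.0)

# Least-squares fit of a third-degree polynomial, evaluated on the same grid.
coeffs = np.polyfit(wavelengths, reflectance, deg=3)
smoothed = np.polyval(coeffs, wavelengths)
```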

Hardware and Software
Model implementation of the neural network ensemble was performed using the Detectron2 ® [58] deep learning framework, which has been released as open source [59] by Meta ® AI Research. The original source code for Mask RCNN was made publicly available at the Detectron repository in [60], based on the Caffe [61] deep learning framework. Currently, a second revision of the framework based on Pytorch is available at the Detectron2 repository [59], which is the version of the code used here. Detectron2 acts as a wrapper for several Pytorch models, allowing them to interact together to form ensembles of deep neural networks. Training, validation, and testing of the model were executed using a workstation with CPU and GPU compute capabilities featuring an Intel Core ® i7-9700FV (Intel Co., Santa Clara, CA, USA) CPU with 8 physical cores at 4.7 GHz and 32 GB of RAM. The GPU employed was a Nvidia GeForce RTX 3090 ® (Nvidia Co., Santa Clara, CA, USA) with 24 GB of video memory, with drivers configured with support for CUDA version 11.4. The Pytorch version used was 1.8. The main Python scripts for data management and processing were developed in the Visual Studio Code ® (Microsoft, Redmond, WA, USA) integrated development environment (IDE). Georeferencing of the generated maps for use in standard GIS tools with a satellite base layer was carried out with the aid of QGIS [62] version 3.26.1, using the pseudo-Mercator projection [63]. Boundary polygons to generate instance training masks were made with the Computer Vision Annotation Tool [64]. In order to compare the instance segmentation and plant classification performed in this work with alternative traditional methods, we implemented a random forest classifier on local features (RFLF) extracted from the same images used to train the Mask RCNN ensemble.
The local features used for the RFLF segmentation were extracted using the Laplacian-of-Gaussian detector [65], the Harris and Hessian affine regions [66], and a Gaussian blur filter to deal with features at different scales [67]. This was achieved using the Scikit-learn [68] libraries written in Python with the Scikit-image [69] extensions for digital image processing. In addition to the RFLF reference segmentation, the LSMSS method using spatial KMeans was also compared with the technique presented in this research using the same validation set. The LSMSS version used here was taken from the Orfeo Toolbox [70] libraries; the spatial KMeans implementation was programmed as a Python script based on Scikit-learn. The details for achieving object detection with LSMSS and KMeans are described in [48].

Data Augmentation and Class Balance
The original training image dataset was augmented by applying random transformations consisting of rotations of 90°, 180°, and 270°, and color saturation, contrast, and brightness modifications in ranges between −20% and +20%. In this way, we provided a continuous stream of images for training, preventing overfitting problems at early training stages. This aspect is important for our model, as it provides a mechanism of adaptation to images taken under different lighting conditions and variations in camera settings. An example of such augmented images can be seen in Figure 4c, where synthetically generated transformations simulating such conditions have been applied. Batch normalization was avoided, as in most of the training images, several plant instances belonging to different classes appear at different positions, and the same is expected for the input images when the ensemble is operating at prediction stages. In addition to the random image augmentation, the stream of images was also modified by repeat factor sampling (RFS), introduced in [71]. Mechanisms such as RFS are used to counteract the data imbalance present in training samples. Specifically, RFS consists of oversampling images that contain objects belonging to the less frequent classes by assigning to each class c a repeat factor r_c = max(1, √(t/f_c)), where f_c is the fraction of images that contain at least one object belonging to class c.
Considering that training images can contain several objects in different class categories, the repeat factor for an individual image i is set to r_i = max_{c∈i} r_c, the maximum repeat factor over the classes present in that image. We used the parameter t = 0.001, as this resampling factor provides acceptable results for images containing multiple objects of different classes [71]. The distribution of class objects in our training and validation data is shown in Table 3. The GCP class was not used to train the Mask RCNN ensemble; however, it was detected using a grayscale-level histogram normalization and thresholding method for geolocation and photogrammetry purposes.
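Repeat factor sampling as described above can be sketched in a few lines of Python; the toy image list is hypothetical, standing in for the real per-image annotations.

```python
import math

t = 0.001  # oversampling threshold used in this work

# Hypothetical toy dataset: classes present in each training image.
images = [
    ["HC5", "HC5", "HC4"],
    ["HC5", "HC3"],
    ["HC5", "HC1"],
]

n = len(images)

# f_c: fraction of images containing at least one object of class c.
f = {}
for classes in images:
    for c in set(classes):
        f[c] = f.get(c, 0.0) + 1.0 / n

# Class-level repeat factor r_c = max(1, sqrt(t / f_c)); with t = 0.001 only
# classes present in fewer than 0.1% of images are oversampled, e.g. a class
# with f_c = 0.0004 would get r_c = sqrt(0.001 / 0.0004) ≈ 1.58.
r_class = {c: max(1.0, math.sqrt(t / fc)) for c, fc in f.items()}

# Image-level repeat factor r_i = max over the classes present in image i.
r_image = [max(r_class[c] for c in set(classes)) for classes in images]
```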

Deep Learning Ensemble Model
The multi-stage deep neural network Mask RCNN [42] is employed in this research as a feature extractor, classifier, and instance detector. This architecture consists of an ensemble derived from the region-based convolutional neural network (RCNN) [72]; a simplified schematic diagram of Mask RCNN is shown in Figure 5. In the Mask RCNN model, a backbone network is used as a feature extractor; Detectron2 allows several backbones to be used for this purpose. When implemented in conjunction with a feature pyramid network (FPN) [73], the outputs C2, . . . , C5 of the last residual blocks of the backbone network are linked to a series of 1 × 1 convolutions that reduce the number of feature channels. These convolutions are combined with the upsampled feature map of the previous stage to compose each level of the map pyramid. Maps at every scale in the set {1/4, 1/8, 1/16, 1/32} are identified with the P2, . . . , P5 labels. A final feature map P6, scaled at 1/64, is also taken at the end of the backbone network. The output of the stem layers of the residual network is ignored in favor of reducing the memory footprint. Note that the FPN is not a necessary component of RCNNs, and the original Faster RCNN, the network on which Mask RCNN is based, does not implement it [41]. However, the FPN improves detection and training speeds while reasonably maintaining accuracy [42]. The feature maps are fed as inputs to the region proposal network (RPN) [40,41] and to the region of interest (ROI) heads, which are primarily composed of pooling and convolutional layers. The RPN component includes an anchor generator that produces predetermined locations and shapes for the initial proposals, returning the scores of each candidate region. The RPN output is a set of rectangular boxes that are candidates for containing an object, with their respective scores along with class logits.
Based on the feature maps, a box regressor, and a softmax discriminator, the best candidate regions are given as inputs to the ROI heads module, whose main functions are to crop and pool regions taken from the proposals with higher objectness scores. These proposals have been previously relocated by an extra step called ROI alignment [42]. Final predictions for masks, locations, and classes for each detected object are determined at this stage. For the particular instance segmentation problem investigated here, the backbone network used was a ResNet 101 model [72], which was previously initialized with pretrained weights and biases from three COCO epochs [74], as a way of having an initial state that included some connections related to semantic features involved in image classification tasks. This approach has been documented to speed up the training process by introducing some transfer learning operations [75]. The backbone choice was based on the fact that large residual networks are better at detecting the fine-grained features appearing in small objects [76]. When working with plant crop images, one problem that arises in the segmentation of contiguous plants is that, in some cases, it is difficult to discern the boundaries of neighboring plant clusters. We therefore tuned the model ensemble to learn the plants' morphology and phenology from the training examples in order to determine the borders of each plant cluster. The mechanism implemented to achieve this particular task is to give a set of fixed sizes and shapes for the predefined regions to the anchor generator at the RPN component of the ensemble. The set was adjusted to cover areas in the interval spanning the average foliar canopy size of each plant health class, plus or minus two standard deviations; aspect ratios for the anchors were shaped in the same way.
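The anchor shaping described above can be sketched as follows; the per-class canopy statistics are placeholder numbers, as the real values would be computed from the annotated training masks.

```python
import math

# Hypothetical (mean, std) of canopy area in px^2 for a few health classes.
canopy_stats = {
    "HC1": (400.0, 120.0),
    "HC3": (2500.0, 600.0),
    "HC5": (9000.0, 1800.0),
}

# Anchor areas spanning mean - 2*std .. mean + 2*std for each class,
# expressed as the side lengths of square anchors for the RPN generator.
anchor_sides = sorted(
    round(math.sqrt(max(area, 1.0)))
    for mean, std in canopy_stats.values()
    for area in (mean - 2.0 * std, mean, mean + 2.0 * std)
)
```

Aspect ratios would be derived analogously from the width/height statistics of the annotated bounding boxes.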
The loss function L for Mask RCNN models is composed of three elements:

L = L_cls + L_box + L_mask,

where L_cls = −log(p_u) is the loss function for the label classification, with p_u representing the softmax probability for a ground-truth class labeled as u. For the case of pixel masks, L_mask is the average binary cross-entropy. L_box is the loss function that evaluates the precision of the location of the bounding boxes containing detected objects. For a class u with a ground-truth box v defined by the values v = (v_x, v_y, v_w, v_h), in which v_x and v_y are the coordinates of the upper left corner, and v_w, v_h correspond to its width and height, the regression loss for a predicted box t^u = (t^u_x, t^u_y, t^u_w, t^u_h) uses the following loss function [40]:

L_box(t^u, v) = Σ_{i ∈ {x,y,w,h}} smooth_L1(t^u_i − v_i),

where

smooth_L1(x) = 0.5 x² if |x| < 1, and |x| − 0.5 otherwise.

A custom script in the Python programming language was written to modify the default data loader of the Detectron2 libraries, with the intention of providing an input image with constant resolution to the first layer of the ensemble. To apply this script, it is necessary for the images to have units and values for the resolution tags in their exchangeable image file format (EXIF) [77] header. This represents no technical limitation, as most of the images gathered for agricultural studies usually have them recorded [51]. Because most aerial imagery used in precision agriculture comes in the form of georeferenced orthomosaics, the loading script feeds the input images in a mosaicking way, similar to the procedure described in [78]. To this end, orthomosaics are scanned by a sliding tile of fixed size. An overlap of size s is maintained between the tiles that cover the orthomosaic; the computation of the value s and the locations of the tiles, which provide uniform coverage of the entire area at a constant resolution for the input images, is performed by the custom data loader script.
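The computation of the tile locations and overlap can be sketched as below, assuming square 512 px tiles consistent with the fixed input size used by the ensemble; the function and parameter names are ours, not those of the actual data loader script.

```python
import math

def tile_offsets(extent, tile=512, min_overlap=64):
    """Evenly spaced tile positions covering `extent` pixels with a fixed
    tile size and at least `min_overlap` pixels of overlap between tiles."""
    if extent <= tile:
        return [0]
    # Smallest tile count whose uniform stride keeps the required overlap.
    k = math.ceil((extent - tile) / (tile - min_overlap)) + 1
    stride = (extent - tile) / (k - 1)
    return [round(i * stride) for i in range(k)]

# One axis of a 16,384 px wide orthomosaic, as used in this work:
cols = tile_offsets(16384)
```

The same function is applied independently to the row axis, and the Cartesian product of the two offset lists gives the tile grid.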
Regardless of the dimensions of the analyzed image, the inputs passed to the Mask RCNN model are tensors with fixed dimensions of 1 × 512 × 512 × 3. Repeated instances from partial detections occasionally appear at the tile borders. These instances are removed using a non-maximum suppression (NMS) criterion with intersection over union (IoU) thresholds of 0.3 for intraclass objects and 0.7 for interclass objects. These values were based on the default anchor NMS values of the RPN, as we wanted to preserve similar thresholding behavior for object filtering in the external process. A second criterion is applied to suppress duplicated and partial instances by setting limits on canopy area for each class based on the average values presented in Figure 2b. The mechanism for scanning large orthomosaics by local detections on covering tiles explained above is depicted in Figure 6. Note that the phenotypic threshold filtering is only applied to detections of plants that do not appear completely on a tile, or that are duplicated because they are entirely located in the overlapping regions; all other instance classifications are left as output by the Mask RCNN. The proposed processing pipeline allows the Mask RCNN to operate efficiently, as the NN model takes a fixed-size input in the form of a multichannel tensor. Otherwise, previous image scaling would have been needed for images of arbitrary sizes. Note that if only image scaling were used to adjust the original input size, and the difference in scales were significant, as is the case for large orthomosaics, the recognition performance of the Mask RCNN would be heavily affected. Optimization of weights and biases in the training stage was executed using stochastic gradient descent (SGD) [79] with a momentum value of m = 0.9. A multi-step learning rate, starting at lr = 0.001 with discrete exponential adjustments by a factor γ = 0.5, was applied every 2500 epochs.
Image batches consisted of 16 augmented images, with 512 ROIs being analyzed by the solver for each image.
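The removal of duplicated detections at tile borders described above can be illustrated with a minimal IoU-based NMS routine; this is our own simplified pure-Python sketch, not the Detectron2 implementation, using the 0.3 intraclass threshold as default.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, threshold=0.3):
    """Keep the highest-scoring boxes; drop any box whose IoU with an
    already-kept box reaches the threshold. `detections` is a list of
    ((x1, y1, x2, y2), score) pairs."""
    kept = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        if all(iou(box, k) < threshold for k, _ in kept):
            kept.append((box, score))
    return kept
```

For example, two tile-border duplicates of the same plant with IoU ≈ 0.68 collapse to the single higher-scoring detection, while disjoint plants are untouched.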

Results
Using the equipment described in the previous section, it took 7.71 h to train the ensemble up to 25,000 epochs. The prediction time needed to process a large orthomosaic of 16,384 × 3584 pixels, covered with 360 overlapping tiles, was, on average, 56 s; this time includes the overhead of removing partial detections at the overlapping zones, dividing the input orthomosaic, and reconstructing it again with the predicted outputs. Note that the communication of tile tensors and prediction results between GPU and CPU memory also affects the total time taken to obtain a final output orthomosaic. The loss function behavior during the training epochs is shown in Figure 7. The evolution of the loss through the training epochs of the Mask RCNN ensemble, depicted in Figure 7, indicates that the optimization process follows a stepwise decreasing trend for the total loss L over the first 15 K steps, after which changes were less marked up to 25 K iterations of the SGD algorithm. At this point, the training was stopped to avoid overfitting, as few improvements were being recorded. To evaluate the performance of the classifier module of the Mask RCNN, we calculate the accuracy over all classes, given by

acc = (TP + TN) / (TP + TN + FP + FN),

where TP, TN, FP, and FN are, respectively, the true and false positives and negatives, using a set of IoU values in the interval (0.5, 0.95) separated by 0.05 increments, sometimes denoted as IoU ∈ [0.50:0.05:0.95]. Accuracy levels throughout the epochs, along with the proportions of FP and FN, are shown in Figure 8; the ability of the network to detect objects of interest is represented in this figure. The confusion matrix for the detection of each object in the validation set is shown in Figure 9.
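The accuracy metric and IoU sweep described above amount to the following trivial sketch; the detection counts are hypothetical, for illustration only.

```python
# IoU thresholds IoU ∈ [0.50:0.05:0.95]: 0.50, 0.55, ..., 0.95.
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def accuracy(tp, tn, fp, fn):
    # acc = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts at one IoU threshold:
acc = accuracy(tp=180, tn=10, fp=7, fn=3)
```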
To evaluate the precision of the masks and the locations of the instances, we employed the average precision (AP) metric as defined in the PASCAL visual object classes challenge [80], by taking, for each validation image, metrics of precision and recall, defined as

precision = TP / (TP + FP),   recall = TP / (TP + FN).

Then, the measurements are sorted in monotonically descending order, defining the AP as the area under the precision-recall curve

AP = ∫₀¹ p(r) dr,

with p(r) representing the precision p as a function of the recall r. To avoid approximations introduced by the use of discrete data when estimating the AP from the equation above, the precision for a given recall r is set to the maximum precision over every recall r′ ≥ r. The values of AP obtained for the predictions of the validation set, for both the boxes that locate each instance and the mask regions generated by the network, are shown in Figure 10. Figure 10 shows that inferences are affected to different degrees for each of HC1, . . . , HC5, with the HC3, HC4, and HC5 APs being only slightly affected at the pixel labeling stage that generates object masks, while HC1 and HC2 reach significantly lower AP values. For the case of object localization estimated by bounding boxes, only the classes HC1 and HC2 are affected. This phenomenon has a reduced effect on overall classification accuracy, where a value of acc = 0.945 is reached at the end of the training epochs, as can be seen in Figure 8. Final AP values were smaller for the low health level classes. The reason for this is that plants belonging to these groups have a smaller foliar area, and their shapes show, in many cases, branch-like structures; therefore, pixels belonging to these objects are mixed with background pixels in a larger proportion than for the other classes.
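The AP computation with interpolated precision described above can be sketched as follows; in practice, the true/false-positive flags and ground-truth counts would come from matching predictions to annotations at a given IoU threshold.

```python
def average_precision(scores, is_tp, n_ground_truth):
    """PASCAL-style AP: build the precision-recall curve over detections
    sorted by confidence, replace p(r) by the maximum precision at any
    recall r' >= r, and integrate the resulting step curve over recall."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolation: make precision monotonically non-increasing in recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall step curve.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

A detector that finds both ground-truth objects with no false positives reaches AP = 1.0; a high-confidence false positive before the only true positive halves it.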
To illustrate the tile-level outputs generated by the Mask RCNN ensemble and the reference methods, we present the instance predictions, as well as the generated masks, obtained for one of the tile samples in Figure 11a. Figure 11b shows the segmentation and pixel labeling performed by the RFLF classifier. Note that RFLF cannot distinguish between the background soil and the HC1 class. This might be due to the fact that plants with a lower health condition are mainly composed of dry matter, which, according to Figure 3, exhibits lower reflectance values at the NIR wavelengths than the other categories. RFLF is also unable to locate objects, as this algorithm was not designed for such a function. One common strategy to delimit objects is watershed segmentation [81]; however, using the same input data that were used to train the Mask RCNN model, the watershed segmentation does not match the RFLF output, as can be seen in Figure 11c. Neither of these methods provides a satisfactory solution for the plant location problem. On the other hand, LSMSS can detect and classify plant leaves and their health condition better than RFLF, as shown in Figure 11d. However, object detection for LSMSS has to be performed statistically by grouping adjacent segments originating at a centroid generated by a KMeans spatial partition, and then the condition of the segments is averaged over a predefined radius. NDVI provides a good segmentation of vegetation-covered areas when thresholding is applied, but boundaries between plants cannot be determined in this way, as can be concluded by examining Figure 11f. A larger section of the segmented orthomosaic obtained with Mask RCNN using the tiling procedure is shown in Figure 12. Figures 13 and 14 show, respectively, the RFLF and LSMSS predictions over the same orthomosaic. Figure 15 presents the NDVI map obtained with the data collected with the multispectral camera. The five plant groups HC1, . . .
, HC5 can be accurately detected by the Mask RCNN procedure, as can be seen in Figures 11 and 12. The confusion matrix in Figure 9 shows that, for the validation set, only 5 plants in the category HC1 were labeled as HC2, and 12 plants in the group HC2 were assigned to the HC1 type. All other plants were classified exactly in their corresponding categories. On the other hand, the NDVI maps show values greater than 0.4 for healthy plants, values between 0.3 and 0.4 for unhealthy plants, and values lower than 0.3 for non-vegetation objects, as shown in Figures 11 and 15. Thus, NDVI alone can only determine two plant health categories when applied to the dataset gathered for this research. Unlike the process described here, NDVI cannot be used to infer the plant boundaries of overlapping canopies, or to locate instances of individual plants. This is because the indices extracted by NDVI do not consider phenotypic traits of the plants, which are a key factor used to determine the health state of vegetation samples, and to calculate their shape, location, and extension, as was successfully performed in this work. The LSMSS can also distinguish all the HC1, . . . , HC5 classes when it is post-processed with the spatial KMeans algorithm, which identifies centroids of Voronoi regions [82] that determine the object locations. The RFLF method cannot differentiate HC1 plants from background soil, and as it only performs semantic segmentation, it tends to assign a different class to the leaves at the plant's boundary, disregarding the category assigned to the center of mass of the objects. The computational complexity of the Mask RCNN ensemble performing predictions on an image depends on the individual layers that execute the forward propagation operation. Mask RCNNs are primarily composed of convolutional, fully connected, and pooling layers, all of which are known to have a computational complexity of O(n) [83], with n being the number of input pixels.
The proposed tiling process also has a computational complexity of O(n), since in the implementation of this paper, only matrix operations are applied to the orthomosaic to perform this task. Table 4 compares the segmentation obtained by the Mask RCNN, RFLF, LSMSS, and NDVI methods as described in this work. All were applied to the same input orthomosaic in an end-to-end fashion. The time data shown in this table were averaged over 10 runs for each method; the figures correspond to wall time. The program sections executed on the CPU are parallelizable for all methods; multiprocess programming in these cases was implemented with Python's native multiprocessing module. Eight CPU processes and an input orthomosaic of 16,384 × 3584 pixels were used in all cases.

Table 5 shows the number of plant clusters of each class identified and segmented by the strategy described here, applied to the entire orthomosaic of the study area. The average scores presented correspond to the objectness obtained at the last layer of the network for all instances belonging to the same class. The canopy-covered areas for each class, and their percentage relative to the total vegetation objects detected, are also presented.
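As an illustration of this O(n) tiling step, the partition of an orthomosaic into fixed-size tiles can be sketched with plain array reshapes, assuming the orthomosaic is held in memory as a NumPy array. The tile size, zero-padding policy, and function name below are illustrative, not the exact implementation used in this work:

```python
import numpy as np

def tile_orthomosaic(ortho, tile_h, tile_w):
    """Split an H x W x C orthomosaic into fixed-size tiles using only
    reshape/transpose operations, which are O(n) in the pixel count.
    Edges are zero-padded so H and W become multiples of the tile size."""
    h, w, c = ortho.shape
    ph = (-h) % tile_h                     # rows of padding needed
    pw = (-w) % tile_w                     # columns of padding needed
    padded = np.pad(ortho, ((0, ph), (0, pw), (0, 0)))
    H, W = padded.shape[:2]
    tiles = (padded
             .reshape(H // tile_h, tile_h, W // tile_w, tile_w, c)
             .transpose(0, 2, 1, 3, 4)     # group the two tile-grid axes
             .reshape(-1, tile_h, tile_w, c))
    return tiles

# Small synthetic example; a 16,384 x 3584 orthomosaic cut into
# 1024 x 1024 tiles would yield 16 * 4 = 64 tiles the same way.
ortho = np.arange(8 * 6 * 3).reshape(8, 6, 3)
tiles = tile_orthomosaic(ortho, 4, 3)      # 4 tiles of 4 x 3 pixels
```

The resulting tile batch can then be distributed across worker processes (e.g., with `multiprocessing.Pool`) before the per-tile inferences are stitched back into the georeferenced mosaic.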

Discussion
The plant health classes defined in Section 2 are consistent with SAD maps for leaf samples collected in-field, and they are in accordance with the average spectral reflectance signatures of each category. In Figure 3, the differences in the average reflectance signatures of the health categories can be distinguished even in the visible GRE and RED bands. As expected, the signatures are much more easily differentiated in the NIR band, whose comparison with the RED band is quantitatively expressed by the NDVI. The behavior of the signatures in the REG band shows that the spectra of healthy plants have a more accentuated slope than those of unhealthy plants in the same region; therefore, the differences between healthy and unhealthy plants are also exposed in this spectral transition region.
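The NDVI referred to above is simply the normalized difference of the NIR and RED reflectances. A minimal sketch of its per-pixel computation, with illustrative (uncalibrated) thresholds for a coarse soil/unhealthy/healthy labeling, could look as follows; function names and threshold values are assumptions for this example:

```python
import numpy as np

def ndvi_map(nir, red, eps=1e-8):
    """Per-pixel NDVI = (NIR - RED) / (NIR + RED); values lie in [-1, 1]."""
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)  # eps avoids division by zero

def threshold_health(ndvi, healthy=0.4, vegetation=0.3):
    """Coarse 3-way labeling: 2 = healthy plant, 1 = unhealthy plant,
    0 = soil/other. Thresholds are illustrative; tune per sensor/scene."""
    labels = np.zeros(ndvi.shape, dtype=np.uint8)
    labels[ndvi >= vegetation] = 1
    labels[ndvi >= healthy] = 2
    return labels

# Toy 2x2 reflectance patch: two vegetation-like and two soil-like pixels.
nir = np.array([[0.80, 0.35], [0.60, 0.20]])
red = np.array([[0.20, 0.30], [0.25, 0.20]])
labels = threshold_health(ndvi_map(nir, red))
```

As the text notes, such a per-pixel index can separate vegetation from soil, but it carries no notion of object instances or canopy boundaries.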
Many image segmentation algorithms based on classic methods such as thresholding, dilation and erosion perform only semantic segmentation, and in many cases, the algorithm parameters and the selection of the features employed need to be tuned for specific images [84]. The RFLF method, which is widely used to analyze images obtained from multispectral sensors, is one such example; we used it here for comparison with the proposed mechanism. The main disadvantage of the RFLF approach is that it can only perform semantic segmentation, and even when its results are post-processed with a watershed detector, as described in the previous section, the locations and shapes produced by these techniques do not accurately match the plants' morphology, as shown in Figure 11. On the other hand, the approach introduced here allows us to detect individual instances of C. annuum plants under different contexts, given the augmented images fed to the network in the training stage.
Techniques such as large-scale mean shift segmentation (LSMSS) [47] and object-based image analysis (OBIA) [85] are among the algorithms that can also perform instance segmentation on images using manually engineered feature extraction. Specifically, instance segmentation using OBIA, implemented with the spatial KMeans algorithm, was compared with the proposed pipeline. Although LSMSS + KMeans requires no prior training, as it is based on unsupervised methods, its prediction times are rather long, according to Table 4. In the present work, using the Mask RCNN in a tiled fashion, the processing of a 58.7 Mpx orthomosaic is performed in just 56 s, which represents an improvement of two orders of magnitude in execution speed when analyzing large orthomosaics.
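The spatial-KMeans grouping applied on top of LSMSS can be sketched as follows. This is a hypothetical stand-in for the actual implementation, written as a plain Lloyd's iteration in NumPy; it assumes only that each LSMSS segment contributes a 2D centroid and a per-segment health score, and the function names are invented for illustration:

```python
import numpy as np

def group_segments(centroids, k, iters=20, seed=0):
    """Group segment centroids (N x 2 array of x, y coordinates) into k
    spatial clusters with a plain Lloyd's-iteration KMeans. Each cluster
    approximates one plant object."""
    rng = np.random.default_rng(seed)
    centers = centroids[rng.choice(len(centroids), k, replace=False)]
    for _ in range(iters):
        # distance of every centroid to every cluster center
        d = np.linalg.norm(centroids[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = centroids[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return assign, centers

def object_health(assign, health, k):
    """Average the per-segment health scores of each cluster's members
    into one health value per detected object."""
    return np.array([health[assign == j].mean() for j in range(k)])
```

With well-separated plants, the cluster centers converge to the plant locations, and the averaged segment scores give the per-object health condition described in the text.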
The input for the Mask RCNN classifier used here only needs to be in RGB image format, which is standard for many commercial UAVs used for crop monitoring. Among the advantages of working with RGB images is that modern RGB cameras designed for UAVs are cheaper, have higher spatial resolution and are lighter than their multispectral counterparts. Multispectral cameras allow several vegetation indices to be established precisely by applying simple operations; nonetheless, in this study, the lack of multispectral features is compensated by the phenotypic traits learned by the Mask RCNN ensemble for the accurate estimation of vegetation health, which additionally provides instance segmentation capabilities. These properties are very useful in crop-monitoring tasks, as presented in Table 5. The introduced pipeline is capable of counting the total number of plants in a crop, detecting the health state of each plant, estimating the foliar area covered by the plants of each category, and locating the pathological cases accurately, quickly, and in a georeferenced manner.
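The per-class foliar-area figures of the kind reported in Table 5 reduce to counting mask pixels and scaling by the squared ground sampling distance (GSD). A minimal sketch, with hypothetical function names and an assumed GSD, assuming each detected instance yields a boolean mask and a health-class label:

```python
import numpy as np

def foliar_area_m2(mask, gsd_m):
    """Canopy area of one instance mask: pixel count times squared GSD
    (metres per pixel)."""
    return float(np.count_nonzero(mask)) * gsd_m ** 2

def class_area_report(masks, labels, gsd_m):
    """Total canopy area per health class, plus its percentage share of
    all detected vegetation."""
    totals = {}
    for mask, lab in zip(masks, labels):
        totals[lab] = totals.get(lab, 0.0) + foliar_area_m2(mask, gsd_m)
    grand = sum(totals.values())
    return {lab: (area, 100.0 * area / grand) for lab, area in totals.items()}

# Two toy 10x10 masks of 25 pixels each, at an assumed GSD of 2 cm/px.
m1 = np.zeros((10, 10), dtype=bool); m1[:5, :5] = True
m2 = np.zeros((10, 10), dtype=bool); m2[5:, 5:] = True
report = class_area_report([m1, m2], ["HC1", "HC5"], gsd_m=0.02)
```

Because the masks live in a georeferenced orthomosaic, the same pixel coordinates also give the field locations of the pathological cases.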

Conclusions
In this work, we developed a sequence of steps that allows efficient processing of aerial imagery for the task of monitoring C. annuum crops. By training a Mask RCNN deep learning model with properly annotated imagery, comprising vegetation health classes and polygonal boundaries of individual plants, it was possible to provide high-accuracy automated detection of up to five health classes of vegetation, and to determine their locations and shapes with acceptable precision. The classes defined for training were based on the spectral signatures and phenotypic features of the vegetation under study. The model was fed with fixed tiled inputs representing a partition of larger images. In this way, the Mask RCNN performed reasonably well, without showing scaling issues when dealing with large orthomosaics representing vegetation fields. Comparison with methods such as RFLF and LSMSS shows the advantages of the proposed pipeline, which uses a Mask RCNN ensemble in a tiled way. Such improvements arise from the ability to perform instance segmentation on large orthomosaics with low execution times: once the model was trained, the execution time taken to predict the classes, locations, and plant shapes was less than one minute when examining an orthomosaic composed of multiple images. For the goal of determining the health state of C. annuum crops, the inferences obtained by the model proposed here using RGB imagery give more detailed and informative results than the standard NDVI method using multispectral instruments, and also outperform models using a set of predefined features, such as RFLF and LSMSS.
The methodology presented here can be easily adapted to other crops, an adaptation that can be explored in future work. It therefore represents a viable alternative for automated crop monitoring using RGB airborne images.
Author Contributions: All authors have contributed equally to this work. All authors have read and agreed to the published version of the manuscript.