Segmentation of Multiple Tree Leaves Pictures with Natural Backgrounds using Deep Learning for Image-Based Agriculture Applications

Abstract: The crop water stress index (CWSI) is one of the parameters measured in deficit irrigation and it is obtained from the crop canopy temperature. However, image segmentation is required to exclude non-leaf regions from the temperature measurement, as this is critical to obtain the temperature values for the calculation of the CWSI. To this end, two image-segmentation models based on support vector machine (SVM) and deep learning have been studied in this article. The models have been trained with different parameters (encoder depth, optimizer, learning rate, weight decay, validation frequency and validation patience), and several indicators (accuracy, precision, recall and F1 score / dice coefficient), as well as prediction, training and data preparation times, are discussed. The F1 score results are 83.11% for the SVM model and 86.27% for the deep-learning model. More accurate results are expected for the deep-learning model by increasing the dataset, whereas the SVM model is worthwhile in terms of reduced data preparation times.


Introduction
Water is a limiting factor in arid zones and its optimal management is crucial to ensure appropriate production levels and the quality of crops. One of the techniques that has been studied and applied in recent years to reduce water consumption in agriculture is deficit irrigation [1][2][3][4], which requires measurable crop stress parameters. Midday stem water potential (SWP) is the reference method [5]. However, its measurement is very time consuming and it is not automated yet. As soil-plant-atmosphere is considered as a continuum [6], several automatically measurable variables have been proposed to be related to the SWP, so that it can be measured in an indirect way. The crop water stress index (CWSI) [7,8] is one of the most widely used indicators correlated with SWP and it is remotely measurable [9].
In order to obtain the CWSI, it is necessary to measure the crop canopy temperature. One of the methods to deal with this aim is the use of infrared radiometers (IR) [10,11]. However, when installing an IR in the field no feedback is available to know the proportion of leaves in the measuring cone. Thermography techniques are an alternative tool to estimate the crop canopy temperature [12][13][14][15][16][17]. No orientation issues arise since a graphical representation of reality is always available, so that enough information is provided to decide whether the visualized region is of interest. In either case, a similar

Materials
To generate the segmentation model, a set of pictures for training was obtained. Different species of fruit trees were the target of the research: lemon (Citrus limon), orange (Citrus sinensis), almond (Prunus dulcis), olive (Olea europaea), loquat (Eriobotrya japonica), fig (Ficus carica), cherry (Cerasus) and walnut (Juglans regia) trees. The pictures were collected by means of mobile devices (smartphones) in different locations of the city and countryside of Murcia, Spain (37°59'32.064" N, 1°7'50.356" W). The images were taken throughout winter and spring at several times of the day, from morning to afternoon, covering the range of different lighting scenarios. Several resolutions were found: 3264 × 2448, 3264 × 1836, 1600 × 1200 and 1600 × 900 pixels. The datasets consisted of 251 pictures for SVM and 121 pictures for deep learning. The data processing and model training were performed on a computer with an Intel® Core i5-8600K, 16 GB RAM and a GTX 1070 Ti GPU with 8 GB GDDR5, running MATLAB 2018b (The MathWorks, Inc., Natick, MA, USA) [35]. GIMP (GNU image manipulation program) 2.10.10 [36] was used for refining the image masks.

Methods
Two different alternatives were proposed in order to obtain the image segmentation model: a SVM model together with a clustering-based dataset generation and a Deep Learning model.

Support Vector Machine (SVM) + Clustering
The proposed SVM + Clustering method consisted of several steps, as presented in Figure 1. To build the dataset for training, image masks that discriminate leaf and non-leaf pixels were needed. Since building this is a really time-consuming task if done manually, a clustering pre-process was implemented as an alternative to facilitate the dataset generation. As it is a supervised method, input-output pairs of data are required for SVM training. The output was defined as a binary value that classifies every pixel as leaf ("1") or non-leaf ("0"). In the case of inputs, the original pictures were taken in the Red-Green-Blue (RGB) colour space. Nonetheless, other colour spaces were used, as more relevant information for segmentation can be obtained [37,38]. Thus, a hybrid colour space formed by some channels of several colour spaces was defined. The colour spaces considered were: RGB, I1I2I3, HSV and CIE (International Commission on Illumination) L*a*b*. The procedure for choosing the channels consisted of representing each of them in a grayscale picture together with its histogram for different test images. This allowed visual determination of their sensitivity to discern between the leaves and the background. Finally, the hybrid colour space consisted of the selected channels: I3 from I1I2I3, a* and b* from CIE L*a*b*, and H from HSV.
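The hybrid colour space described above can be sketched in Python (the authors worked in MATLAB; this re-implementation and the function name are ours). For one RGB pixel, the function returns the four selected channels: I3 from I1I2I3, a* and b* from CIE L*a*b* (assuming the standard sRGB/D65 conversion), and H from HSV.

```python
import colorsys

def hybrid_features(r, g, b):
    """Map one RGB pixel (components in [0, 1]) to the hybrid colour
    space [I3, a*, b*, H] used as SVM input."""
    i3 = (2 * g - r - b) / 4.0               # I3 channel of the I1I2I3 space
    h, _, _ = colorsys.rgb_to_hsv(r, g, b)   # H channel of HSV, in [0, 1]

    # a*, b* of CIE L*a*b*: sRGB -> linear RGB -> XYZ (D65) -> Lab
    def lin(c):
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = lin(r), lin(g), lin(b)
    x = (0.4124 * rl + 0.3576 * gl + 0.1805 * bl) / 0.95047
    y = 0.2126 * rl + 0.7152 * gl + 0.0722 * bl
    z = (0.0193 * rl + 0.1192 * gl + 0.9505 * bl) / 1.08883

    def f(t):
        return t ** (1 / 3) if t > 0.008856 else 7.787 * t + 16 / 116

    a_star = 500 * (f(x) - f(y))
    b_star = 200 * (f(y) - f(z))
    return [i3, a_star, b_star, h]
```

For a pure green pixel the sketch yields a strongly negative a* and a strongly positive b*, which illustrates why these channels help discriminate leaves from the background.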

Clustering
The clustering pre-process consisted of applying a k-means method to define several centroids and an index matrix from the pictures. All the original pixels were classified with an index depending on the centroid they belonged to. A number of 20 centroids was chosen arbitrarily. To build the dataset output, a graphical user interface (GUI) was developed in which every picture is loaded and the index matrix is applied as a mask. As shown in Figure 2, the interface allowed the user to manually enable or disable the pixels associated with each cluster by using checkboxes, automatically updating the picture and defining the supervised output. Aside from facilitating the creation of the dataset, the use of clustering normalized pictures of different resolutions, avoiding an unbalanced dataset. The output distribution of the resulting dataset was 53.73% leaf and 46.27% non-leaf.
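The k-means step can be sketched as a minimal pure-NumPy implementation (the paper used MATLAB's built-in clustering; the function name and defaults here are ours):

```python
import numpy as np

def kmeans_labels(pixels, k=20, iters=20, seed=0):
    """Cluster per-pixel feature vectors with Lloyd's k-means.

    pixels: (n, d) float array of colour features, one row per pixel.
    Returns (labels, centroids): a cluster index per pixel and the
    final centroid positions."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct pixels
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # assign each pixel to its nearest centroid
        dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned pixels
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

The resulting index matrix plays the role of the mask that the GUI exposes cluster by cluster.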

b. SVM Training
In order to train the SVM model, the dataset was split into 80% for the training set and 20% for the validation set, and a fixed test set was defined from seven new pictures. The validation set was used to optimize the weight decay of the SVM model. To do this, training was carried out on the training set with different weight decay values within a defined range, and the model's performance was evaluated on the validation set. The final model, trained with the optimal weight decay, was then evaluated on the test set to estimate the real model accuracy. The whole procedure was repeated 50 times, each time with a different random split of the dataset. Finally, the optimal model was selected as the one with the best test accuracy.
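The training-and-selection loop above can be sketched as follows. The paper does not specify the exact solver or the weight decay range, so this sketch uses stochastic sub-gradient descent on the hinge loss of a linear SVM with L2 regularization (weight decay); all names and hyperparameter values are illustrative.

```python
import numpy as np

def train_linear_svm(X, y, weight_decay=0.01, lr=0.1, epochs=200, seed=0):
    """Linear SVM via stochastic sub-gradient descent on the hinge loss.
    y must contain -1/+1 labels; weight_decay is the L2 regularization."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:        # margin violated
                w += lr * (y[i] * X[i] - weight_decay * w)
                b += lr * y[i]
            else:                                # only the regularizer acts
                w -= lr * weight_decay * w
    return w, b

def accuracy(w, b, X, y):
    """Fraction of correctly classified samples."""
    return float(np.mean(np.sign(X @ w + b) == y))

def best_weight_decay(X_tr, y_tr, X_val, y_val,
                      candidates=(1e-3, 1e-2, 1e-1)):
    """Model selection as in the paper: keep the weight decay value that
    performs best on the validation set (candidate range is ours)."""
    return max(candidates,
               key=lambda wd: accuracy(*train_linear_svm(X_tr, y_tr, wd),
                                       X_val, y_val))
```

Repeating this selection over 50 random train/validation splits and keeping the model with the best test accuracy reproduces the procedure described above.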

Deep Learning
Mask Generation

In the case of deep learning, the pictures were taken as model inputs and the binary images as output masks. To generate the masks, the clustering process described above for SVM was used as a first step. The accuracy of this method is limited due to the finite number of clusters and the clustering error itself. Therefore, in order to obtain ground-truth masks with sufficient accuracy, manual editing was performed with the GIMP software as a second step, as shown in Figure 3. This task consisted of correcting the erroneously classified regions of the mask after clustering by manually colouring the pixels. This procedure required a considerable amount of time.
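Once the leaf clusters have been ticked in the GUI, the binary ground-truth mask follows directly from the index matrix; a sketch (function name ours):

```python
import numpy as np

def clusters_to_mask(index_matrix, leaf_clusters):
    """Binary mask: 1 where a pixel's cluster index was marked 'leaf'."""
    return np.isin(index_matrix, list(leaf_clusters)).astype(np.uint8)
```

The manual GIMP step then only has to fix the pixels this mask gets wrong, instead of painting every picture from scratch.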

Data Augmentation
The main problem when deep-learning training is performed is the amount of data available. To enlarge the dataset and provide more information for model training, data augmentation was applied. This procedure made it possible to create new training data artificially from the original pictures. It consisted of basic geometric transformations, such as translations or rotations, applied to the pictures and their respective masks. In this article, the data augmentation applied to the original pictures consisted of: a square crop centred on the picture, two square crops originating at the two ends, three horizontal flips corresponding to the crops previously obtained and two rotations of 20° in both directions with a subsequent square crop. Thus, the dataset was enlarged eight times, resulting in a total of 968 pictures. These pictures were resized to a resolution of 480 × 480 pixels for training.
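The crop-and-flip part of this augmentation can be sketched in NumPy; the two ±20° rotations would additionally require an image library (e.g. PIL's `Image.rotate`), so they are omitted from this runnable sketch, whose function name is ours:

```python
import numpy as np

def augment(img):
    """Crop/flip variants used to enlarge the dataset.

    img: (H, W, C) landscape array with W >= H. Returns three square
    crops (centre, left end, right end) plus their horizontal flips;
    the two rotated variants from the paper are not included here."""
    h, w = img.shape[:2]
    off = (w - h) // 2
    crops = [img[:, off:off + h],       # square crop centred on the picture
             img[:, :h],                # square crop at the left end
             img[:, w - h:]]            # square crop at the right end
    flips = [c[:, ::-1] for c in crops]  # horizontal flips of the crops
    return crops + flips
```

Applying the same transformations to each ground-truth mask keeps inputs and outputs aligned, which is what allows the dataset to grow eightfold without new labelling work.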


Test Set
For testing purposes, seven masks were generated from seven different pictures. These pictures had a resolution of 3264 × 1836 pixels and were too big for the model to make predictions on directly. Therefore, they were divided into four quadrants of 1632 × 918 pixels each. This provided a larger number of pictures for prediction without losing any information from the original pictures. Finally, the test set was composed of a total of 28 pictures.
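The quadrant split used for the test pictures can be sketched as (function name ours):

```python
import numpy as np

def quadrants(img):
    """Split an image into four equal quadrants (H and W assumed even)."""
    h2, w2 = img.shape[0] // 2, img.shape[1] // 2
    return [img[:h2, :w2], img[:h2, w2:],   # top-left, top-right
            img[h2:, :w2], img[h2:, w2:]]   # bottom-left, bottom-right
```

For a 3264 × 1836 picture this yields the four 1632 × 918 tiles described above, and no pixel of the original is discarded.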

Deep-Learning Parameters and Training
A SegNet network architecture [39] was chosen for the deep-learning model. SegNet is a convolutional neural network for semantic image segmentation. The network input layer size was defined as 480 × 480. Twenty-two different models were trained employing different parameter configurations, which are detailed in Tables A1 and A2. The parameters modified were: encoder depth, optimizer, learning rate, weight decay, validation frequency and validation patience. Moreover, the image enhancement pre-process of contrast-limited adaptive histogram equalization (CLAHE) was also applied to the training images of some models. The objective of this procedure was to emphasize the contrast of the image. The CLAHE pre-process was applied to the training images in the HSV colour space, which were then converted back to RGB. The dataset (training set + validation set) was split into 90% and 10%, respectively. The distribution of pictures into training and validation sets was randomly repeated 30 times to cover different configurations of the dataset. The best model was chosen according to the test accuracy.
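The HSV round trip of the contrast pre-process can be illustrated with a simplified stand-in: the sketch below applies plain global histogram equalization to the V channel and converts back to RGB. Real CLAHE additionally tiles the image and clips the histogram (e.g. scikit-image's `equalize_adapthist`), so this is only the colour-space plumbing, with a function name of ours.

```python
import colorsys
import numpy as np

def equalize_v_channel(img):
    """img: float RGB array in [0, 1], shape (H, W, 3).

    Simplified stand-in for CLAHE: global histogram equalization of the
    HSV value channel (real CLAHE adds tiling and a clip limit), then
    conversion back to RGB, as in the paper's pre-process."""
    hsv = np.array([[colorsys.rgb_to_hsv(*px) for px in row] for row in img])
    v = hsv[..., 2]
    hist, bins = np.histogram(v, bins=256, range=(0.0, 1.0))
    cdf = hist.cumsum() / hist.sum()            # equalization mapping
    hsv[..., 2] = np.interp(v, bins[:-1], cdf)  # remap V through the CDF
    return np.array([[colorsys.hsv_to_rgb(*px) for px in row]
                     for row in hsv])
```

Working on V only stretches the brightness contrast while leaving hue and saturation, and thus the leaf colours, untouched.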

Results and Predictions on Test Pictures
In order to compare the results between the different models generated, an accuracy indicator was defined as the percentage of pixels correctly classified over the total of the image:

accuracy = (tp + tn) / (tp + tn + fp + fn) × 100

where tp = true positives, tn = true negatives, fp = false positives and fn = false negatives.

The evaluation is made not only between the best models obtained for SVM and deep learning, but also between several models generated with different parameters in both cases. For the SVM model, the results were obtained with different dataset sizes: 50, 122, 190 and 251 training images. SVM test accuracy results are presented in Table 1. For every size, the mean test accuracy of all 50 models iteratively generated with different dataset distributions was calculated, although only the mean test accuracy of the best model is presented in Table 1. Model 4, the one trained with the largest dataset, was found to have the best test accuracy (83.09%) and the best mean accuracy over all iterations of model generation (82.53%). Moreover, model 4 was determined to be the best model, presenting the highest accuracy on 64.29% of the test pictures. The best model indicator was defined as the percentage of test pictures that present the best result with each model. This percentage also holds when results up to 1% below the best are counted in favour of model 4: quantifying this shows that in the cases where model 4 is not the best, the differences are small, its accuracy being the highest, or at most 1% lower, on 89.29% of the test pictures. As expected, the accuracy of the model grows as the number of training examples increases.

Focusing on the best SVM model obtained, whose learning curve is shown in Figure 4, a case of high bias is observed, which may indicate an underfitting problem. The model does not fit the data sufficiently; it lacks information and requires more parameters to reduce the error.
Specifically, method limitations were observed with regard to the clustering and SVM structure. Clustering for data preparation and SVM training are driven strictly by the colour space parameters of the pixels, which appear to be insufficient for the segmentation. The problem does not appear to be related to the selection of channels or colour spaces, nor to the absence of any particular one, but to the nature of the method. Additional procedures are required for a segmentation that performs an analysis based on regions and textures.
In the case of the deep-learning model, trainings were performed with different dataset configurations, network architecture parameters and training options, which are presented in Tables A1 and A2. The results for each model are shown in Tables 2 and 3. The mean test accuracy of all 30 models generated iteratively with different dataset distributions was calculated, as well as the mean test accuracy of the best model. According to the results, model 13, with an accuracy of 85.05%, was found to be the best. Model 15 has a very similar accuracy (84.90%), obtained with a doubled validation patience parameter. Furthermore, the comparison of accuracy between models 11 (83.07%) and 10 (84.67%) shows that the application of the CLAHE image enhancement pre-process did not improve the results. Valuable information is obtained from the training curves, as presented in Figure 5 for model 13.
A high variability in accuracy is observed during training due to the differences between training images. As previously stated, the objective was to generate a segmentation model capable of working with pictures that include complex backgrounds and regions with problematic lighting. The pictures whose characteristics are more difficult to discriminate are responsible for the poor results frequently produced during training.
Once the best model was chosen, training was executed employing the same parameters with different dataset sizes in order to analyse the evolution of the test accuracy. The aim was to predict the possibility of model improvement if new training pictures were added. The dataset size varied between 20 and 121 original images, i.e., 160 and 968 images after applying data augmentation, in steps of 10 original images. The test accuracy obtained in each case evidenced an enhancement as the dataset size increased, as shown in Figure 6. This trend suggests that by increasing the size of the training set, a more accurate model could be achieved.
To perform a comparison between the best models found for SVM and deep learning, the following indicators were defined:

precision = tp / (tp + fp) × 100
recall = tp / (tp + fn) × 100
F1 score = 2 × precision × recall / (precision + recall)

where tp is the number of true positives, fp the number of false positives and fn the number of false negatives.
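These indicators can be computed directly from a predicted mask and its ground truth; a sketch (function name ours):

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """pred, truth: binary masks (1 = leaf, 0 = non-leaf).
    Returns (accuracy, precision, recall, F1), all as percentages."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)      # leaf pixels correctly detected
    tn = np.sum(~pred & ~truth)    # background correctly detected
    fp = np.sum(pred & ~truth)     # background classified as leaf
    fn = np.sum(~pred & truth)     # leaf pixels missed
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

Note that accuracy counts true negatives while precision, recall and F1 do not, which is why the two model families can rank differently on different indicators.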
The deep-learning model has a better average result on accuracy, recall and F1 score, whereas the SVM model presents a better average precision, as shown in Table 4. This means that the deep-learning model is less restrictive, generating a lower number of false negatives, which leads to a 5.16% higher recall. However, as it is less restrictive, more false positives are also found, which lowers the precision by 2.19%. Taking into account the accuracy and F1 score, it can be determined that the best model is the deep-learning one.

Moreover, the percentages of test pictures that obtain the best result with each model have also been computed for each indicator and presented in the 'Best model' row of Table 4. In this case, the best result is achieved by the SVM model except for recall. However, as can be seen in the following row, the best-model percentages vary significantly in favour of the deep-learning model if it is considered the best with a result that is higher or up to 3% lower. In contrast, proceeding in the same way in favour of the SVM model does not improve its results so substantially. These improvements derived from the 3% margin in each case are summarised in the last row of Table 4: the enhancement in the percentage of test pictures that obtain their best result with each model, averaged over all indicators, would be 15.18% for SVM and 42.86% for deep learning. These results serve as an argument to prioritise the deep-learning model in future work, since a small improvement in its indicators (3%) would lead to a significant improvement in the comparative results with respect to the SVM model, in terms of percentage of best model (42.86%). In Figure 7, an example of the image segmentations made by the models for a test picture is shown.
The segmentation mask of the model is overlaid on the original picture, assigning the green and yellow colours to "leaf" and "non-leaf" classes, respectively. The mask of the model's errors is overlaid on the original picture, assigning blue to the false positives (they are not leaves, but have been classified as such) and red to the false negatives (they are leaves, but have not been classified as such).
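The error overlay described above can be sketched as follows (the blending factor and function name are ours):

```python
import numpy as np

def error_overlay(img, pred, truth, alpha=0.5):
    """Blend error colours into an RGB uint8 image:
    blue = false positive (non-leaf classified as leaf),
    red  = false negative (leaf classified as non-leaf)."""
    out = img.astype(float).copy()
    pred, truth = pred.astype(bool), truth.astype(bool)
    fp = pred & ~truth                    # classified as leaf, but is not
    fn = ~pred & truth                    # leaf missed by the model
    out[fp] = (1 - alpha) * out[fp] + alpha * np.array([0.0, 0.0, 255.0])
    out[fn] = (1 - alpha) * out[fn] + alpha * np.array([255.0, 0.0, 0.0])
    return out.astype(np.uint8)
```

Correctly classified pixels keep their original colour, so the blue and red regions immediately localize where each model fails.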

Prediction, Training and Data Preparation Time
Not only performance but time cost of the models should be accounted since economic and technical restrictions have to be considered. Prediction, training and data preparation times were analysed in order to evaluate the feasibility of the methods and to underscore the differences between them.

The prediction time of the model for image segmentation is crucial to determine the feasibility of its implementation in a future field application. Additionally, it must be noted that the available computing power would be significantly lower if the prediction were not performed remotely. The average times for the segmentation of the test pictures by the SVM models are presented in Table 5. According to the results, SVM prediction time increases with the number of training images: the larger the dataset, the more complex and heavier the model, so the prediction requires a higher computational cost. In the case of the deep-learning models, prediction times are affected by the encoder depth parameter, which defines the number of network layers. Table 5 summarizes the average prediction times for trained deep-learning models with the same encoder depth values.

From the average prediction times of the test pictures in the best SVM and deep-learning models, shown in Table 6, it can be seen that the SVM model takes approximately 42 times longer. The SVM model is simpler, but requires an individual prediction for each of the pixels that make up the picture. In contrast, the deep-learning model with the SegNet architecture, based on an encoder-decoder structure, is more agile in prediction.

The time needed for model training is not a determining parameter in the comparison between models, since it is a machine processing time and training is performed only once. However, it is interesting to take it into account, as excessively high times could be a problem for future training with a greater number of pictures. Table 7 shows how the training time of the SVM models rises as the dataset increases, as expected.
In the case of the deep-learning model, the training time depends on the number of training images that compose the dataset, together with the validation frequency and validation patience parameters that define the stop criteria, as well as the learning rate and the regularization value. Based on the results of the best SVM and deep-learning models, indicated in Table 8, the SVM model takes approximately 67 times less training time. The SVM model is simpler and its training does not require the computational capacity that the deep-learning model does.

The time it takes to prepare the data for training is a key factor in the process. It can be a bottleneck and the most determinant procedure, as the resulting model will only be as good as the data it is trained with. The data preparation times for both methods are compared in Table 9.
In the case of SVM, the computation time of the clustering and the time of manual classification of the clusters by means of the GUI are taken into account. For the deep-learning model, the manual editing time needed to adjust each mask perfectly must be added to the mask generation time of the previous process. Per picture, the preparation time of the SVM model is approximately 27 times lower than that of the deep-learning one. The manual editing of the ground-truth masks represents the biggest stumbling block in this process. Considering these times for all the pictures used in both cases, the total time is approximately 10 hours for SVM (251 pictures) and 126 hours for deep learning (121 pictures). The deep-learning model required 116 additional hours for half the number of training images.

Discussion
In general, the segmentation methods performed accurately with elements such as trunks, branches, sky and clouds. False positives were found with green fruits and green branches. Besides, false negatives with leaves in regions of problematic illumination have also been reported.
SVM has been demonstrated to be limited in this application. Despite having a much shorter data preparation time, significant improvement in prediction is not expected, no matter how much the dataset size is increased. Nor does it seem promising to study new hybrid colour spaces composed of other channels, or to add new ones. This limitation lies in the clustering method itself, rather than in the data.
Better results were obtained with deep learning, even using few pictures for training. Due to the size of the dataset, which represents a limitation, the model could not fully develop the ability to recognize the shapes and textures of the elements of interest. An accuracy improvement of the deep-learning model is expected with a larger training set. Specifically, it is considered important to add training cases in which the model has presented poorer results, so as to steer the convergence of the training process through the composition of the dataset itself. If the dataset contains more regions with doubtful fruits or branches, the model will be forced to extract differentiating features that improve their discrimination, in order to optimize the monitored parameter.

Conclusions
A segmentation method to discriminate leaf and non-leaf regions in images has been presented in this paper. SVM and deep-learning models were proposed to achieve this objective and were found to have F1 scores of 83.11% and 86.27%, respectively. The SVM model has shown limitations in terms of further improvement of the results; however, it requires a much shorter data preparation time.
The deep-learning model was selected as the best option and it is proposed as the one to be developed in a future work.
Next steps would involve the implementation of the developed model in a portable unit, together with a thermal camera to measure leaf temperature under field conditions, and its comparison with other methods. Additionally, future studies should investigate the addition of new segmentation classes to the model representing the different elements that can be found in the pictures, e.g. one class for branches, another for fruits, the sky, etc. Defining an exclusive class for these elements can facilitate the extraction of their specific characteristics, as opposed to encompassing them in the background; moreover, information about the relative position of the classes could then also be taken into account. Nevertheless, a significant increase in mask generation and classification time has to be considered. Other interesting options could be the definition of individual models for different phenotypes, since leaf shape is characteristic of each of them, or the use of a pre-trained model with initialized weights, which would reduce the need to increase the dataset size, as previous experience is leveraged. The procedure presented could also be followed to generate a leaf segmentation model for other species. The image segmentation models proposed here could also be employed in other applications for the measurement of image-based parameters in other ranges of the electromagnetic spectrum.