Application of Deep Learning Architectures for Accurate Detection of Olive Tree Flowering Phenophase

Abstract: The importance of monitoring and modelling the impact of climate change on crop phenology in a given ecosystem is ever-growing. For example, these procedures are useful when planning various processes that are important for plant protection. In order to proactively monitor the phenological response of the olive (Olea europaea) to changing environmental conditions, it is proposed to monitor the olive orchard with moving or stationary cameras, and to apply deep learning algorithms to track the timing of particular phenophases. The experiment conducted for this research showed that hardly perceivable transitions in phenophases can be accurately observed and detected, which is a presupposition for the effective implementation of integrated pest management (IPM). A number of different architectures and feature extraction approaches were compared. Ultimately, using a custom deep network and a data augmentation technique during the deployment phase resulted in a fivefold cross-validation classification accuracy of 0.9720 ± 0.0057. This leads to the conclusion that a relatively simple custom network can prove to be the best solution for a specific problem, compared to more complex and very deep architectures.


Introduction
Climate change has had a significant influence on the allocation and population of various species on the planet, including the olive (Olea europaea L.) and olive pests [1,2]. The Mediterranean is the main olive-producing area of the world. It is therefore susceptible to increasing challenges in the field of integrated pest management (IPM). This also includes organic farming systems, whose practices are based in ecology, an example of which is biological pest management. Tang et al. [3] state that the optimal timing of pest management is of great importance here, and that it must take into account the phenological stages of the treated plants, the density of the initial pest population, the ratios and release timings of the pests' natural enemies, the dosage and timing of insecticide applications, and the instantaneous kill rates of these pesticides for both the pests and their natural enemies. Therefore, there has been a lot of research interest in modelling a plant's adaptation to its surrounding climate, as well as its response to recent climate changes [4]. Olives are the topic of many such papers and models, for example, De Melo-Abreu et al. [5], Osborne et al. [6] and Aguilera et al. [7]. In [8], it was concluded that the climate (especially the temperature and photoperiod) greatly influences olive flowering. However, these results differ greatly with varying geographical altitude and latitude. Similarly, Garcia-Mozo et al. [9] state that different olive cultivars have flowering phenology variations, which should also be taken into account.
Additionally, experience has shown that other factors within the same orchard can shift the beginning of the flowering phase by multiple days [10]. These factors are more difficult to model, such as the age and general condition of trees or the microlocation (e.g., facing the east or having a higher altitude). This led Osborne et al. [6] to conclude that, in order to use any phenological model for a single olive population on the regional level, the model first has to be tested against a great number of different factors, such as various locations and cultivars.
At the same time, certain pests have adapted to the phenology of olive trees exceptionally well. The olive moth (Prays oleae), for example, develops three generations within the same year [11]. Treatments have to be used against two generations: the anthophagous (flower) generation and the carpophagous generation. At the moment, there are a number of standard control methods for the olive moth, including conventional pesticide applications. However, it is not advised to control the spring generation of moths with insecticides, as this results in catastrophic consequences for the beneficial animal species in olive orchards as well [12]. An alternative and effective method of microbiological control is the application of Bacillus thuringiensis, with its optimal application date being when 5% of the flowers have opened [13]. However, due to the previously explained reasons, in larger olive orchards it is hard to determine the correct point in time for the application of this method. This is why the goal of this study is to test the possibilities of applying machine learning (ML) and, especially, deep learning (DL) technologies in the field of phenological stage classification, with the aim of achieving optimal pest management.
The term deep learning refers to the use of artificial neural network (ANN) architectures consisting of a large number of processing layers. With the development of hardware (especially GPUs), the aforementioned models have become computationally feasible, which resulted in many possible applications. They are used in the fields of image and voice recognition, remote sensing and other complex processes which deal with the analysis of large amounts of data. The use of deep learning techniques for agriculture began a number of years ago, but to a relatively limited degree. Kamilaris and Prenafeta-Boldú [14] offer a detailed overview of concrete issues in agriculture, the used frameworks, models, sources and data preprocessing, as well as the achieved performance data. It can be concluded that, in the context of use in agriculture, deep learning techniques are highly accurate, achieving better results than the techniques currently commonly used in image processing. This paper is structured as follows: Section 2 provides the basic data on the original dataset used in the experiment, as well as on the tested deep learning algorithms. Section 3 introduces additional improvements made during the deployment phase and includes a discussion of the results reached as part of the experiment. Finally, the main findings are summarized in Section 4.

Dataset
The study was carried out in southern Croatia. During April, May and June of 2019, hundreds of images were collected with stationary cameras in an olive orchard. The images were obtained over multiple days and at all times of the day, i.e., in various lighting conditions. The weather ranged from clear and sunny to overcast and cloudy.
For the initial proof of concept, a Sony 16 MP entry-level mirrorless camera was used. Further on in the experiment, we used a C920 HD Pro webcam, which can take 15 MP photos. Both cameras were set at a distance of 40 to 50 cm from the tree canopy. The format used with both cameras was JPEG. In order to increase the sharpness and reduce potential noise due to lower lighting conditions, the images were resized to 1600 × 900 pixels.
For the purpose of machine learning, six image patches (256 × 256 pixels) were extracted from the central part of each original image. This ensured a balanced dataset of 1400 images showing details of the tree canopies during various phenological stages. Histogram equalization was applied to each image patch to reduce the effect of varying lighting conditions. Following that, edge detection algorithms were applied to the images with the goal of finding the most suitable variant for the experiment. All of the mentioned preprocessing steps are shown in Figure 1.
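The patch extraction and equalization steps can be sketched as follows. This is a minimal illustration, not the authors' code: the 2 × 3 patch grid and the per-channel equalization are assumptions, since the paper does not specify the exact layout.

```python
import numpy as np

def central_patches(image, patch=256, rows=2, cols=3):
    """Cut a rows x cols grid of non-overlapping patches from the image centre."""
    h, w = image.shape[:2]
    top, left = (h - rows * patch) // 2, (w - cols * patch) // 2
    return [image[top + r * patch: top + (r + 1) * patch,
                  left + c * patch: left + (c + 1) * patch]
            for r in range(rows) for c in range(cols)]

def equalize_channel(channel):
    """Plain histogram equalization of one uint8 channel via its CDF."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    lut = np.round(255.0 * (cdf - cdf.min()) / max(cdf.max() - cdf.min(), 1))
    return lut.astype(np.uint8)[channel]

frame = np.random.randint(0, 256, (900, 1600, 3), dtype=np.uint8)  # stand-in image
patches = [np.dstack([equalize_channel(p[..., c]) for c in range(3)])
           for p in central_patches(frame)]
print(len(patches), patches[0].shape)  # 6 (256, 256, 3)
```

Equalizing each patch rather than the whole frame keeps the normalization local, which matches the stated goal of suppressing lighting variation within the canopy details.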
With the aid of an expert in this field, the images were classified as zero or one, depending on whether the start of the flowering phase was observed. Both classes have the same number of samples (700). Figure 2 shows examples of both classes (upper row: class 0, lower row: class 1).
The dataset was then divided into three parts: 200 images were set aside for the model's final testing (test dataset), while the rest (1200 images) were divided into a training dataset of 1000 images and a validation dataset of 200 images. All datasets have the same class ratio as the original dataset. The validation dataset was used during the learning phase to tune the hyperparameters of a classifier (i.e., the architecture, number of epochs, etc.) and then for the final model selection [15]. The test dataset was never used during the training phase, so it could be used for a less biased final estimate of the model's ability to generalize.
As expected during the course of the experiment, the dataset with 1000 images was insufficient to successfully carry out the learning phase. This is why data augmentation was used as the primary method for overfitting reduction in the context of a limited learning dataset [16]. Original images were used as templates to generate new artificial learning examples. For this purpose, various image distortion techniques were used at random, such as zooming, cropping, translating, flipping the image horizontally or vertically, rotation and color modification. This resulted in an increase in the learning dataset to 7000 images.
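The augmentation step can be sketched in a framework-free way. The paper names the transformations (zooming, cropping, translating, flipping, rotation, colour modification); the simplified subset below, the parameter ranges and the 1:6 expansion ratio are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Apply a random combination of simple distortions to one square training patch."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                          # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]                          # vertical flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotation in 90-degree steps
    dy, dx = rng.integers(-16, 17, size=2)
    out = np.roll(out, (int(dy), int(dx)), axis=(0, 1))  # translation (wrap-around)
    gain = rng.uniform(0.8, 1.2)                    # crude brightness/colour change
    return np.clip(out.astype(np.float32) * gain, 0, 255).astype(np.uint8)

# Stand-in for the 1000 original patches; each original yields six artificial copies.
train = [np.zeros((256, 256, 3), np.uint8) for _ in range(10)]
augmented = train + [augment(img) for img in train for _ in range(6)]
print(len(augmented))  # 70 (mirroring the 1000 -> 7000 expansion)
```

In practice a generator such as Keras's `ImageDataGenerator` would produce these distortions on the fly; the explicit version above just makes the mechanics visible.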
The initial experiment tested the possibilities of using the technology for the detection of the flowering phase onset, i.e., for the recognition of open flowers. According to the Biologische Bundesanstalt, Bundessortenamt und CHemische Industrie (BBCH) scale, this is the transition from phase 59 to phase 60 or 61 [17]. The task is far from trivial as, visually, open flowers vary greatly, depending on the angle and distance of the camera, the lighting (light intensity, color temperature, etc.), objects obstructing the view of the flowers, and other conditions. Figure 3 shows various examples of open flowers. Note that the shown details were magnified and represent around 5% of the size of the images from the learning dataset.
Approaches to learning from scratch were also compared to the application of transfer learning [22]. CNN training was implemented in Python, using the Keras [23] and TensorFlow [24] deep learning libraries.
The hardware used to carry this out was an NVidia CUDA-enabled GeForce GTX 1080 Ti graphics processing unit (GPU) with 11 GB of memory, an AMD Ryzen 5 CPU and 32 GB of RAM. The operating system was Ubuntu 16.04 Linux, and CNN training was conducted on the GPU.
As the aforementioned architectures did not reach the expected or satisfactory results, either in learning speed or in classification accuracy, a custom model inspired by the VGG architecture was included in the experiment. The model is shown in Figure 4. A number of convolutional, pooling, normalization and fully connected layer combinations were tested with the goal of reaching the desired results on the test dataset. The research showed that, for this specific problem and amount of data, there is an efficient architecture which is far simpler than the general very deep models.
The results of the test are shown in Table 1, where the percentage of accuracy reached for the unseen test dataset is displayed. Of course, the training accuracy is significantly better, and is sometimes even close to 100%. However, this is a case of overfitting, i.e., a model's insufficient ability to generalize. This ability to generalize has proven to be practically the most important characteristic of a network for the application intended in this research.
The mentioned state-of-the-art deep architectures were tested by starting the learning process from scratch with random weight initialization, as well as in transfer learning scenarios using ImageNet-pretrained models. It is well known that, with transfer learning, a significant domain discrepancy results in a significant drop in feature transferability for higher layers. This is why a large number of models with various numbers of pretrained layers were tested. The results showed that some networks (especially those marked with *) suffer from a convergence problem due to insufficient training data; they also require quite a long time to converge. In the context of this issue and in relation to the available data, an observation was made that contrasts with prior experience: initializing the weights from a pretrained network and fine-tuning do not significantly contribute to a higher classification accuracy.
As is visible in Table 1, the best results were achieved with a variant of the custom (VGG-inspired) deep convolutional network, with the use of an augmented training dataset. It was also found that the optimal architecture includes 14 learnable weight layers. Models with fewer and with more layers were also tested: architectures with 12, 13, 15 and 16 layers all reached somewhat poorer results on the test dataset. This shows that, for this specific problem, there is an optimal level of network complexity.
Details of the optimal architecture are shown in Table 2, where Conv2D(n) denotes the 2D convolution layer with n filters, ACT denotes the activation function, BN denotes the batch normalization layer, MP denotes the max pooling layer and DR denotes the dropout layer. The activation functions tested in this model are rectified linear units (ReLU) and leaky rectified linear units (LReLU) [25]. The size and configuration of the applied network are the result of a large number of experiments, including a grid search approach for determining certain parameters. These are the values of the more important hyperparameters: the number of epochs is 100, the mini-batch size is 16, the initial learning rate is 0.0001, and the optimizer is RMSProp (Root Mean Square Propagation) [26]. During the learning phase, the callback mechanism was used to save the best model, based on the model's performance on the validation dataset. The dynamic learning rate is controlled by a callback function that reduces the learning rate when a defined metric has stopped improving for a given number of epochs.
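The learning-rate callback described above can be sketched as follows. Keras provides built-in `ModelCheckpoint` and `ReduceLROnPlateau` callbacks for exactly this purpose; the class below is a minimal, framework-free re-implementation of the plateau logic, and the `factor`, `patience` and `min_lr` values are illustrative assumptions (only the initial learning rate of 0.0001 comes from the text).

```python
class ReduceLROnPlateau:
    """Reduce the learning rate when the monitored metric (here validation loss)
    has not improved for `patience` consecutive epochs."""
    def __init__(self, lr=1e-4, factor=0.5, patience=5, min_lr=1e-6):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.wait = 0

    def on_epoch_end(self, val_loss):
        if val_loss < self.best:          # improvement: remember it, reset the counter
            self.best, self.wait = val_loss, 0
        else:                             # plateau: count epochs, then reduce the rate
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

sched = ReduceLROnPlateau(lr=1e-4)
for loss in [0.50, 0.60, 0.60, 0.60, 0.60, 0.60]:  # one improvement, then a plateau
    lr = sched.on_epoch_end(loss)
print(lr)  # 5e-05
```

The best-model checkpointing mentioned in the text would track the same validation metric and save the weights whenever `val_loss < best`.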
In order to prevent the neural network from overfitting, batch normalization (BN) was used [27]. This method simultaneously reduces the internal covariate shift and accelerates the training of deep networks. However, experimentation has shown that, for this specific experiment, it is necessary to use the classic dropout method [28] for additional control, even though, according to the literature, using BN practically eliminates the need for dropout.
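The two regularization mechanisms can be illustrated with a framework-free sketch (in a Keras model they are single layers, `BatchNormalization()` and `Dropout(rate)`). The dropout rate and epsilon below are illustrative assumptions, and the batch-norm version uses plain mini-batch statistics without the learned running averages.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Standardize each feature over the mini-batch, then apply the learned
    scale (gamma) and shift (beta)."""
    return gamma * (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps) + beta

def dropout(x, rate=0.5, rng=None):
    """Classic inverted dropout: zero random activations during training and
    rescale the survivors so the expected activation stays unchanged."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

batch = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
normed = batch_norm(batch)          # each column now has (near-)zero mean
dropped = dropout(np.ones((4, 4)))  # entries are either 0 or 2 (rescaled survivors)
```

BN stabilizes the distribution each layer sees, while dropout injects noise into the activations; the experiments described above found the two complementary for this dataset.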
The input was a fixed-size 256 × 256 RGB image, which was processed in various ways during the experiment in order to find effective features. For example, it was attempted to use information about color changes in the flower bud at the beginning of the flowering phase. However, it was concluded that the shape of the flower is actually a more reliable piece of information. This is why multiple methods were used to find the necessary features, especially various edge detection methods, including classic approaches (such as Sobel and Canny methods).
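One of the classic baselines mentioned, the Sobel operator, can be sketched with plain NumPy. This is an illustrative implementation, not the one used in the paper; the toy step-edge image exists only to show the response.

```python
import numpy as np

KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)  # horizontal gradient kernel
KY = KX.T                                                    # vertical gradient kernel

def filter2same(img, k):
    """Naive 'same'-size 2-D filtering with zero padding (fine for 3x3 kernels)."""
    p = k.shape[0] // 2
    padded = np.pad(img, p)
    out = np.zeros_like(img, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def sobel_magnitude(gray):
    """Gradient magnitude from the two Sobel responses."""
    return np.hypot(filter2same(gray, KX), filter2same(gray, KY))

step = np.zeros((8, 8))
step[:, 4:] = 1.0                     # vertical step edge between columns 3 and 4
print(sobel_magnitude(step)[4, 3:5])  # [4. 4.] -- strong response at the edge
```

Canny adds Gaussian smoothing, non-maximum suppression and hysteresis thresholding on top of this gradient stage.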
However, the best results were achieved by a newer, physics-inspired method: the phase stretch transform (PST) [29]. The authors recommend using an algorithm that works through a nonlinear dispersive phase operation. It transforms the image by emulating the propagation of light through a physical medium with a specific warped diffractive property. The authors concluded that the output phase of the transform reveals transitions in image intensity, which can be used for edge detection and feature extraction. In [30], the authors demonstrated that the PST algorithm, due to its natural equalization mechanism, is able to provide more information on contrast changes in both bright and dark areas of the image, where conventional edge derivative operators fail to visualize sharp contrast changes. The algorithm has been successfully applied to various application areas; for example, in [31], the authors have shown that it exhibits image segmentation performance with an accuracy of 99.74%. Figure 5 shows the results of PST algorithm use. For the proposed PST method, the designed parameters are LPF = 0.2, phase strength S = 1, warp strength W = 10, minimum threshold Tmin = −0.5 and maximum threshold Tmax = 0.5.
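The PST pipeline as described (low-pass filtering, a warped frequency-domain phase kernel controlled by S and W, and thresholding of the output phase) can be sketched as follows. This is a simplified reading of the published algorithm, not the reference implementation; in particular the Gaussian low-pass parameterization and the kernel normalization are assumptions.

```python
import numpy as np

def pst_edges(gray, lpf=0.2, S=1.0, W=10.0, t_min=-0.5, t_max=0.5):
    """Simplified phase stretch transform: smooth the image, apply a nonlinear
    dispersive phase kernel in the frequency domain, threshold the output phase."""
    h, w = gray.shape
    u = np.fft.fftfreq(h)[:, None]
    v = np.fft.fftfreq(w)[None, :]
    rho = np.hypot(u, v)                                        # radial frequency

    smoothed_f = np.fft.fft2(gray) * np.exp(-(rho / lpf) ** 2)  # Gaussian low-pass

    # Warped phase kernel, scaled so its maximum equals the phase strength S
    kernel = rho * W * np.arctan(rho * W) - 0.5 * np.log1p((rho * W) ** 2)
    kernel = S * kernel / kernel.max()

    phase = np.angle(np.fft.ifft2(smoothed_f * np.exp(-1j * kernel)))
    return (phase > t_max) | (phase < t_min)                    # binary edge map

edges = pst_edges(np.random.default_rng(0).random((64, 64)))
```

The default parameter values mirror those stated in the text (LPF = 0.2, S = 1, W = 10, thresholds ±0.5).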

Results and Discussion
The experiments showed that the recommended model reaches the best results on the validation dataset after 37 epochs. At that point, the validation accuracy is 0.940. Using the same model on the test dataset results in an accuracy of 0.945. Figure 6 shows details of the experimental results that helped find the optimal architecture. First, the impact of the training dataset size is of note. Even though, using 1000 examples, the model can achieve a training accuracy of up to 90%, the result of applying this model to the validation dataset (accuracy 60%) shows that this is a case of overfitting. A satisfactory level of generalization could only be reached after increasing the size of the training dataset to 7000 examples. This shows that, considering the limited amount of original data, data augmentation was a necessary step.
Similarly, in the context of this issue and the available data, it was proven that the best results are achieved with batch sizes of eight and 16. Using a larger batch size can have a negative effect on the accuracy of the network, especially considering a model's ability to generalize. This is seen in the results achieved with the validation dataset. These are, in part, related to the findings in [32]: on the one hand, increasing the batch size reduces the range of learning rates that provide stable convergence and acceptable test performance, which was to be expected. On the other hand, small batch sizes provide more up-to-date gradient calculations, which results in more stable and reliable training.
When it comes to optimizers, all the applied adaptive optimizers (RMSProp, Adam [33], Nadam [34]) achieved similar results after a sufficient number of epochs, although some differences in the speed of convergence can be noticed. The differences in the respective models' ability to generalize were not significant. A classic SGD optimizer was also tested, but its slow convergence was an issue, so its effect is not comparable to that of the applied adaptive optimizers.
The evolution of the accuracies and losses for the training and test datasets is shown in Figure 7. The optimal architecture shown in Table 2 was used, applying the hyperparameters obtained in the experiments described in Figure 6.
Figure 8 shows the confusion matrix, which summarizes the best classifier performance. As already mentioned, the accuracy is 0.9450. The model produced a similar number of false positives and false negatives, so the precision amounts to 0.9406, recall (sensitivity) is 0.9500, and the F1-score (the harmonic mean of precision and recall) is 0.9453. Since a number of test samples (5.5%) were still wrongly classified, there is obviously space for further performance improvement of the model. This is why the experiment shown in Figure 9 was conducted.

Figure 9. Neural network architecture for phenophases detection.
Our idea is based on the fact that, due to the limited size of learning datasets, applying an image augmentation technique during the learning phase is of great importance: the random changes applied to training examples can reduce a model's dependency on individual features. This means that this methodology can increase the model's capability for generalization.
This is why it can be concluded that the same technique, used in the final model application phase, could further improve the performance. Therefore, for every image in the test dataset, six new examples were created by using the same data augmentation transformations as in the learning phase. Figure 10 shows the original test image and the six corresponding generated augmented images. The previously saved model was used on each of these artificially created examples, as well as on the original test image (a total of seven images), and the decision about the final classification of the test example was made based on the majority voting principle. Such use of a data augmentation technique during the deployment phase resulted in an improvement in the model's performance by another 3%: the accuracy on the test dataset rose from 0.9450 to 0.9750. Precision amounts to 0.9703, recall (sensitivity) is 0.9800, and the F1-score (the harmonic mean of precision and recall) is 0.9751. Finally, the model's robustness was evaluated by a variant of 5-fold stratified cross-validation. The available data was divided into five folds, each with training, validation and test datasets. Validation datasets were used for regularization by early stopping of the learning process. We found only a small difference in the accuracy reached across the test datasets.
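The deployment-phase majority vote can be sketched independently of the trained network. The distortions and the stand-in classifier below are illustrative assumptions; in the paper, the seven votes come from the saved Keras model applied to the original test image and its six augmented copies.

```python
import numpy as np

rng = np.random.default_rng(1)

def tta_predict(classify, image, n_aug=6):
    """Classify the original image plus n_aug randomly distorted copies and
    return the majority vote over the (n_aug + 1) binary decisions."""
    def distort(img):
        out = img[:, ::-1] if rng.random() < 0.5 else img      # random flip
        return np.roll(out, int(rng.integers(-8, 9)), axis=1)  # random shift
    votes = [classify(image)] + [classify(distort(image)) for _ in range(n_aug)]
    return int(sum(votes) > len(votes) / 2)

# Stand-in classifier: thresholds mean brightness (a real model would be used here).
dummy = lambda img: int(img.mean() > 0.5)
print(tta_predict(dummy, np.ones((256, 256))))   # 1
print(tta_predict(dummy, np.zeros((256, 256))))  # 0
```

An odd total of seven votes guarantees a strict majority, so no tie-breaking rule is needed.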
The average classification accuracy and standard deviation were 0.9720 ± 0.0057. Our idea is based on the fact that, due to their limited size, the application of an image augmentation technique on learning datasets during the learning phase is of great importance-the random changes applied to training examples can reduce a model's dependency on individual features. This means that this methodology can increase the model's capability for generalization. This is why it can be concluded that the same technique, used in the final model application phase, could further improve the performance. Therefore, for every image in the test dataset, six new examples were created by using the same data augmentation transformations as in the learning phase. Figure 10 shows the original test image and six corresponding generated augmented images. The previously saved model was used on each of these artificially created examples, as well as on the original test image (a total of seven images), and the decision about the final classification of the test example was made based on the majority voting principle.
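The deployment-phase procedure described above can be sketched as follows. This is a minimal illustration: `model_predict` and `augment` are placeholders for the trained network and the training-phase augmentation transformations, which are not reproduced here.

```python
def classify_with_tta(image, model_predict, augment, n_aug=6):
    """Deployment-phase test-time augmentation with majority voting.

    `model_predict` and `augment` are placeholders: the former stands for
    the previously saved trained network, the latter for the same random
    augmentation transformations used during the learning phase.
    """
    votes = [model_predict(image)]                          # original test image
    votes += [model_predict(augment(image)) for _ in range(n_aug)]
    # Final decision: the most frequent class label among the 7 predictions.
    return max(set(votes), key=votes.count)
```

With `n_aug=6`, seven predictions are collected in total, so a simple plurality over the class labels implements the majority voting principle described in the text.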
Such use of a data augmentation technique during the deployment phase resulted in an improvement in the model's performance by another 3%: the accuracy on the test dataset rose from 0.9450 to 0.9750, the precision amounts to 0.9703, the recall (sensitivity) is 0.9800, and the F1-score (the harmonic mean of precision and recall) is 0.9751. Finally, the model's robustness was evaluated by a variant of 5-fold stratified cross-validation. The available data was divided into five folds, with each fold providing training, validation and test datasets; the validation datasets were used for regularization by early stopping of the learning process. Only small differences in accuracy were observed across the test datasets: the average classification accuracy and standard deviation were 0.9720 ± 0.0057.
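A stratified split with per-fold training, validation and test sets can be sketched as below. The exact fold assignment (e.g. which fold serves as the validation set) is an assumption for illustration; the text only states that every fold had training, validation and test datasets.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Split example indices into k folds with near-equal class ratios."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)   # round-robin keeps strata balanced
    return folds

def cv_splits(labels, k=5):
    """Each fold serves once as the test set; here the next fold is held
    out as a validation set (used for early stopping), and the remaining
    folds form the training data. This assignment is an assumption."""
    folds = stratified_folds(labels, k)
    for i in range(k):
        test = folds[i]
        val = folds[(i + 1) % k]
        train = [idx for j, fold in enumerate(folds)
                 if j not in (i, (i + 1) % k) for idx in fold]
        yield train, val, test
```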
To our knowledge, this is the first study that addresses the application of imaging sensors and deep learning techniques to the detection of olive tree flowering phenophases. Olive phenology has been the subject of several papers; however, their authors used essentially different approaches, so the results cannot be compared directly. For example, the authors in [35] examined the impact of weather-related variables on flowering phenophases and constructed regression-based models. Similarly, the authors in [36] investigated the use of artificial neural networks in olive phenology modelling, but again using meteorological data. Aguilera et al. [7] developed pheno-meteorological regression models to forecast the main olive flowering phenological phases.

Conclusions
Our study assessed the efficiency of various CNN-based models in the context of the phenological state classification of plants, in this case olive trees. The conducted experiment showed that this classification can be realized very efficiently by using deep learning algorithms. The best results were achieved with a custom VGG-inspired network comprising 14 learnable weight layers, trained on an augmented dataset. The accuracy can be improved further by using data augmentation and majority voting procedures during the deployment phase, where the final classification accuracy amounts to 97.20%. This means that the entire process is viable and applicable under real conditions.
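A VGG-inspired stack with 14 learnable weight layers could, for instance, look like the sketch below. Only the layer count is stated in the text; the block layout, filter sizes and the two-class head are assumptions made for illustration.

```python
# Hypothetical VGG-style layer stack with 14 weight-bearing layers
# (12 convolutional + 2 dense). Pooling and flatten layers carry no
# learnable weights. All sizes here are illustrative assumptions.
WEIGHT_LAYERS = ("conv", "dense")

architecture = (
    ["conv:32", "conv:32", "maxpool"] +                  # block 1
    ["conv:64", "conv:64", "maxpool"] +                  # block 2
    ["conv:128", "conv:128", "conv:128", "maxpool"] +    # block 3
    ["conv:256", "conv:256", "conv:256", "maxpool"] +    # block 4
    ["conv:512", "conv:512", "maxpool"] +                # block 5
    ["flatten", "dense:256", "dense:2"]                  # classifier head
)

n_weight_layers = sum(layer.split(":")[0] in WEIGHT_LAYERS
                      for layer in architecture)
print(n_weight_layers)  # 14
```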
To illustrate the importance of the monitoring of plant phenology, it should be noted that the European Union requires the application of eight principles of IPM that are part of sustainable farm management [37]. Principle 3 is titled "Decision based on monitoring and thresholds", and it assumes "an opportunity to develop a new generation of decision support systems". The suggested approach allows for the optimized timing of chemical or biological crop protection applications. For example, a specific protection product may have an optimal application date when 5% of the flowers have opened. For an individual microlocation, the application time window can be limited to only 2 to 3 days, which requires the precise and timely detection of crop phenophases.
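The decision rule in the example above reduces to simple thresholding of the classifier's daily output. The sketch below assumes a daily series of open-flower fractions (e.g. derived from the classifier applied to daily orchard images); the 5% threshold and the 3-day window follow the example in the text, while the daily fractions are invented data.

```python
# Find the treatment window that opens when 5% of the flowers have opened.
# `daily_open_fraction` is a hypothetical per-day series; the threshold
# and window length follow the example given in the text.
def application_window(daily_open_fraction, threshold=0.05, window_days=3):
    for day, fraction in enumerate(daily_open_fraction):
        if fraction >= threshold:
            return day, day + window_days - 1   # first and last viable day
    return None                                  # threshold never reached

fractions = [0.00, 0.01, 0.02, 0.04, 0.06, 0.11, 0.20]
print(application_window(fractions))  # (4, 6)
```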
In the future, the proposed system will be applied to classify a greater range of phenological stages, which represent an important parameter when applying various agricultural procedures. We also plan to use images collected by a drone. Similarly, by securing the necessary learning datasets, the same model can be applied to other crops as well.