Semantic Segmentation Using Deep Learning with Vegetation Indices for Rice Lodging Identification in Multi-date UAV Visible Images

Abstract: A rapid and precise large-scale agricultural disaster survey is a basis for agricultural disaster relief and insurance but is labor-intensive and time-consuming. This study applies Unmanned Aerial Vehicle (UAV) images through deep-learning image processing to estimate rice lodging in paddies over a large area. It establishes an image semantic segmentation model employing two neural network architectures, FCN-AlexNet and SegNet, whose effects on the interpretation of various object sizes and on computation efficiency are explored. High-resolution visible images of rice paddies captured by commercial UAVs are used to calculate three vegetation indices to improve the applicability of visible imagery. The proposed model was trained and tested on a set of UAV images from 2017 and validated on a set of UAV images from 2019. For the identification of rice lodging in the 2017 UAV images, the F1-score reaches 0.80 and 0.79 for FCN-AlexNet and SegNet, respectively. The F1-score of FCN-AlexNet using the RGB + ExGR combination also reaches 0.78 in the 2019 validation images. The proposed model adopting semantic segmentation networks is shown to be more efficient, approximately 10 to 15 times faster, with a lower misinterpretation rate than the maximum likelihood method.


Introduction
Typhoon-associated strong winds and heavy rains frequently cause considerable crop damage that negatively impacts farmers' incomes and crop price stability in the agricultural market. Taiwan is located in one of the areas most susceptible to typhoons in the world. Based on the Taiwan Council of Agriculture (COA) agriculture statistics [1][2][3][4][5], the average annual crop damage cost US$352,482 over the past five years (2014-2018). Additionally, the average crop loss accounts for approximately 28% of the total crop production in Taiwan, affecting 31,009 hectares on average. Accordingly, the Taiwan government established the Implementation Rules of Agricultural Natural Disaster Relief more than 30 years ago and has recently been trying to implement agricultural insurance. In an ideal situation, farmers' incomes can be partially compensated by emergency allowances based on the relief rules. However, limitations such as the shortage of disaster relief funds, high administrative costs, and crop damage assessment disputes under the ongoing disaster relief rules urgently demand improvement.
Among these limitations, crop damage assessment disputes and the associated high administrative costs are critical to the relief implementation. The current crop damage assessment relies heavily on in-situ visual inspection. Fuentes-Pacheco et al. [26] used a SegNet-based CNN architecture to perform pixel-wise fig plant semantic segmentation and achieved a mean accuracy of 93.85%. Grinblat et al. [27] presented successful work utilizing deep CNNs to identify and classify species based on morphology patterns. However, to the best of our knowledge, the potential of using a semantic segmentation neural network for rice lodging identification in UAV imagery has not yet been assessed.
Therefore, to benefit from UAV data and deep-learning network technologies, this paper proposes a rice lodging assessment method that combines UAV images with deep-learning techniques. Specifically, visible-spectrum information of lodged rice and vegetation indices obtained from UAV images are combined to train a total of eight classification models on two semantic segmentation neural networks, SegNet and FCN-AlexNet. The performance of these eight models is evaluated by their image classification accuracy as well as the associated computation time. The overall objective of this paper is to achieve the following purposes:

1. Obtain rice lodging spectrum information from UAV images collected over a study area (about 40 hectares) in Taiwan to reduce the workload of manual in-situ visual field observations.

2. Use UAV images to perform rice lodging classification with semantic segmentation neural network models, aiming to improve the accuracy of lodging assessment and to serve as evidence for subsequent disaster subsidies.

3. Incorporate multiple sources of information, including the visible light spectrum and vegetation indices, into the proposed rice lodging assessment method to improve image classification accuracy.

4. Test two standard image semantic segmentation network models and evaluate their applicability based on computational speed and classification accuracy.

5. Establish a rice lodging image dataset that can serve as a valuable resource for expert systems, disaster relief assistance, and agricultural insurance applications.

Data Description
Rice lodging UAV images of the Mozi Shield Park in Wufeng District, Taichung City, Taiwan, were collected by Sony QX100 (5472 × 3648 pixels) and DJI Phantom 4 Pro (5472 × 3648 pixels) cameras in June 2017 and May 2019, respectively (Figure 1 and Table 1). Both cameras capture images with three spectral channels: red, green, and blue. The total area covered 230 ha, from which a portion of approximately 40 ha was extracted for image inference model testing. Images were taken at flight heights of 230 meters and 200 meters in 2017 and 2019, with associated ground resolutions of approximately 5.3 cm/pixel and 5.5 cm/pixel, respectively (Table 2). Agisoft PhotoScan (Agisoft LLC, St. Petersburg, Russia) was used to stitch the images and obtain high-resolution orthomosaic images. A histogram matching process was implemented using the 2017 images as the base to reduce the lighting discrepancy between the two dates' images [28] (Figure 2). Images taken in 2017 were used for model training and validation, while the images of the same area (40 ha) from these two dates (2017 and 2019) were used for model testing.

Figure 3 depicts the research flow of this study, starting with UAV image capture. The captured UAV images, which consist of red, green, and blue spectrum information (RGB), are stitched to produce RGB orthomosaic images. The image tiles are created after ground-truth labeling, and training-validation and test datasets are created from the RGB and labeled images. In addition to the RGB spectrum information, the model training phase uses three vegetation indices derived from the images as input features. A total of eight classification models with two neural network architectures and four image information combinations are trained. The best weights of each classification model are used for model evaluation. In the test phase, both the 2017 and 2019 image data are used.
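The histogram matching step mentioned above reduces to per-channel CDF matching: each intensity in the 2019 image is mapped to the 2017 intensity occupying the same quantile. A minimal NumPy sketch (illustrative only, not the authors' implementation; function and variable names are assumptions):

```python
import numpy as np

def match_histograms(source, reference):
    """Match each channel of `source` to the histogram of `reference`.

    Both inputs are H x W x C uint8 arrays; returns an array of the
    same shape and dtype as `source`.
    """
    matched = np.empty_like(source)
    for c in range(source.shape[2]):
        src = source[..., c].ravel()
        ref = reference[..., c].ravel()
        # Empirical CDFs of both channels.
        src_vals, src_idx, src_counts = np.unique(
            src, return_inverse=True, return_counts=True)
        ref_vals, ref_counts = np.unique(ref, return_counts=True)
        src_cdf = np.cumsum(src_counts) / src.size
        ref_cdf = np.cumsum(ref_counts) / ref.size
        # Map each source quantile onto the reference intensity scale.
        interp = np.interp(src_cdf, ref_cdf, ref_vals)
        matched[..., c] = interp[src_idx].reshape(
            source.shape[:2]).astype(source.dtype)
    return matched
```

In practice the same mapping would be applied per channel to the 2019 orthomosaic with the 2017 orthomosaic as `reference`.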
The eight classification models are compared with the commonly used Maximum Likelihood Classification (MLC) [29] to evaluate their relative performance.
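For context, MLC assigns each pixel to the class whose fitted multivariate Gaussian yields the highest log-likelihood over the band values. A simplified sketch (not the software used in the study; names and the regularization term are illustrative):

```python
import numpy as np

def train_mlc(samples):
    """Fit a Gaussian per class from {label: (N, bands) array} samples."""
    params = {}
    for label, x in samples.items():
        mean = x.mean(axis=0)
        # Small diagonal term keeps the covariance invertible.
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
        params[label] = (mean, np.linalg.inv(cov), np.log(np.linalg.det(cov)))
    return params

def classify_mlc(pixels, params):
    """Assign each (N, bands) pixel the class with max Gaussian log-likelihood."""
    labels = list(params)
    scores = []
    for label in labels:
        mean, inv_cov, log_det = params[label]
        d = pixels - mean
        mahal = np.einsum('ij,jk,ik->i', d, inv_cov, d)  # Mahalanobis terms
        scores.append(-0.5 * (log_det + mahal))
    return np.array(labels)[np.argmax(scores, axis=0)]
```

The manual step the paper criticizes is the construction of `samples` from analyst-drawn areas of interest, which must be redone for every image.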

Training-validation, and Testing Datasets
With a focus on rice lodging for the model training of semantic segmentation, the ground truth of the UAV images was obtained by manual labeling with the GIMP (GNU Image Manipulation Program) open-source program on a pixel basis into five separate categories: rice paddy, rice lodging, road, ridge, and background (Figure 4). Figure 5 highlights the rice lodging portion of the images in white in a binary map. Additionally, the original UAV image size of 5472 × 3648 pixels could exhaust GPU memory. To cope with GPU memory limitations while maintaining the feature information and spatial resolution, the UAV images were split into 3485 tiles of 480 × 480 pixels. Eighty percent of the samples were randomly selected as the training-validation dataset, within which 75% and 25% of the samples were randomly selected for training and validation, respectively; the remaining 20% of the samples were used as the test dataset. As a result, a total of 2082 images were used for training, 694 for validation, and 709 for testing.
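The tiling and the 80/20 then 75/25 split described above can be sketched as follows (a minimal illustration, not the authors' code; exact tile counts depend on how image borders and multiple flights were handled):

```python
import random

def make_tiles(height, width, tile=480):
    """Enumerate top-left corners of non-overlapping tile x tile windows."""
    return [(r, c)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]

def split_dataset(items, test_frac=0.2, val_frac=0.25, seed=42):
    """Hold out `test_frac` for testing, then split the rest 75/25
    into training and validation, mirroring the paper's scheme."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    test, trainval = items[:n_test], items[n_test:]
    n_val = round(len(trainval) * val_frac)
    return trainval[n_val:], trainval[:n_val], test
```

A single 5472 × 3648 orthomosaic yields 11 × 7 = 77 full tiles under this scheme, so the 3485 tiles imply multiple source images.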

Vegetation Indices
Three vegetation indices (VIs), the Excess Green index (ExG), the Excess Red index (ExR), and the Excess Green minus Excess Red index (ExGR), are calculated from the UAV visible spectrum information and added to the model training and validation process. Together with the RGB information, the three VIs were used to examine their correlations with rice lodging. The formulas of the three VIs can be found in Table 3.
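Under the commonly used chromatic-coordinate definitions (ExG = 2g − r − b, ExR = 1.4r − g, ExGR = ExG − ExR, with r, g, b the channel fractions of R + G + B; the exact forms used here are those in Table 3), the three VIs can be computed per pixel as in this sketch:

```python
import numpy as np

def vegetation_indices(rgb):
    """Compute ExG, ExR, and ExGR from an H x W x 3 RGB image.

    Uses the common chromatic-coordinate definitions; see the
    paper's Table 3 for the forms actually adopted.
    """
    rgb = rgb.astype(np.float64)
    total = rgb.sum(axis=2, keepdims=True)
    total[total == 0] = 1.0            # avoid division by zero on black pixels
    r, g, b = np.moveaxis(rgb / total, 2, 0)
    exg = 2.0 * g - r - b              # Excess Green
    exr = 1.4 * r - g                  # Excess Red
    exgr = exg - exr                   # Excess Green minus Excess Red
    return exg, exr, exgr
```

Stacking one of these index planes onto the RGB channels gives the four-band inputs (e.g., RGB+ExGR) used in the experiments.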

Semantic Segmentation Model Training
Two semantic segmentation models, SegNet and FCN-AlexNet, were utilized in the present study, and their network architectures are shown in Figures 6 and 7, respectively. SegNet has a symmetric network architecture, including an encoder consisting of convolution and pooling layers, a decoder consisting of upsampling and convolution layers, and a final softmax layer. The encoder structure is identical to the 13 convolutional layers of the VGG16 network without the fully connected layers. Following the encoder, the decoder has a structure symmetric to that of the encoder but employs upsampling layers instead of transpose convolutions. The softmax layer normalizes its input vector into a probability distribution. The critical component of SegNet is its recording of max-pooling indices, which makes SegNet effective at precise re-localization of features and reduces the parameters needed for end-to-end training [32]. FCN-AlexNet, a customized model based on the AlexNet architecture, benefits from a deep network and replaces the fully connected layers of AlexNet with a 1 × 1 convolution layer and a 63 × 63 upsampling layer for pixel-wise end-to-end semantic segmentation. Additionally, FCN can accept input images of any size and retain the pixel spatial information of the original input image, which allows it to classify each pixel on the feature map [33].
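The max-pooling index mechanism described for SegNet can be illustrated in isolation: the encoder records where each 2 × 2 maximum came from, and the decoder places values back at exactly those positions instead of learning a transpose convolution. A toy NumPy sketch (illustrative only; real implementations operate on batched multi-channel tensors):

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that records argmax positions, as in SegNet's encoder."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    idx = blocks.argmax(axis=1)
    return blocks.max(axis=1).reshape(h // 2, w // 2), idx

def unpool_with_indices(pooled, idx):
    """SegNet-style unpooling: place each value back at its recorded argmax,
    leaving the other three positions of each 2x2 block at zero."""
    h, w = pooled.shape
    blocks = np.zeros((h * w, 4))
    blocks[np.arange(h * w), idx] = pooled.ravel()
    return blocks.reshape(h, w, 2, 2).transpose(0, 2, 1, 3).reshape(h * 2, w * 2)
```

Because only the integer indices are stored, this re-localization adds no trainable parameters, which is the efficiency advantage noted above.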
In all experiments, following the hyperparameter settings suggested by Kingma and Ba [34], an Adam optimizer with a learning rate of 0.001, β1 = 0.9, and β2 = 0.999 was used. Considering the network structures and GPU memory, a decay of 0.05, a batch size of 24, and 50 epochs were applied. Details of the model training, validation, and testing computing environment can be found in Table 4.
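For reference, one Adam update with the stated hyperparameters (lr = 0.001, β1 = 0.9, β2 = 0.999) looks like the following sketch; in practice the framework's built-in optimizer would be used, and the epsilon value here is the paper's [34] default, an assumption on our part:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba) for parameter `theta` at step t >= 1."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```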

Evaluation Metrics
The performance of the eight proposed models and MLC was evaluated by adopting precision and recall concepts for each category, namely rice paddy, rice lodging, road, ridge, and background. As shown in Table 5, TP stands for true positive, FP represents false positive, TN denotes true negative, and FN means false negative. Fβ measures the balance of precision and recall with a precision weighting coefficient β; when β equals one, it is the so-called F1-score.

Table 5. Evaluation metrics with associated formulas.
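The quantities in Table 5 follow directly from the per-class confusion counts; a minimal sketch (illustrative, not the evaluation code used in the study):

```python
def precision_recall_f(tp, fp, fn, tn=0, beta=1.0):
    """Per-class precision, recall, accuracy, and F_beta from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    b2 = beta ** 2
    f_beta = ((1 + b2) * precision * recall / (b2 * precision + recall)
              if precision + recall else 0.0)   # F1 when beta == 1
    return precision, recall, accuracy, f_beta
```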

Metric: Formula
Precision = TP_c / (TP_c + FP_c)
Recall = TP_c / (TP_c + FN_c)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
F1 = 2 · Precision · Recall / (Precision + Recall)

For a particular category c, precision is defined as the ratio of true positive (TP_c) instances to all positive results, including the true and false positives (FP_c). Recall represents sensitivity, denoted by the fraction of TP_c to the sum of TP_c and false negatives (FN_c). Accuracy is the proportion of correct classifications (TP plus TN) to all results for a particular category, while the overall accuracy (OA) evaluates the percentage of correctly classified samples over all categories. Accordingly, the Fβ score quantifies the balance between precision and recall, and the F1 score treats precision and recall as equally important. The closer the F1 score is to 1, the better the classification performance.

Table 6 details the model validation accuracy in terms of OA and F1-score. In general, both the FCN-AlexNet and SegNet models have higher F1-scores over all categories when using a combination of RGB and vegetation indices than when using RGB information alone. For the rice lodging category, the highest validation accuracy reaches 80.08% and 75.37% for FCN-AlexNet using RGB+ExGR and SegNet using RGB+ExG, respectively. For the rice paddy, bareland, and background categories, all eight models achieve above 90% accuracy, except the SegNet model using RGB+ExG+ExGR, which achieves 88.06% (Figure 8). However, the RGB+ExG+ExGR combination performs worse in both FCN-AlexNet and SegNet than the combination of RGB with either ExG or ExGR alone. For the road category, all models have F1-scores lower than 70%, which can be explained by the limited training samples and by the spectral and shape similarity between the road and the water channels in the fields (Figure 8).

Figure 9 shows visual close-ups of the FCN-AlexNet and SegNet results for five different cases from top to bottom.
The original image, the ground truth, the results of FCN-AlexNet using the four combinations of RGB and vegetation indices, and the results of SegNet using the same four combinations are presented from left to right. As illustrated in Figure 9, FCN-AlexNet performs better on larger patches, while SegNet picks up more segmentation details. However, FCN-AlexNet tends to overestimate at the edges of patches, while SegNet contains more noise. The overestimation by FCN-AlexNet can be explained by the 32× upsampling in the last stage of its network structure. For instance, in the second and fourth cases of Figure 9, the results of SegNet show considerable noise around the lodged rice paddy, while FCN-AlexNet performs well. In the first case of Figure 9, the appearance of lodged rice is not visually distinguishable, so both the FCN-AlexNet and SegNet models perform poorly; however, both models do well in the second case. The middle part of case 3 is a water channel, whose shape and spectral characteristics are similar to those of the road. This similarity hinders the performance of both models and yields a low F1-score. Based on the results of the validation dataset, the highest accuracy of FCN-AlexNet reaches 91.24%, achieved using RGB+ExGR information. The best performance of SegNet reaches 89.70%, obtained with RGB+ExG information. This accuracy improvement again emphasizes the leverage effect of vegetation information on classification accuracy. In short, FCN-AlexNet performs slightly better in overall accuracy (by about 1.54%) and obtains more stable results than SegNet, which is probably due to the simpler transpose convolution structure of FCN-AlexNet.

Testing Data Inference Evaluation
Both the 2017 and 2019 datasets are used to test the performance of the FCN-AlexNet, SegNet, and MLC models. As the focus of this paper, the results for the rice lodging category are highlighted for presentation and discussion. Table 7 and Figure 10 demonstrate the results on the 2017 dataset, while Table 8 and Figure 11 show the results on the 2019 dataset. A histogram matching process was applied to the 2019 dataset to minimize the lighting difference during imaging.
In general, as shown in Figures 10 and 11, FCN-AlexNet identifies rice lodging with high confidence in terms of accuracy and F1-score. The 2017 testing dataset shows better F1-scores than the 2019 testing dataset (Tables 7 and 8, Figures 10 and 11). FCN-AlexNet and SegNet reach higher precision and accuracy than MLC. Notably, FCN-AlexNet achieves an F1-score >82% and accuracy >93% on the 2017 dataset (Table 7), significantly outperforming SegNet and MLC. The highest F1-score of 83.56% and accuracy of 94.43% are achieved by FCN-AlexNet using RGB information. The worst F1-score of 42.99% and accuracy of 85.15% are observed for SegNet using RGB+ExG+ExGR information and MLC using RGB+ExG information, respectively.
Moreover, the recall results show a clear accuracy improvement from adding vegetation information. In the 2017 dataset (Table 7), the recall value of SegNet using RGB information is 69.06%, while the recall value of SegNet using combined RGB and ExG information jumps to 89.64%, a significant improvement of 20.58%. The effects of vegetation information can also be observed in the improvement of the F1-score in Table 7. In the 2019 dataset, the F1-scores of FCN-AlexNet RGB+ExGR and SegNet RGB+ExGR are 78.27% and 68.12%, respectively, much higher than those using only RGB information (56.58% and 53.63%, respectively). Moreover, the traditional MLC classifier requires manual selection of an area of interest (AOI) for the training sample, which becomes a barrier to identification automation; thus, its computation time differs for every individual image. In contrast, the computation time of FCN-AlexNet and SegNet consists of memory operations and image inference, which is a fixed period due to their pixel-wise approach. In short, FCN-AlexNet and SegNet reduce the computation time by 10-15 times compared to the MLC classifier.

Table 7. Results on the 2017 testing dataset for the rice lodging category (the highest value is shown in bold, and the color shading corresponds to the three classifiers in Figures 10 and 11).

Green represents pixels being correctly classified, blue represents pixels with errors of omission, and red represents pixels with commission errors.

As illustrated in Figure 12, FCN-AlexNet shows the most stable correct identification of lodged rice in the middle of the image, with partial commission errors occurring on the left-hand side. In Figure 13, a noticeable area of pixels along the highway shows omission errors. For MLC, a large area with commission errors is detected on the left-hand side of the image in Figure 14. Additionally, the area of commission errors identified by MLC is much larger than those identified by the two deep learning networks.
The additional vegetation information does not improve the F1-score of FCN-AlexNet but does raise the F1-scores of SegNet and MLC. However, adding both vegetation indices together negatively influences rice lodging identification, which may indicate confusion caused by redundant information emphasizing similar features.
As shown in Figure 15 and Table 8, the highest accuracy values of 94.33% and 91.57% are observed for FCN-AlexNet and SegNet, respectively, using RGB+ExGR information on the 2019 testing dataset. The corresponding F1-scores (78.27% and 68.12% for FCN-AlexNet and SegNet, respectively) indicate that FCN-AlexNet has a better balance between precision and recall. In Figure 15d, FCN-AlexNet produces more omission errors over the patches' boundaries, which shows the washed-out effect caused by the downsampling operations in FCN [32]. For the SegNet results in Figure 15e, a large percentage of commission (13.43%) and omission errors (41.41%) relative to the ground truth are detected, which suggests that the sensitivity of SegNet may introduce more noise and confusion when the target objects are more homogeneous.

Figure 16 demonstrates the identification results for the 230-ha area covering most of the lodged paddies in the township. At a glance, both FCN-AlexNet and SegNet produce reasonable classification results on the 2017 dataset. However, the 2019 dataset shows different chromaticity due to the weather conditions at the time of image acquisition [35], which may contribute to the lower identification performance. For instance, the highway in the middle part of the image is not fully captured in the 2019 identification results (Figure 16f,g). Comparing the results of FCN-AlexNet and SegNet on the two datasets, the discrepancy between the two networks is smaller on the 2017 dataset, while a large inconsistency in the rice lodging area is observed between them on the 2019 dataset. Nevertheless, the results obtained for the larger area are very promising, taking into account the broader spatial coverage and the high computation efficiency.

Conclusions
To date, rice lodging assessment still relies heavily on manual visual evaluation, which is time-consuming, labor-intensive, and problematic in terms of its poor efficiency and objectivity. The proposed rice lodging identification method aims to provide an effective and efficient scientific reference for assessing rice lodging. In particular, two deep-learning-based semantic segmentation networks, FCN-AlexNet and SegNet, are implemented with vegetation indices for rice lodging identification in multi-date UAV visible images. As the testing dataset results show, FCN-AlexNet outperforms SegNet and MLC and reaches the highest F1-score of 83.56% and accuracy of 94.43%. The higher F1-score indicates that FCN-AlexNet has a better balance between precision and recall. The additional vegetation index information improves the performance of both networks in terms of F1-score and accuracy. Moreover, implementing FCN-AlexNet and SegNet can reduce the computation time by 10-15 times compared to the traditional MLC. Furthermore, both networks work well on the 230-ha image, which demonstrates great potential for broader-area applications with promising rice lodging identification ability.
The proposed method also has room for improvement, for example by producing more training data through a data augmentation process or by employing alternative network structures such as E-Net or FC-DenseNet [36,37]. Meanwhile, to handle a broad area (up to hundreds of thousands of hectares) of agricultural disaster survey with temporal, spatial, and economic efficiency, parallel computation should be employed for deep-learning model execution in the future. In addition, edge computing techniques with hierarchical image processing on UAV-equipped microcomputers could be applied to the deep-learning model to provide real-time agricultural disaster surveys.