A Deep Learning-Based Approach for Automated Yellow Rust Disease Detection from High-Resolution Hyperspectral UAV Images

Abstract: Yellow rust in winter wheat is a widespread and serious fungal disease, resulting in significant yield losses globally. Effective monitoring and accurate detection of yellow rust are crucial to ensure stable and reliable wheat production and food security. The existing standard methods often rely on manual inspection of disease symptoms in a small crop area by agronomists or trained surveyors. This is costly, time consuming and prone to error due to the subjectivity of surveyors. Recent advances in unmanned aerial vehicles (UAVs) mounted with hyperspectral image sensors have the potential to address these issues with low cost and high efficiency. This work proposed a new deep convolutional neural network (DCNN) based approach for automated crop disease detection using very high spatial resolution hyperspectral images captured with UAVs. The proposed model introduced multiple Inception-Resnet layers for feature extraction and was optimized to establish the most suitable depth and width of the network. Benefiting from the ability of convolution layers to handle three-dimensional data, the model used both spatial and spectral information for yellow rust detection. The model was calibrated with hyperspectral imagery collected by UAVs on five different dates across a whole crop cycle over a well-controlled field experiment with healthy and rust-infected wheat plots. Its performance was compared across sampling dates and with random forest, a representative of traditional classification methods in which only spectral information was used. It was found that the method has high performance across the whole growing cycle, particularly at late stages of the disease spread. The overall accuracy of the proposed model (0.85) was higher than that of the random forest classifier (0.77). These results showed that combining both spectral and spatial information is a suitable approach to improving the accuracy of crop disease detection with high resolution UAV hyperspectral images.


Introduction
Yellow rust, caused by Puccinia striiformis f. sp. tritici (Pst), is a devastating foliar disease of wheat occurring in temperate climates across major wheat growing regions worldwide [1,2]. It is one of the [...]
Hyperspectral sensors are usually mounted on hand-held devices that can be used to obtain the spectrum at the leaf/canopy scale. With the development of technologies in unmanned aerial vehicles (UAVs) and hyperspectral sensors [22,31-33], hyperspectral sensors can now be mounted on UAVs, which allows the crop to be monitored on a large scale from a certain height above wheat fields. Compared to hand-held or ground-based devices, a hyperspectral sensor on a UAV can acquire data that are both spatially and spectrally continuous, represented in three dimensions by adding spatial information. Spatial information has been proven to be a very important feature for object recognition with remote sensing imagery [34,35]. Focusing on hyperspectral data classification for different applications, several studies have shown significant improvement in the performance of classification algorithms that use both spectral and spatial information. Among them, deep convolutional neural network (DCNN)-based approaches using convolution layers to deal with joint spatial-spectral information achieved high performance [36]. However, existing studies based on deep learning approaches usually worked on low spatial resolution images with a small region of neighboring pixels (3 × 3, 5 × 5 or 7 × 7) as model input [36-38]. Such small neighbor regions may not be wide enough to describe the context and texture features of objects in high spatial resolution images captured by UAVs, whose resolutions vary from 0.01 m to 0.1 m depending on the flight altitude. Moreover, high spatial resolution may lead to an increase in intraclass variation and a decrease in interclass variation, causing great difficulty in pixel classification [39].
Therefore, we expect that a DCNN-based deep learning approach with a suitable larger region of neighbouring pixels as input can be a major improvement for the classification of high spectral and spatial resolution imagery.
In this paper, we proposed a new DCNN-based deep learning method for automated detection of yellow rust from hyperspectral images with high spatial resolution. The new DCNN structured approach handled the joint spatial-spectral information extracted from high-resolution hyperspectral images and introduced multiple Inception-Resnet layers for deep feature extraction. We tested the proposed model against a comprehensive dataset acquired from winter wheat fields under a controlled field experiment across a whole crop cycle. Finally, the performance of the DCNN model was compared with a random forest-based classifier, a representative of traditional spectral-based classification methods. The remaining part of this paper is organized as follows: Section 2 describes the study area, data and methods; Section 3 presents the results; Section 4 provides the discussion; and finally, Section 5 summarizes this work and highlights future work.

Study Area
Four controlled wheat plots at the Scientist Research and Experiment Station of China Academy of Agricultural Science in Langfang, Hebei Province, China (39°30'40"N, 116°36'20"E) were selected as the study area (Figure 1). Each of the four plots occupied about 220 m²; two of them were infected with yellow rust and the other two remained uninfected as healthy wheat. The average temperature during the wheat growing period was between 5 °C and 24 °C, corresponding to a suitable environment for the occurrence of yellow rust [7].
A DJI S1000 UAV system (SZ DJI Technology Co Ltd., Guangdong, China) [40] with a snapshot hyperspectral sensor was used for data acquisition. The hyperspectral sensor was a UHD 185 Firefly (Cubert GmbH, Ulm, Baden-Württemberg, Germany), which records reflected radiation from the visible to the near-infrared bands between 450 and 950 nm. The spectral resolution was 4 nm. Raw data were recorded as a 1000 × 1000 px panchromatic image and a 50 × 50 px hyperspectral image with 125 bands. After data fusion with Cubert software [41], the output was a 1000 × 1000 px image with 125 bands, and the image was also mosaicked and orthorectified. In this work, all the images were obtained at a flight height of 30 m, with a spatial resolution close to 2 cm per pixel. The data covering all four plots were around 16,279 × 14,762 px in size with 125 bands.
Hyperspectral images were labelled for each pixel based on their corresponding plots and the normalized difference vegetation index (NDVI) [29]. The NDVI, which is calculated from the reflectance in the near-infrared and red bands (Equation (1)), is a standardized way to assess whether an observed pixel is vegetation or not:
NDVI = (NIR − Red) / (NIR + Red) (1)
where NIR and Red are the reflectances in the near-infrared and red bands, respectively. In general, NDVI values from 0.3 to 1.0 were considered vegetation; otherwise the pixel was considered bare soil or water. In this case, a pixel in the rust or healthy plots with an NDVI value greater than 0.3 was labelled as rust or healthy, respectively; otherwise it was labelled as other.
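The per-pixel labelling rule can be illustrated with a short sketch based on the conventional NDVI definition. The function names, the plot-type strings and the example reflectance values are illustrative assumptions, not part of the original processing chain, and the choice of which of the 125 bands serve as "red" and "near-infrared" is left open here.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from two reflectance values."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: treat an all-zero pixel as non-vegetation
    return (nir - red) / total


def label_pixel(plot_type, nir, red, threshold=0.3):
    """Label a pixel as 'rust', 'healthy' or 'other'.

    plot_type is the class of the plot containing the pixel
    ('rust' or 'healthy'); anything else is treated as 'other'.
    """
    if plot_type in ("rust", "healthy") and ndvi(nir, red) > threshold:
        return plot_type
    return "other"


# Hypothetical reflectance values, for illustration only.
print(label_pixel("rust", nir=0.45, red=0.05))     # vegetated pixel in a rust plot
print(label_pixel("healthy", nir=0.20, red=0.15))  # low NDVI, falls back to 'other'
```

The threshold is kept as a parameter because 0.3 is the value reported in the text, but other sensors or growth stages might call for a different cut-off.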

Methods
The aim of this work was to detect rust areas based on joint spectral and spatial information. This is a typical classification task, i.e., classifying a 3D hyperspectral block into one of three classes: rust, healthy or other. In this study, we proposed a DCNN-based approach for this aim, in which a new DCNN architecture was constructed, as detailed in Section 2.2.2. As shown in Figure 2, it includes four major steps: (1) data preprocessing, where 3D data blocks are extracted from the original data with a sliding window method; (2) feature extraction and classification, where the segmented 3D data blocks from the first step are fed to the proposed DCNN model; (3) post processing, where a rust disease map is generated based on the mapping and aggregation of each predicted image block; and (4) result output and visualization.

Data Preprocessing
The sliding-window method [42] was used to extract spatial and spectral information from hyperspectral imagery. The sliding-window method is an exhaustive search image segmentation algorithm that moves a window with a fixed size at a fixed interval across an image. It was first used in object detection [43], and later used to extract spatial and spectral information for remote sensing classification [36]. With the sliding-window segmentation, 3D data blocks were extracted from the original hyperspectral imagery and then fed into the proposed DCNN model. Normally, the input sizes of a DCNN classification model vary from 224 × 224 px to 299 × 299 px due to GPU RAM limitations [44,45]. In this work, the hyperspectral imagery had 125 bands, so we had to adjust the size of these blocks to fit the GPU memory. We chose 64 × 64 × 125 as the input size of the DCNN model. To train the DCNN model, these blocks were labelled with one of three classes based on the plots they belong to: (i) rust area class, (ii) healthy area class, or (iii) other class (including bare soil and road, labelled by the average vegetation index of each block) (see Figure 3).
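The block extraction described above can be sketched in a few lines. This is a minimal illustration on nested lists rather than the original implementation: the function names are hypothetical, and the stride is an assumption because the interval used in the study is not stated in the text.

```python
def window_origins(height, width, window=64, stride=64):
    """Yield (row, col) top-left corners of every full window in an image."""
    for row in range(0, height - window + 1, stride):
        for col in range(0, width - window + 1, stride):
            yield row, col


def extract_block(cube, row, col, window=64):
    """Cut a window x window x bands block from a cube[row][col][band]."""
    return [line[col:col + window] for line in cube[row:row + window]]


# Toy cube: 6 x 8 pixels with 3 bands, cut into 4 x 4 windows with stride 2.
cube = [[[r, c, r + c] for c in range(8)] for r in range(6)]
origins = list(window_origins(6, 8, window=4, stride=2))
blocks = [extract_block(cube, r, c, window=4) for r, c in origins]

print(origins)
print(len(blocks), len(blocks[0]), len(blocks[0][0]), len(blocks[0][0][0]))
```

With the default arguments the same functions would produce non-overlapping 64 × 64 blocks that keep all bands of each pixel; a smaller stride yields overlapping blocks and therefore more samples.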


Feature Extraction and Classification
Feature extraction and classification were performed on the 3D blocks extracted in the data preprocessing step with a new DCNN architecture. Figure 4 shows the architecture of the proposed DCNN model. It includes multiple Inception-Resnet blocks, which combine and optimise two well-known architectures, Inception [46] and Resnet [44], for deep feature extraction. The number of Inception-Resnet blocks is used to control the depth of the model. After deep feature extraction with Inception-Resnet blocks, an average pooling layer and a fully connected layer are used to transform the feature maps into a three-class classifier: rust, healthy and other.
The model design rationale for combining these two architectures included: (1) The Resnet block was designed to build a deep model as thin as possible in favour of increasing its depth and having fewer parameters for performance enhancement. Existing works [44] have shown that residual learning can ease the problem of vanishing/exploding gradients when a network goes deeper. (2) Since the width and kernel size of a filter also influence the performance of a DCNN model, an Inception structure with multiple kernel sizes [46] was selected to address this issue.
The detailed architecture of an Inception-Resnet block is shown in Figure 5d. It takes advantage of the convolution layer (Conv) (see Figure 5a), Resnet (see Figure 5b) and Inception (see Figure 5c). The Conv used here is a basic convolution layer [47] and its structure is shown in Figure 5a. This Conv layer structure begins with a 2D convolutional (Conv2d) layer, followed by a rectified linear unit (ReLU) layer [48] and a 2D batch normalization (BatchNorm2d) layer [49]. Using multiple Conv layers has proved quite successful in improving the performance of classification models [50]. However, as the number of layers increases, the number of parameters to be learned also increases dramatically, which may lead to exploding gradients in the training stage. The Resnet block [46] was designed to ease the exploding gradient problem. As shown in Figure 5b, a basic Resnet block adds a 1 × 1 convolution layer before and after a 3 × 3 convolution layer to reduce the number of connections (parameters) without degrading the performance of the network too much. Furthermore, a shortcut connection is added to link the input with the output, so that the Resnet block learns the residual of the input. Inception [51] was designed to improve the utilization of computing resources inside a network and to increase both the depth and width without running into computational difficulties. As shown in Figure 5c, an Inception block performs convolution with three different sizes of filters (1 × 1, 3 × 3, 5 × 5) to increase the network width. To decrease the number of training parameters so that more layers can be trained in one model, an extra 1 × 1 convolution is added to reduce the dimension of the input before the 3 × 3 and 5 × 5 convolutions. The Inception-Resnet block [45] was designed to take advantage of both Resnet and Inception blocks. As illustrated in Figure 5d, this block merges an Inception unit at the top and a shortcut connection in a Resnet block by concatenation. The 3 × 3 convolution layer in the Resnet block is replaced by the 3 × 3 and 5 × 5 convolution layers of the Inception block. A 1 × 1 convolution layer is added immediately after the multiscale convolution layers, which is used to control the number of trained parameters and output channels.
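The effect of the 1 × 1 convolutions on the number of parameters can be checked with a quick count. The channel widths used here (256 and 64) are assumptions chosen for illustration, not the widths of the proposed model, and the multiscale layout is a simplified reading of the description of Figure 5d; bias and batch normalization parameters are ignored.

```python
def conv_weights(in_channels, out_channels, kernel):
    """Number of weights in a 2D convolution (bias terms ignored)."""
    return in_channels * out_channels * kernel * kernel


# Assumed channel widths, for illustration only.
wide, narrow = 256, 64

# A plain 3 x 3 convolution at full width.
plain = conv_weights(wide, wide, 3)

# Resnet-style bottleneck: 1 x 1 reduce, 3 x 3 at reduced width, 1 x 1 restore.
bottleneck = (conv_weights(wide, narrow, 1)
              + conv_weights(narrow, narrow, 3)
              + conv_weights(narrow, wide, 1))

# Multiscale variant: parallel 3 x 3 and 5 x 5 branches, each behind its own
# 1 x 1 reduction, concatenated and mapped back to full width by a 1 x 1 layer.
multiscale = (2 * conv_weights(wide, narrow, 1)
              + conv_weights(narrow, narrow, 3)
              + conv_weights(narrow, narrow, 5)
              + conv_weights(2 * narrow, wide, 1))

print(plain, bottleneck, multiscale)
```

Under these assumed widths, the multiscale block remains considerably smaller than a single full-width 3 × 3 convolution even though it adds a 5 × 5 branch, which is the trade-off the 1 × 1 layers are meant to enable.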

Post Processing and Visualization
After training the proposed DCNN model with 3D hyperspectral image blocks extracted in the data preprocessing step, the trained model was used for yellow rust detection with full-hyperspectral images. Each image was divided into blocks with a size of 64 × 64 by using the sliding window method. Then the blocks were identified by the trained model and the predicted rust infected blocks were mapped based on their locations in the original data for visualization.
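The mapping step can be sketched as painting each block predicted as rust back onto a pixel mask at its original location. This is a minimal sketch assuming non-overlapping blocks; the function name and the dictionary layout of the predictions are hypothetical, and how overlapping predictions would be aggregated is not specified in the text.

```python
def build_rust_mask(height, width, predictions, window=64):
    """Paint predicted rust blocks onto a binary pixel mask.

    predictions maps the (row, col) top-left corner of each block to its
    predicted class name.
    """
    mask = [[0] * width for _ in range(height)]
    for (row, col), label in predictions.items():
        if label != "rust":
            continue
        for r in range(row, min(row + window, height)):
            for c in range(col, min(col + window, width)):
                mask[r][c] = 1
    return mask


# Toy example: a 4 x 6 image tiled by 2 x 2 blocks.
predictions = {(0, 0): "rust", (0, 2): "healthy", (0, 4): "other",
               (2, 0): "healthy", (2, 2): "rust", (2, 4): "rust"}
mask = build_rust_mask(4, 6, predictions, window=2)
for line in mask:
    print(line)
```

The resulting mask can then be overlaid on the original image for visualization.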

Experimental Design
To evaluate the proposed approach, a series of experiments were conducted, focusing on the following three aspects: (1) the sensitivity of the DCNN model to the depth and width of the network; (2) a comparison between a representative of traditional spectral-based machine learning classification methods and the proposed DCNN method based on joint spatial-spectral information; and (3) the accuracy of the model for yellow rust detection in different observation periods across the whole growing season.
To investigate the effect of the depth and width of the network on the classification accuracy, we firstly changed the number of Inception-Resnet blocks in the proposed model to control the depth of the model. Then, we compared a model with multiple Resnet blocks and a model with multiple Inception-Resnet blocks for evaluating the effect of the network width. The configurations of convolution layers for Resnet block and Inception-Resnet blocks have been presented in Figure 5b,d, respectively. An Inception-Resnet block is wider than a Resnet block in terms of feature spaces extracted with multiscale convolution kernels. For each configuration, we trained ten times and used the model showing the best accuracy.
To investigate the effect of joint spatial-spectral information on yellow rust detection, we compared a representative of traditional machine learning classification methods, which considers only the spectral information in the datasets, with the proposed DCNN-based model, which considers both spatial and spectral information. Here, one of the most popular traditional machine learning methods, random forest [52], was chosen as the representative. In this work, the random forest model used the central pixel value of each block and took a 125-dimensional vector as input, while the proposed DCNN model used all the values of each block with a size of 64 × 64 × 125 as input. After training, the performance of both models on yellow rust detection was evaluated on the test datasets.
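The difference between the two inputs can be made concrete with a small sketch. The helper name is hypothetical, and because a 64 × 64 block has no single exact centre, taking the pixel at index (rows // 2, cols // 2) is an assumption about what "central pixel" means.

```python
def central_spectrum(block):
    """Return the spectrum of the central pixel of a block[row][col][band]."""
    rows, cols = len(block), len(block[0])
    return block[rows // 2][cols // 2]


window, bands = 64, 125
dcnn_input_size = window * window * bands  # values per block seen by the DCNN
forest_input_size = bands                  # values per block seen by the random forest
print(dcnn_input_size, forest_input_size)

# Toy 4 x 4 block with 2 bands; the first band encodes the pixel position.
toy_block = [[[10 * r + c, 0.5] for c in range(4)] for r in range(4)]
print(central_spectrum(toy_block))
```

The random forest therefore sees one spectrum per sample, whereas the DCNN also sees how the spectra of the surrounding pixels are arranged.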
Timeliness and accuracy are the two most important indicators for crop disease monitoring. Detecting the disease at an early stage gives farmers time to act and reduce losses. Therefore, we also tested the performance of the proposed model on yellow rust detection in different observation periods during the whole growing season.

Training Network
In this work, we extracted a total of 15,000 blocks with a size of 64 × 64 × 125 from five hyperspectral images covering the whole growing season of winter wheat through the sliding window method. A total of 10,000 of these blocks were randomly chosen for training and validation (80% for training and the rest for validation), and the remaining 5000 blocks were used as test data for evaluating the performance of the proposed network. To prevent overfitting due to the limited supply of data and to improve the model's generalization, data augmentation with small random transformations (rotation, flipping and mirroring) was applied to the blocks in each epoch. Adam [53], a stochastic optimization algorithm, was used with a batch size of 64 samples to train the proposed network. We initially set the base learning rate to 1 × 10⁻³. The learning rate was then decreased to 1 × 10⁻⁶ as the number of iterations increased. CrossEntropy, which is commonly used for multi-class classification and combines LogSoftmax and the negative log likelihood loss (NLLLoss) [54], was selected as the loss function for this task. All the experiments were implemented in PyTorch 1.0 (Paszke et al., 2017) and executed on a PC with an Intel(R) Xeon(R) CPU E5-2650, an NVIDIA TITAN X (Pascal) GPU and 64 GB of memory.
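The statement that cross-entropy combines LogSoftmax and NLLLoss can be illustrated for a single sample with standard-library Python. This is a sketch of the underlying arithmetic, not the PyTorch implementation; the class order and the example scores are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of raw class scores."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs, target):
    """Negative log likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross-entropy as LogSoftmax followed by NLLLoss."""
    return nll_loss(log_softmax(logits), target)


# Three classes in an assumed order: 0 = rust, 1 = healthy, 2 = other.
logits = [2.0, 0.5, -1.0]
print(cross_entropy(logits, target=0))  # small loss: the top score is the target
print(cross_entropy(logits, target=2))  # larger loss: the target has the lowest score
```

Subtracting the largest score before exponentiating leaves the result unchanged but avoids overflow for large scores.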

Performance Metrics
To evaluate the classification performance of the proposed architecture, overall accuracy, recall, precision and F1 scores were selected as the accuracy performance metrics. The overall accuracy is the ratio of the total number of correctly classified samples to the total number of samples of all classes. In this study, the samples are blocks extracted from hyperspectral images. Recall, precision and F1 scores can be calculated from the true positives (TP), the true negatives (TN), the false positives (FP) and the false negatives (FN). In their standard form, the metrics are calculated as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
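These metrics can be computed per class from a confusion matrix, as in the following sketch. The function names are hypothetical and the counts are made up purely for illustration; they are not results from this study.

```python
def per_class_metrics(confusion, index):
    """Precision, recall and F1 for one class of a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    tp = confusion[index][index]
    fn = sum(confusion[index]) - tp
    fp = sum(row[index] for row in confusion) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def overall_accuracy(confusion):
    """Share of all samples that lie on the diagonal of the matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Made-up counts for illustration only (rows: true class, columns: predicted).
classes = ["rust", "healthy", "other"]
confusion = [[80, 15, 5],
             [20, 75, 5],
             [0, 10, 90]]

print("overall accuracy:", overall_accuracy(confusion))
for i, name in enumerate(classes):
    print(name, per_class_metrics(confusion, i))
```

In a multi-class setting each class is treated in turn as the positive class, so TP, FP and FN are read from the row and column of that class.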

Results
As described in Section 2.3.2, we randomly selected 10,000 blocks with a size of 64 × 64 × 125 extracted from five hyperspectral images as training datasets (80% for training and 20% for validation) for model training, and the remaining 5000 blocks as test datasets for evaluating the performance of the models.
Figure 6 shows a comparison of accuracy with different numbers of Inception-Resnet blocks in the proposed model. It can be observed that there is no further obvious improvement in accuracy after the number of Inception-Resnet blocks reaches 4. Therefore, four Inception-Resnet blocks were chosen in our proposed model.
Figure 8 provides the accuracy and confusion matrices for two models. One is random forest, a representative of spectral-based traditional machine learning classification methods, and the other is our model. The random forest model achieves an accuracy of 0.77 while the proposed model achieves an accuracy of 0.85. The performance of the proposed model, which considers joint spatial-spectral information, is better than that of the random forest model, which considers only spectral information.
Table 1 lists the classification results across different periods. All metrics for class "Other" are much higher (>0.95) than those for class "Rust" and class "Healthy". Over 85% of the rust area was detected in the datasets collected on 15 May 2018 and 18 May 2018 (the recall rates of the rust class reach 0.86 and 0.85, respectively).

Discussion
In this paper, we proposed a new DCNN-based approach for automated yellow rust detection, which could exploit both spatial and spectral information of very high-resolution hyperspectral images captured with UAVs. Since the depth, width and filter size of a DCNN-based network [44,50,55,56] could affect its performance, we introduced multiple Inception-Resnet layers to consider all three factors in the proposed neural network architecture. To ensure the accuracy and the computing efficiency, the effects of the depth, width and filter size on network performance were investigated based on a series of experiments. The results showed that there was no further obvious improvement in accuracy after the model depth (i.e., the number of Inception-Resnet layers) reached 4. We also found that after the Inception-Resnet layers were replaced with Resnet layers in our model, that is, after reducing the width and the variety in filter size, the model performance was reduced. This demonstrated that increasing the network width and using multi-scale filters could improve the classification performance on high-resolution hyperspectral imagery, which was consistent with previous studies [45,46,51,57,58].
Previous studies have shown significant improvement in performance by using joint spatial-spectral information for plant disease detection [35–38,59]. To investigate how yellow rust detection could benefit from the joint spatial-spectral information of high-resolution UAV hyperspectral imagery, we compared our model with random forest, a representative of traditional machine learning methods that consider only spectral information. An accuracy of 0.85 was achieved for our model versus 0.77 for the random forest classifier. To understand why using joint spatial-spectral information was better than using spectral information alone, we analysed the spectral profiles of the hyperspectral images. Figure 10 illustrates the spectra of 10,000 pixels randomly chosen from each of the rust, healthy and other areas of the hyperspectral images. We can observe that the spectral profiles of hyperspectral data captured with UAVs are highly variable. Therefore, it would be difficult to identify rust and healthy fields from spectral information alone. However, a high-resolution hyperspectral image also contains crucial spatial information, which is a very important feature for object recognition in remote sensing images [34,35]. To visually display the benefit of joint spatial-spectral information for rust detection, we also compared the mapping results obtained from our DCNN model and the random forest classifier. Figure 11 shows the rust detection mapping results of two plots from the two models. The detection results of the rust-infected areas are overlaid on the original images. The two images were captured on 18 May 2018, one from a wheat plot with rust disease (see the image in the first row of Figure 11a) and the other from a healthy wheat plot (see the image in the second row of Figure 11a).
The accuracy of rust detection on the rust plot was 0.85 for our DCNN model and 0.77 for the random forest classifier, and the mapping results of the two models were similar (see the images in the first row of Figure 11b,c). The accuracy of rust detection on the healthy wheat plot was 0.86 for our DCNN model and 0.71 for the random forest classifier. A total of 29% of the area in the image of the healthy wheat plot (see the image in the second row of Figure 11b) is misclassified as rust-infected by the random forest classifier, owing to the higher spectral variance in healthy wheat regions (see Figure 10). In contrast, our DCNN model (see the image in the second row of Figure 11c) misclassifies much less of the plot than the random forest classifier (see the image in the second row of Figure 11b). Overall, benefiting from the joint spatial-spectral information, our DCNN model performed better on yellow rust detection than the random forest classifier. This further confirmed that using joint spatial-spectral information could potentially improve the accuracy of yellow rust detection from high-resolution hyperspectral images [37,38,59].
Previous studies [60,61] showed that the detection accuracy of yellow rust at the leaf scale could reach around 0.88. At the field scale, not all the leaves in infected fields had yellow rust; hence, the accuracy of labelling pixels that represent healthy leaves in infected fields was limited. This may partially explain why the accuracy at the field scale from our model (0.85) was slightly lower than the accuracy at the leaf scale reported before.
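The difference between the two kinds of input compared in this discussion can be sketched as follows: a spectral-only classifier such as random forest sees one spectrum per pixel, whereas a convolutional model can be fed a small three-dimensional block that also contains the neighbouring pixels. The snippet is a minimal illustration only; the cube layout (`cube[row][col][band]`), the patch size and the edge handling are assumptions, not details taken from the proposed model.

```python
def pixel_spectrum(cube, row, col):
    """Spectral-only input: the band values of a single pixel."""
    return list(cube[row][col])


def spatial_spectral_patch(cube, row, col, size=3):
    """Joint spatial-spectral input: a size x size x bands block centred
    on (row, col). Pixels outside the image are replaced by the nearest
    edge pixel (an assumed border policy)."""
    half = size // 2
    n_rows, n_cols = len(cube), len(cube[0])
    patch = []
    for r in range(row - half, row + half + 1):
        rr = min(max(r, 0), n_rows - 1)  # clamp row index to the image
        patch_row = []
        for c in range(col - half, col + half + 1):
            cc = min(max(c, 0), n_cols - 1)  # clamp column index
            patch_row.append(list(cube[rr][cc]))
        patch.append(patch_row)
    return patch


# Hypothetical 4 x 5 image with 2 bands; values encode their own position.
cube = [[[10 * r + c, 100 + 10 * r + c] for c in range(5)] for r in range(4)]

spectrum = pixel_spectrum(cube, 1, 2)
patch = spatial_spectral_patch(cube, 1, 2, size=3)
print(len(spectrum), len(patch), len(patch[0]), len(patch[0][0]))
```

The centre of the patch is the same spectrum that the spectral-only classifier receives; the surrounding pixels are the additional spatial context available to a model that operates on the whole block.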

Conclusions
In this work, we have proposed a deep convolutional neural network (DCNN)-based approach for automated detection of yellow rust in winter wheat fields from UAV hyperspectral images. We have designed a new DCNN model by introducing multiple Inception-Resnet layers for deep feature extraction, and the model was optimized to establish the most suitable depth and width. Benefiting from the ability of convolution layers to handle three-dimensional data, the model can use both spatial and spectral information for yellow rust detection. The model has been validated with real ground truth data and compared with random forest, a representative of traditional spectral-based machine learning classification methods. The experimental results have demonstrated that combining spectral and spatial information can significantly improve the accuracy of yellow rust detection on very high spatial resolution hyperspectral images across all growth stages of winter wheat. This study further confirmed that the proposed deep learning architecture has potential for crop disease detection. Future work will validate the proposed model on more UAV hyperspectral image datasets covering various crop fields and different types of crop diseases. In addition, new dimensionality reduction algorithms for large hyperspectral images will be developed for efficient data analysis.