Wheat Lodging Detection from UAS Imagery Using Machine Learning Algorithms

The current mainstream approach of using manual measurements and visual inspections for crop lodging detection is inefficient, time-consuming, and subjective. An innovative method for wheat lodging detection that can overcome or alleviate these shortcomings would be welcomed. This study proposed a systematic approach for wheat lodging detection in research plots (372 experimental plots), which consisted of using unmanned aerial systems (UAS) for aerial imagery acquisition, manual field evaluation, and machine learning algorithms to detect whether lodging occurred. UAS imagery was collected on three different dates (23 and 30 July 2019, and 8 August 2019) after lodging occurred. Traditional machine learning and deep learning were evaluated and compared in this study in terms of classification accuracy and standard deviation. For traditional machine learning, five types of features (i.e., gray level co-occurrence matrix, local binary pattern, Gabor, intensity, and Hu-moment) were extracted and fed into three traditional machine learning algorithms (i.e., random forest (RF), neural network (NN), and support vector machine (SVM)) for detecting lodged plots. For the dataset of each imagery collection date, the accuracies of the three algorithms were not significantly different from each other. For each of the three algorithms, accuracies were lowest on the first date dataset and highest on the last. Incorporating the standard deviation as a measure of performance robustness, RF was determined to be the most satisfactory. Regarding deep learning, three different convolutional neural networks (a simple convolutional neural network, VGG-16, and GoogLeNet) were tested. For every single-date dataset, GoogLeNet consistently outperformed the other two methods.
Further comparisons between RF and GoogLeNet demonstrated that the detection accuracies of the two methods were not significantly different from each other (p > 0.05); hence, choosing either of the two would not affect the final detection accuracies. However, considering that the average accuracy of GoogLeNet (93%) was higher than that of RF (91%), GoogLeNet is recommended for wheat lodging detection. This research demonstrated that UAS RGB imagery, coupled with the GoogLeNet machine learning algorithm, can be a novel, reliable, objective, simple, low-cost, and effective (accuracy > 90%) tool for wheat lodging detection.


Introduction
Ranked as one of the top three staple food crops worldwide, wheat is a major source of starch and energy, as well as of essential and beneficial components to health, such as B vitamins, dietary fiber, and

Test Fields
The experimental field used in this study was located near Thompson, ND (UTM WGS 84 14 N), US, which consisted of 372 plots, as shown in Figure 2. The plots belonged to different research trials and their dimensions were as follows: 1.5 m × 3.6 m (51 rows × 4 columns = 204 plots, ID1), 1.5 m × 5.4 m (10 rows × 12 columns = 120 plots, ID2), and 1.5 m × 14.6 m (12 rows × 4 columns = 48 plots, ID3). The field was planted on 15 May 2019, at a rate of ~100 kg/ha, with a row spacing of 0.19 m. Immediately after planting, eight ground control points (GCP) were installed in the field, as shown in Figure 2.


Data Collection
Following seed germination (one week after sowing), crop growth conditions were monitored on a weekly basis. The first symptoms of wheat lodging, caused by heavy rain and strong winds, were noticed only after mid-July 2019. Data collection was then started using a DJI Phantom 4D RTK UAS (DJI-Innovations, Inc., ShenZhen, China). The UAS is outfitted with a 20-megapixel (5472 × 3648 pixels) color camera mounted on a three-axis stabilization gimbal. This study did not use a multispectral camera for data collection, due to its high cost and our focus on developing an affordable technology for farmers. DJI's ground station app (DJI GS RTK, V2.1.1) was used to set up the flight mission map and parameters. Instead of 3D photogrammetry, 2D photogrammetry was applied for setting up the mission because the experimental plots were flat. The UAS was flown at 25 m AGL (image resolution of ~0.7 cm/pixel), and the speed was set to 2.5 m/s. The following settings were applied: shooting mode was set to "Timed Shooting"; photo ratio was 3:2; white balance mode was "sunny"; gimbal angle was −90° (nadir position); both side and forward overlap were set at 80%; the margin setting was kept in "Auto" mode. Three flights were carried out on 23 July 2019, 30 July 2019, and 8 August 2019. The georeferenced images were stored on an SD card during flight and transferred to a desktop computer back in the office for processing. After the UAS imagery collection, a group of inspectors visited the field and manually classified each plot as lodging or non-lodging based on visual observations, and the results were recorded, as shown in Figure 3. If the wheat stems were permanently off their original vertical position, they were judged as lodged crops.


Image Preprocessing
After each UAS flight, images were stitched together using PIX4DMapper (Pix4D V4.3.33, S.A., Prilly, Switzerland) to generate an orthomosaic map. Eight GCPs established in the field, shown in Figure 2, were used as references to overlap the orthomosaic maps of the three dates, using the first date's imagery as a reference for the other two dates. The plot image datasets (372 plot images) for each date were created by manually cropping the individual plot images, and the recorded results (lodged or not-lodged plot) for each plot were organized to be associated with the individual image. All three data collections occurred in the morning between 9:00 and 11:00 am under sunny weather conditions, and each flight mission lasted for about 12–15 min. The consistent illumination conditions (measured by an MQ-200, Apogee Instruments, Logan, UT, USA) during the flight missions eliminated the need for calibration among the datasets of different dates. Image classification using RF, NN, and SVM requires individual images to be represented by a number of discriminative features [33]. Domain knowledge suggested that textural features (second-order statistics of an image domain) should be proper indicators for classifying lodging and non-lodging plots [20,21].
The right photo in Figure 3 shows two lodging and two non-lodging plots. As visible in Figure 3, sample lodging and non-lodging plot images exhibit easily observable differences in color and textural characteristics. A variety of image characteristics could be used in machine learning algorithms for the accurate and efficient distinction of lodging from non-lodging. Color features may capture differences such as leaves being darker green and stems being lighter green [41]. Texture features could also be desirable indicators for distinguishing lodging from non-lodging plots [21]. Compared to the lodging plots, which in Figure 3 show non-uniform and heterogeneous patterns, the non-lodging plots have a more uniform and homogeneous appearance. Considering that color characteristics are highly related to a variety of factors, such as growth stage, crop variety, and soil nutrients, textural features tend to be more consistent. Therefore, five types of image texture features, namely, Haralick, local binary pattern (LBP), Gabor, intensity, and Hu-moment, were extracted and used in the algorithms. Haralick features (measurements of the variation in the intensity of an image) were calculated from a gray level co-occurrence matrix (GLCM) as the mean, sum, variance, and standard deviation of different textural measures [41]. A total of 88 Haralick textural features (22 features in four directions) were extracted [42,43]. For LBP, 59 features (labeling the pixels of an image by thresholding the neighborhood of each pixel) were extracted by combining one non-uniform LBP feature and 58 uniform LBP features. LBP uniformity was determined by the occurrence of at most two bitwise transitions when circularly sampled by a 3 × 3 filter [44].
For the Gabor features (analyzing if there are any specific frequency contents in the image in specific directions), each image was filtered with a bank of Gabor filters generated with four dilations and four rotations as a spatial mask of a 4 × 4 pixel square [45]. In total, 160 Gabor features were extracted from one plot image. Additionally, six basic intensity features of the individual image were extracted, including mean, standard deviation, kurtosis, skewness, average gradient, and Laplacian mean [46]. Furthermore, seven invariant descriptors for an image were extracted as Hu-moment features [47]. After all of the features were extracted, they were concatenated in tandem for every plot image to form a matrix of features (also referred to as master dataset), and by doing so, one image was represented by 320 features.

Classification
Image classification was conducted separately on the three date datasets. Each date dataset was partitioned through a random selection process (without replacement) to obtain a training and a test dataset. The training dataset consisted of 70% (260 images) of the instances from the master dataset, and the test dataset consisted of the remaining 30% (112 images).
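The 70/30 partition described above can be sketched as follows (a NumPy illustration; the study performed this step in MATLAB, and the random seed here is an assumption for reproducibility):

```python
import numpy as np

n = 372                                  # plot images in one date dataset
rng = np.random.default_rng(0)           # assumed seed, for reproducibility
idx = rng.permutation(n)                 # random selection without replacement
n_train = round(0.7 * n)                 # 260 training images
train_idx, test_idx = idx[:n_train], idx[n_train:]   # 260 / 112 split
```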
The RF is an ensemble learning algorithm that predicts the class label of an unlabeled instance by aggregating the results from multiple decision trees [48]. Unlike a single decision tree-based classification model, in which the complete training dataset is used to determine the decision rules, RF builds multiple decision trees from subsets of the training dataset chosen via bootstrapping (i.e., random sampling with replacement), as shown in Figure 4. Details of the decision tree algorithms are not provided in this paper, but can be found in other publications [34,49]. Decision trees built in the RF use only a randomly drawn subset of features to split a node, instead of all the features [50]. The number of features in the subset is approximately chosen to be √m, where m = 320 is the number of features in this study. Each tree of the RF then provides the outcome (i.e., class label) of the unlabeled instance [49]. Because each tree of the RF may produce a different outcome, a majority vote over all outcomes determines the final class label [49], as shown in Figure 4.
The prediction accuracy of RF depends on the number of decision trees used to train the model. To assess the performance of the RF, the out-of-bag error (OOBE) is typically evaluated with respect to the number of decision trees. Performance in this study refers to the ability of the model to predict correct class labels. The OOBE is the percentage of misclassified class labels on the data instances that are left out after bootstrapping. The number of decision trees at which the OOBE drops significantly and stabilizes is generally chosen to build a RF model [51,52]. From the multiple preliminary tests carried out in this study, it was observed that the OOBE decreased rapidly and then stabilized when the tree number exceeded 80, as shown in Figure 5. Note that no prior feature selection was performed in this study.
A NN is a computational system made of several highly interconnected processing elements (neurons), which receive, process, and transmit information to other neurons [53]. In other words, a NN is a mathematical model that maps a given input to an output. In the current study, with 320 features extracted, the NN aims to predict whether the features belong to "lodging" or "non-lodging".
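A hedged scikit-learn sketch of this RF setup, on synthetic stand-in data (the study trained in MATLAB; the 100-tree count is an assumption consistent with the observation that the OOBE stabilized beyond roughly 80 trees):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 260 training images x 320 texture features
rng = np.random.default_rng(0)
X = rng.random((260, 320))
y = rng.integers(0, 2, size=260)          # 0 = non-lodging, 1 = lodging

# max_features="sqrt" draws ~sqrt(320) ~ 18 features per node split,
# matching the sqrt(m) rule described above.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
oobe = 1.0 - rf.oob_score_                # out-of-bag error
```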
A multi-layer perceptron feed-forward neural network was used in this study, configured with two hidden layers of 10 and 5 neurons for the first and second hidden layers, respectively, as shown in Figure 6. The first (input) layer consists of the 320 features, and the last (output) layer represents the class labels (i.e., lodging and non-lodging). Each neuron in the hidden layers is a computational unit that transforms the linear sum of the inputs received from the preceding neurons. The transformation was carried out using a non-linear activation function, as shown in Figure 6. Training a NN involves computing the weights (w in Figure 6) and biases (b in Figure 6) of the model to minimize the classification error in an iterative manner. In this study, the MATLAB® built-in program "net" was used to train the multi-layer feed-forward neural network model with 1000 iterations [53]. The 'sigmoid' function was used as the activation function in the hidden neurons, and the 'softmax' function was applied to determine the class label in the output layer, as shown in Figure 6.
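The 320-10-5-2 architecture with sigmoid hidden units and a softmax output can be sketched as a single forward pass (untrained random weights, purely illustrative; the study trained the actual model with MATLAB's "net"):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())       # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
# Layer sizes from the text: 320 inputs -> 10 -> 5 -> 2 outputs
W1, b1 = rng.standard_normal((10, 320)) * 0.1, np.zeros(10)
W2, b2 = rng.standard_normal((5, 10)) * 0.1, np.zeros(5)
W3, b3 = rng.standard_normal((2, 5)) * 0.1, np.zeros(2)

def forward(x):
    h1 = sigmoid(W1 @ x + b1)     # first hidden layer, 10 neurons
    h2 = sigmoid(W2 @ h1 + b2)    # second hidden layer, 5 neurons
    return softmax(W3 @ h2 + b3)  # probabilities: lodging / non-lodging

p = forward(rng.random(320))
```

Training would iteratively adjust the W and b arrays to minimize the classification error, as described above.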
The SVM is a supervised machine learning algorithm that has gained a lot of attention in recent years and is used to perform the task of classification [54]. The objective of the SVM is to find a hyperplane in an F-dimensional space (F is the number of extracted features) that can distinctly classify the data points. Given that there are many possible hyperplanes, the SVM chooses the one that has the maximum margin, i.e., the maximum distance between points of different classes, denoted by 'Margin' in Figure 7. Support vectors are the data points closest to the hyperplane, which determine its position and orientation.
To facilitate the distinction of data that are not linearly separable, kernel functions that transform the original data into higher dimensions are generally adopted, as shown in Figure 7C. There are various kernel functions, such as "linear", "polynomial", "Gaussian or Radial Basis Function" (RBF), and "sigmoid", among which the RBF was applied in this study because of its robustness and proven performance [54,55]. As there were only two classes (lodging and non-lodging) in this study, a binary SVM was selected for image classification [56]. To perform the classification, the built-in MATLAB® function "fitcsvm" was executed with the "auto" kernel scale as an input argument. Applying an "auto" kernel scale means that the algorithm automatically selects an appropriate scale factor using a heuristic procedure.
Figure 6. Pattern recognition neural network used in this study for wheat lodging detection, consisting of two hidden layers (10 and 5 neurons) with input (320 features) and output (lodging and non-lodging); w and b stand for the weight matrix and bias, respectively.
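A hedged scikit-learn analogue of this RBF SVM on synthetic stand-in data (note that scikit-learn's gamma="scale" is a data-driven kernel-width heuristic loosely comparable to, but not the same as, fitcsvm's "auto" kernel scale):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((260, 320))               # synthetic stand-in features
y = rng.integers(0, 2, size=260)         # binary labels: lodging / non-lodging

# Binary SVM with a Gaussian (RBF) kernel; gamma="scale" lets the
# library choose the kernel width from the data.
clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X, y)
labels = clf.predict(X[:5])
```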
All plot images were re-sized to 80 × 250 pixels before feature extraction. The above procedures for image resizing, feature extraction, and model training/testing with RF, NN, and SVM were performed in MATLAB® R2019a (The MathWorks, Inc., Natick, MA, USA).

Deep Learning
In the deep learning CNN approach, the whole image is fed as the input, and the various aspects/objects in the image are assigned importance (e.g., learnable weights and biases) to establish a distinction between different objects [37]. Compared to the traditional machine learning methods, which require manual feature extraction from individual images, the CNN deep learning approach requires minimal image pre-processing [57]. In a CNN, an image is passed through a sequence of convolutional layers, or kernel filters, to extract features (not directly accessible to users). Kernel filters are composed of weights that are determined through an iterative process. Multiple convolutional layers may be required, with different layers extracting different levels of features: low-level features include edges, colors, and gradients, while high-level features capture a more holistic understanding of the image.
Following convolution, a pooling operation was carried out in each layer to reduce the spatial size of the convolved features, which simultaneously reduced the computational power required for further data processing. There are two commonly used pooling approaches: max pooling and average pooling. Max pooling, which is generally superior as it suppresses noise, was applied in this study. The final layer is a fully connected classification layer that uses the softmax technique to provide the predicted label of the image.
Similar to the majority of other machine learning algorithms, a CNN needs a large set of training images to avoid overfitting. Considering the relatively small number of samples in this study (372), data augmentation (increasing the data sample) was performed. There are two popular approaches to data augmentation. The first is to physically enlarge the dataset; for example, the current 372 samples could be tripled to 1116 samples, and the physically augmented dataset then used for training and testing. The other is to perform the augmentation before each epoch, so that slightly different versions of the images are fed into the algorithm during training. To avoid physically enlarging the dataset (saving disk space), the second approach was implemented by applying a variety of geometric transformations to the original images [58]. These geometric transformations include reflection, translation, rotation, horizontal/vertical scaling, zooming, and flipping, among which the first four were applied. Figure 8 shows a sample of the images generated by these transformations for image data augmentation.
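The per-epoch augmentation described above can be sketched as follows (a NumPy illustration; the transformation ranges are assumptions, and scaling, which the study also used, is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Return a randomly transformed copy of a square image array.
    Sketch of on-the-fly augmentation: reflection, translation, rotation."""
    out = img
    if rng.random() < 0.5:
        out = out[:, ::-1]                # horizontal reflection
    shift = int(rng.integers(-3, 4))
    out = np.roll(out, shift, axis=1)     # small horizontal translation
    k = int(rng.integers(0, 4))
    out = np.rot90(out, k)                # rotation by a multiple of 90 degrees
    return out
```

Calling augment on every training image before each epoch means the network rarely sees exactly the same pixels twice, without any extra images being written to disk.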


Simple Convolutional Neural Network for Classification
A simple convolutional neural network (SCNN) consisting of three convolutional layers, two pooling layers, and one fully connected layer was generated and trained. For the first convolutional-pooling layer, eight convolution filters (resulting in 24 feature maps) of 3 × 3 pixels with a stride of one pixel were applied, followed by a 2 × 2 max pooling layer, as shown in Figure 9.
A second convolutional-pooling layer consisted of 16 convolution filters of 3 × 3 pixels with a stride of one and a 2 × 2 max pooling layer, shown in Figure 9 as a hidden layer. It was then followed by another convolutional layer, consisting of 32 convolution filters, and a fully connected layer. A Softmax activation function normalized the output of the fully connected layer, and the classification layer (final layer) used the Softmax probabilities to make the classification. A rectified linear unit (ReLU) was applied as the activation function in all hidden layers.
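The two activation functions named above are simple to state explicitly; the sketch below (illustrative Python, with hypothetical logits for the two-class lodging/non-lodging output) shows how ReLU gates hidden activations and how Softmax turns the fully connected layer's outputs into class probabilities:

```python
import math

def relu(x):
    """Rectified linear unit: pass positive values through, zero out negatives."""
    return [max(0.0, v) for v in x]

def softmax(logits):
    """Normalize raw outputs into probabilities that sum to 1
    (subtracting the max first for numerical stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

print(relu([-1.0, 2.0]))  # [0.0, 2.0]

# Hypothetical fully connected outputs for [lodging, non-lodging]:
probs = softmax([2.0, 0.5])
print([round(p, 3) for p in probs])  # [0.818, 0.182]
print(probs.index(max(probs)))       # 0 -> predicted class "lodging"
```

The classification layer simply reports the class with the largest Softmax probability.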
In this study, the network was trained using the stochastic gradient descent with momentum algorithm, with an initial learning rate of 0.01. Four epochs were applied, and the data were shuffled and geometrically transformed before being fed into every epoch (data augmentation).
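Stochastic gradient descent with momentum can be sketched on a toy one-dimensional objective (the momentum coefficient of 0.9 below is an assumed value for illustration; the paper specifies only the initial learning rate of 0.01, and its training ran in MATLAB):

```python
def sgd_momentum(grad, w0, lr=0.01, beta=0.9, steps=500):
    """Gradient descent with momentum: the velocity term accumulates past
    gradients, smoothing the updates compared with plain SGD."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)  # momentum update of the velocity
        w = w + v                    # move the weight along the velocity
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_opt = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_opt, 4))  # converges to the minimizer, 3.0
```

In a real network the same update is applied to every weight, with the gradient estimated on a mini-batch of (augmented) training images each step.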

VGG-16
The VGG-16 is a pre-trained CNN architecture whose weights were determined through training conducted on approximately one million images from the ImageNet dataset (http://image-net.org/index). The VGG architecture, shown in Figure 10, consists of 13 convolutional layers (extracting image features with 3 × 3 filters), five max pooling layers (reducing the spatial size of images), and three fully connected layers (classifying images into labels). The model can classify images into 1000 object categories (e.g., keyboard, mouse, and pencil). Compared to the SCNN, which requires training the whole network from randomly initialized weights, the pre-trained VGG-16 saves model training time as the weights were already determined [38]. The procedure of using VGG-16 for lodging detection is described in Figure 11, starting with loading the pre-trained VGG-16. The late layers (e.g., "loss3-classifier" and "output" of the loaded VGG-16), which combine network-extracted features into class probabilities, were replaced with new layers adapted to the datasets of this study. Before re-training the network, the weights of the earlier layers were frozen by setting their learning rates to 0. In addition to significantly reducing the time required for network training, freezing the weights of the initial layers prevents their overfitting on the new dataset. The training of the network starts with resizing all images to a standard size of 224 × 224 × 3, realized by the imageAugmenter function. MaxEpochs and InitialLearningRate, the two key options for network training, were set to 6 and 3 × 10−4, respectively. Then, the testing dataset was fed into the newly trained VGG-16 network, and the predicted results were used to calculate the accuracy of the updated VGG-16 model.
Figure 11. Procedure of updating VGG-16 for lodging detection. * Early layers learn low-level features (e.g., edges and colors); # late layers learn task-specific features; *# new layers learn features related to lodging; ** training dataset comprises 260 randomly selected plots (70%) of the entire dataset; ## options set for training (e.g., data augmenter and epochs); #* testing dataset is the remaining part (112 plot images) of the entire dataset.
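The layer-freezing idea, setting the learning rate of pre-trained early layers to 0 while a new classification layer is re-trained, can be sketched as a per-layer update rule (the layer names, weights, and gradients below are hypothetical; the paper's transfer learning was done on the full VGG-16 in MATLAB):

```python
def train_step(layers, grads):
    """One gradient update where each layer carries its own learning rate.
    Frozen layers (lr = 0) keep their pre-trained weights unchanged."""
    for layer, grad in zip(layers, grads):
        layer["w"] = [w - layer["lr"] * g for w, g in zip(layer["w"], grad)]
    return layers

# Early (pre-trained, frozen) layer vs. a new trainable classification layer.
layers = [
    {"name": "early_conv",     "w": [0.5, -0.2], "lr": 0.0},   # frozen
    {"name": "new_classifier", "w": [0.1,  0.1], "lr": 3e-4},  # re-trained
]
grads = [[1.0, 1.0], [1.0, -1.0]]
train_step(layers, grads)
print(layers[0]["w"])  # [0.5, -0.2] -> unchanged by training
print(layers[1]["w"])  # nudged by lr = 3e-4, matching InitialLearningRate above
```

Because frozen layers receive no updates, their gradients need not even be computed in practice, which is where the training-time saving comes from.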

GoogLeNet
Compared to the VGG-16 network, in which convolutional layers are stacked linearly for better performance [39], GoogLeNet applies the inception module for feature extraction [59]. The inception module is a block of parallel convolutional layers with three differently sized filters (i.e., 1 × 1, 3 × 3, and 5 × 5) and a 3 × 3 max pooling layer, whose results are concatenated, as shown in Figure 12 [60]. Since large (5 × 5) and small (1 × 1) filters extract general and local features, respectively, the inception module extracts features in a more inclusive manner. This study took advantage of a pre-trained 22-layer-deep GoogLeNet for detecting wheat lodging. Other than loading GoogLeNet instead of VGG-16, the procedure of applying GoogLeNet is exactly the same as that of applying VGG-16, described earlier and shown in Figure 11. In this study, we tested three CNNs for their accuracy in detecting lodged plots; for all three methods, the datasets were randomly partitioned into training and testing sets at a ratio of 7:3 before being fed into the algorithms. The above procedures for constructing, loading, modifying, and running the three different CNNs were performed in MATLAB® R2019a (The MathWorks, Inc., Natick, MA, USA).
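The defining property of the inception module, parallel branches applied to the same input with their feature maps concatenated along the channel axis, can be sketched with toy branches (the branch functions below are illustrative stand-ins, not real convolutions):

```python
def inception_forward(x, branches):
    """Run each branch on the same input in parallel and concatenate the
    resulting (name, value) feature maps along the channel dimension."""
    outputs = []
    for branch in branches:
        outputs.extend(branch(x))
    return outputs

# Toy branches standing in for the 1x1, 3x3, 5x5 filters and the pooling path.
b1   = lambda x: [("1x1", sum(x))]                    # local features
b3   = lambda x: [("3x3", max(x)), ("3x3b", min(x))]  # mid-scale features
b5   = lambda x: [("5x5", sum(x) / len(x))]           # general features
pool = lambda x: [("pool", max(x))]

features = inception_forward([1, 2, 3, 4], [b1, b3, b5, pool])
print([name for name, _ in features])  # ['1x1', '3x3', '3x3b', '5x5', 'pool']
print(len(features))                   # output depth = sum of branch depths: 5
```

This is why an inception block's output channel count is simply the sum of its branches' channel counts, and why the module captures local and general structure at once.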

Accuracy Evaluation and Model Comparison
Image classification results for the test dataset are most commonly assessed using the following three performance metrics: precision (PRE), recall (REC), and overall accuracy (OAC) [34].

Besides these, another performance metric called the F-measure (F1 score), which combines precision and recall, is also used [61].
PRE = #TP / (#TP + #FP)  (1)
REC = #TP / (#TP + #FN)  (2)
OAC = (#TP + #TN) / (#TP + #TN + #FP + #FN)  (3)
F1 = 2 × PRE × REC / (PRE + REC)  (4)
where # stands for "number of", TP is true positive (lodging plots classified as lodging), TN is true negative (non-lodging plots classified as non-lodging), FP is false positive (non-lodging plots classified as lodging), and FN is false negative (lodging plots classified as non-lodging). The four metrics, Equations (1)-(4), were calculated as averages over 10 replications. Model accuracies were compared among the machine learning models as well as among the three dates within each group, and finally between the selected models from the two groups. Tukey's test was performed at the 0.05 significance level using SAS (Version 9.4, SAS Institute Inc., Cary, NC, USA) for these comparisons.
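Equations (1)-(4) can be checked with a short sketch (the confusion counts below are hypothetical, chosen only to sum to a 112-plot test set like the one in this study; the paper's analysis was performed in MATLAB and SAS):

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, overall accuracy, and F1 from confusion counts,
    following Equations (1)-(4)."""
    pre = tp / (tp + fp)                      # Eq. (1)
    rec = tp / (tp + fn)                      # Eq. (2)
    oac = (tp + tn) / (tp + tn + fp + fn)     # Eq. (3)
    f1 = 2 * pre * rec / (pre + rec)          # Eq. (4)
    return pre, rec, oac, f1

# Hypothetical test split of 112 plots: 40 lodged plots found,
# 5 missed (FN), 4 false alarms (FP), 63 correctly rejected (TN).
pre, rec, oac, f1 = classification_metrics(tp=40, tn=63, fp=4, fn=5)
print(round(pre, 3), round(rec, 3), round(oac, 3), round(f1, 3))
```

Note that F1 depends only on PRE and REC, so it ignores true negatives, which is one reason it can diverge from OAC when the classes are imbalanced.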

Traditional Machine Learning for Lodging Detection
The model performance metrics PRE, REC, OAC, and F1 for the three classifiers (RF, NN, and SVM) on the three individual date datasets are given in Figure 13. Detection accuracies varied with the classifier and the performance metric. The REC ranged from 73% to 87%, 67% to 92%, and 70% to 92% for RF, NN, and SVM, respectively, and overall it was the lowest of the four metrics for all three classifiers. The PRE ranges were 87-88%, 77-85%, and 87-92%, and the F1 ranges were 79-88%, 71-88%, and 77-91%, for RF, NN, and SVM, respectively. Additionally, the OAC ranges were 85-88%, 85-91%, and 88-93%, for RF, NN, and SVM, respectively. Compared to PRE, REC, and F1, OAC performed more desirably: in addition to being the highest of the four metrics, it showed the smallest average fluctuation among the three classifiers, as shown in Table 1, indicating its robust performance. Therefore, only OAC was used in the following discussion for model performance comparisons, and hereafter, accuracy denotes OAC.

Figure 13. Classification accuracies of random forest, neural network, and support vector machine for lodging detection on the three different date datasets, where PRE, REC, OAC, and F1 denote precision, recall, overall accuracy, and F1 measure, respectively.

For all three classifiers, the overall trend was that accuracy increased with time, from 89% to 91%, 85% to 91%, and 88% to 93%, for RF, NN, and SVM, respectively, as shown in Figure 14. All three classifiers achieved their highest accuracies on the last date dataset (8 August 2019), and the first and last dates were significantly different (p < 0.05), as shown in Figure 14. This can be explained by the fact that lodging is a dynamic process that requires time to complete.
In addition, the average standard deviations for RF, NN, and SVM over the three date datasets were calculated as 0.005, 0.046, and 0.027, respectively, with RF producing the smallest deviation.
Further comparisons of the three classifiers' performances on the individual date datasets are shown in Figure 15. For any of the three date datasets, the classifiers' accuracies were not significantly different from each other (p > 0.05), indicating that the choice of classifier does not affect the accuracy.
However, RF was determined to be the most satisfactory approach because of its lowest standard deviation, as shown in Figure 14. Generally, to achieve higher accuracy, it is desirable to avoid using images collected immediately after lodging occurs.

Deep Learning for Lodging Detection
For the deep learning algorithms SCNN and GoogLeNet, statistical comparisons of accuracy across the date datasets showed that the accuracy on the last date (8 August 2019) was significantly higher than on the second date (30 July 2019), but not significantly higher than on the first date (23 July 2019), as shown in Figure 16. This was probably because the second date represented a transitional lodging stage, and the automatically extracted features did not perform well during this transition. The accuracy of the VGG-16 deep learning algorithm was not significantly different across the individual date datasets, indicating robust performance on all three dates. In addition, GoogLeNet had the lowest average standard deviation (0.027), followed by SCNN (0.032) and VGG-16 (0.044). Similar to the results of the previous section, it is generally recommended to apply the deep learning algorithms to the last date dataset for higher lodging detection accuracy.

Further comparisons of the three deep learning algorithms across all three dates showed a similar accuracy pattern, as shown in Figure 17. GoogLeNet ranked highest in accuracy and was significantly different from the other two (SCNN and VGG-16), which in turn were not significantly different from each other. The GoogLeNet accuracies for 23 July 2019, 30 July 2019, and 8 August 2019 were 91%, 89%, and 93%, respectively. This result indicates that it is preferable to use GoogLeNet on the 8 August 2019 dataset for higher detection accuracy. Crop lodging detection accuracies achieved with machine learning can also be found in the literature. Yang et al. [20] achieved 96% rice lodging detection accuracy using visible light and spectral images. Kumpumäki et al. [62] reported a detection accuracy of 73% for rye using Sentinel-2 images. Chauhan et al. [16] and Rajapaksa et al. [32] achieved 90% and 92% accuracies, respectively, for wheat lodging detection based on UAS multispectral data. Rajapaksa et al. [32] further reported canola lodging detection accuracies of 90% and 87% using five-channel spectral imagery (red, blue, green, near-infrared, and red edge). In our study, the 93% accuracy of GoogLeNet for wheat lodging detection can be considered an improvement over the reported results and a satisfactory performance. In addition, we used only visible light images, which significantly decreases the technology cost.

Comparison of RF and GoogLeNet
RF and GoogLeNet were identified as the most desirable algorithms for detecting lodged plots from the traditional machine learning and deep learning approaches, respectively. The accuracy comparisons of RF and GoogLeNet for wheat lodging detection on the individual date datasets are shown in Table 2. For any of the three dates, the detection accuracies of the two classifiers were not significantly different from each other (p > 0.05). Additionally, both methods had their highest accuracies on the last date dataset (8 August 2019). It can be concluded that the choice of method (either RF or GoogLeNet) would not affect the accuracies. However, considering that GoogLeNet achieved a higher average accuracy than RF (93% > 91%), it is recommended to use GoogLeNet for wheat lodging detection.


Future Research Direction
This research focused on distinguishing lodged plots from non-lodged plots in a binary sense; future research should specifically look into determining quantitative information on lodging severity, expressed on a scale ranging from no lodging to complete lodging. Additionally, the versatility of this method should be tested and validated on other crops susceptible to lodging, such as canola, corn, and soybean. Since the current method is entirely offline, future efforts should be directed toward near-real-time or real-time lodging detection for the benefit of end-users. Embedded systems attached to UAS, coupled with the RF or GoogLeNet algorithms, are another promising area of research [63,64]. In this study, the UAS was flown at 25 m above ground level (AGL), with an image resolution of 0.7 cm. At a higher AGL, the UAS would cover a larger area per flight but would obtain lower-resolution images. More studies are needed to check the performance of these machine learning algorithms at higher AGLs (e.g., 35 m and 40 m). Considering that the current model prediction accuracy is >90% using only textural features, incorporating visible color and spectral information has the potential to further improve model performance, and this can be addressed through additional research.
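The resolution trade-off at higher AGLs can be estimated with simple arithmetic, assuming the ground sampling distance scales linearly with altitude for a fixed camera (a standard approximation; the 35 m and 40 m values below are the candidate altitudes mentioned above, not flights flown in this study):

```python
def scale_gsd(gsd_ref_cm, agl_ref_m, agl_new_m):
    """Ground sampling distance grows linearly with altitude for a fixed
    camera, so resolution at a new AGL follows from one reference pair."""
    return gsd_ref_cm * agl_new_m / agl_ref_m

# Reference from this study: 0.7 cm/pixel at 25 m AGL.
for agl in (25, 35, 40):
    print(agl, "m ->", round(scale_gsd(0.7, 25, agl), 2), "cm/pixel")
```

At 40 m the resolution would coarsen to roughly 1.1 cm/pixel, which is why the texture-based classifiers would need re-evaluation at those altitudes.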

Conclusions
In this study, UAS imagery was collected over wheat plots (372 individual plots) on three different dates, along with ground truth data (lodging or non-lodging) for each plot. After the stitching process, individual plot images were manually cropped from the orthomosaic imagery to create a dataset for each date. For the traditional machine learning approach, 320 extracted features were fed into three algorithms (random forest, neural network, and support vector machine). Although the detection accuracies of the three algorithms were not significantly different on any of the three date datasets (p > 0.05), RF was determined to be the most satisfactory and robust method because of its lowest standard deviation. For the deep learning approach, which has the advantage of avoiding manual feature extraction by using the images directly, the detection accuracy of GoogLeNet was consistently higher than that of the other two algorithms, and it reached its highest detection accuracy (93%) on the last date (8 August 2019) dataset, when crop lodging was complete. Detection accuracy comparisons between RF and GoogLeNet demonstrated no significant differences (p > 0.05) between the two methods on any of the three date datasets. Users could therefore choose either of the two methods (RF or GoogLeNet) based on their preference or the availability of resources. Future research should investigate the quantification of lodging severity on a numerical scale, the extension to other crops susceptible to lodging, and the use of real-time detection and embedded systems. It should be noted that the UAS used in this study can fly a mission of at most ~25 min, which significantly limits its application to large fields. In addition, flights depend heavily on weather conditions, and the UAS cannot carry out scouting work in rainy or windy weather.
This study demonstrated that UAS imagery, coupled with machine learning algorithms, has the potential to be used as a novel, objective, and promising ready-to-use tool for wheat lodging detection because of its simplicity and efficiency (accuracy > 90%). The developed technology could benefit wheat breeders and growers, insurance loss adjusters, as well as agronomists and plant physiologists.
Funding: This research is financially supported by the United States Department of Agriculture (USDA).