1. Introduction
Lodging negatively affects yield in many crops [1,2,3]. This damage, defined as the displacement of plant stems from an upright position [4], is mainly induced by the strong winds of extreme weather events during the growing season of the crop. Lodging negatively impacts plant morphology [5], physiological processes [6], and growth [7], as well as impeding harvesting activities, all of which significantly reduce the final above-ground yield of the crop [8,9,10,11]. Lodging damage can also introduce error into, or cause the loss of, phenotypic data used for selection in breeding, genetic analysis, or management decisions [9,10]. Due to its rapid vertical growth [12] and tall stature of up to 3–4 m, biomass-type Sorghum bicolor (L.) Moench is prone to lodging, especially in the second part of the growing season [13]. The development of digital tools that provide the precise and timely [14] evaluation of lodging would allow plant breeders to assess larger populations more accurately, select lines that suffer less from lodging, and accelerate crop improvement [15]. In addition, agronomists and farmers would benefit from the ability to evaluate lodging damage and associated yield loss when making decisions about the return-on-investment from harvesting versus insurance claims [16].
Most studies on lodging have relied on in situ visual inspection of the crop by domain experts, i.e., agronomists or breeders [7,17]. However, such methods require intensive manual labor and are limited to small spatial areas. Lodging has been monitored using satellites [14] and aircraft [18], and more recently UAVs [3]. Advanced data-driven modeling techniques benefit from the increased resolution of spatial data collected by UAVs, as they are able to exploit these data to learn features relevant to a given task [19]. Moreover, high-quality data collection by UAVs is less constrained by cloud cover than when satellites or higher-altitude aircraft are used, enabling more timely field inspections [4]. Multispectral imagery is the most common source of information for lodging assessment using remote sensing [4]. Under lodged conditions, the signal of different wavelengths that is reflected or backscattered to the sensor is altered by the changes in plant geometry and structure [2,20]. Earlier work [21] successfully demonstrated the integration of spectral and textural image features to detect lodged areas of wheat with 82–91% accuracy. Other studies [22,23,24] showed the value of integrating remote sensing imagery and machine learning to assess lodging damage. The first step in these approaches is a laborious manual search by domain experts for meaningful color- and texture-based features, which are then marked to provide inputs for a classification algorithm to perform pixel-level mapping of lodging in certain areas [3]. While they represent important advances, there is potential to increase the accuracy and precision of the analyses while also reducing labor. More recently, the ability of Convolutional Neural Networks (CNNs) to maximize the information gained from high-resolution UAV imagery was demonstrated [25,26].
A CNN automatically generates representations of complex features, which provides an advantage over traditional machine learning and is suitable for automated image-based inference [26]. While CNNs have proved to be a powerful image analysis technique, most studies have only exploited variations along two dimensions in the data, e.g., the two spatial dimensions of a single image (2D-CNNs) [27]. There is a growing interest in adopting CNN models capable of exploiting a third dimension of variation, e.g., along a time series (3D-CNNs) [28]. Additionally, the recent exponential increase in the number of datasets that feature high-resolution imagery collected at frequent intervals over a growing season for a single field trial now makes this a practical possibility. A 3D-CNN architecture was used to predict the end-of-season yield of wheat, barley, and oat at field scale by integrating UAV time series, weather, and yield monitoring data [19]. A sequential CNN-Long Short-Term Memory (LSTM) network architecture was used to predict soybean yield at county scale by integrating MODIS satellite time series, weather, and national yield data [29]. In a different context, deep learning 3D-convolution-based architectures were utilized and benchmarked against traditional 2D-convolution architectures for hyperspectral imagery scene classification [29]. Together, these results across different use cases suggest that 3D-CNN architectures can outperform other architectures by exploiting their ability to use time as an additional descriptor of the target trait.
In this study, we propose an examination of the effect of time, as an additional feature, on lodging damage prediction. Specifically, we focus on determining the value of persistent aerial monitoring using spatio-temporal 3D-CNN time series models based on UAV time series imagery as inputs to determine both lodging detection and lodging severity.
The first objective is to assess the importance of the visual and near-infrared spectral and geometric features derived from the UAV imagery in the performance of the CNN models. Second, we test whether 3D-CNN model architectures applied to time series imagery can surpass the performance of a 2D-CNN analysis of images from single dates. Third, we perform a comparative evaluation of different time sequence configurations to understand the impact of the number of flights on the predictive sensitivity of 3D-CNN models. Fourth, we consider the computation time needed for each model as part of the trade-off evaluation of the models. Lastly, we utilize activation maps to visualize and interpret CNN learning and predictive ability in these tasks [30].
2. Materials and Methods
2.1. Experimental Site
A biomass sorghum experiment with an augmented incomplete block design of 1118 plots was planted on 31 May 2019, at the University of Illinois Energy Farm research facility, Urbana-Champaign (40.065789° N, −88.208477° W). The trial included 866 diverse accessions of biomass sorghum in single, four-row plots. In addition, six “check” accessions were planted in 16 four-row plots distributed across each of the 16 blocks, as was previously described in detail [11]. A storm with strong winds produced lodging in plots across the experiment 82 days after planting (DAP), as shown in Figure 1. The ground-truthing survey was implemented immediately after the lodging event, between 83 and 85 DAP.
2.2. Lodging Assessment
We utilized a two-stage visual assessment of lodging to provide the data needed for model training and validation. The steps were: (1) to determine whether each plot was lodged or non-lodged; and (2) to estimate the severity of lodging, in the plots where it occurred, in terms of a lodging severity score initially suggested by [10] and adapted in [25]: lodging severity was scored on a three-point scale based on the visual evaluation of the average inclination angle from vertical of plant stems in the lodged area of the plot, where a value of 1 corresponded to mild lodging (0–30°), 2 corresponded to moderate lodging (30–60°), and 3 corresponded to severe lodging (60–90°), as shown in Figure 1. Lodging scores had a continuous distribution once the categorical lodging severity was multiplied by the fraction (%) of the plot that was estimated to be lodged, as in Equation (1):

Lodging score = Lodging severity × Lodged plot fraction (%)    (1)
2.3. Aerial Data Collection and Preprocessing
A RedEdge-M (Micasense, Seattle, WA, USA) multispectral sensor (MSI) was used for imaging to acquire both geometric and multispectral features. The camera included five spectral bands in the blue (465 to 485 nm), green (550 to 570 nm), red (663 to 673 nm), red-edge (712 to 722 nm), and near-infrared (820 to 860 nm) regions of the electromagnetic spectrum. The aerial platform was a Matrice 600 Pro hexacopter (DJI, Shenzhen, China), equipped with a Gremsy T1 gimbal (Gremsy, Ho Chi Minh City, Vietnam) used to mount the MSI sensor. Flights were conducted 11 times during the season, between 31 and 97 DAP, under clear-sky conditions and around solar noon. Flight altitude was set to 40 m above ground level, resulting in a ground sampling distance (GSD) of 2–3 cm/pixel. Flight planning considered 90% forward and 80% side overlap during image acquisition to ensure successful image postprocessing. Nine black and white square panels (70 cm × 70 cm) were distributed in the field and utilized as ground control points (GCPs). An RTK (real-time kinematic) survey was implemented using a Trimble R8 global navigation satellite system (GNSS) receiver integrated with the CORS-ILUC local station to survey the GCPs for accurate geo-referencing and co-registration of imagery throughout the season. A standard Micasense calibration panel was imaged on the ground before and after each flight for spectral calibration. Images were then imported into Metashape version 1.7.4 (Agisoft, St. Petersburg, Russia) to generate two types of aerial image products: (1) multispectral orthophotos and (2) geometric 3D reconstructions of the canopy known as crop surface models (CSMs).
An early season flight before plant emergence was utilized as the ground-level reference for the extraction of absolute height in the CSM files on subsequent sampling dates (n = 11). The resulting multispectral and CSM orthophotos were resampled to a common 5 cm/pixel resolution and stacked into one multi-feature, multi-date file object. Image chips for each plot were generated by clipping the orthophoto stack with a plot polygon shapefile corresponding to the experimental plot dimensions (2.8 m × 2.8 m), as shown in Figure 2. Three variations of the image chips were produced: (1) spectral features from RGB images (RGB); (2) spectral features from RGB, red-edge and NIR bands (MS); and (3) spectral and geometric features from RGB, red-edge and NIR bands along with a crop surface model (CSM_MS) (Table 1). These steps were implemented via the geopandas, shapely, and rioxarray libraries in Python 3.7.11.
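The following is a minimal sketch of this chip-extraction step using the geopandas and rioxarray libraries named above; the file names ("ortho_stack.tif", "plots.shp") and the "plot_id" attribute column are hypothetical.

```python
# A minimal sketch of the chip-extraction step; file names and the
# "plot_id" column are hypothetical.
import geopandas as gpd
import rioxarray

# Multi-feature, multi-date orthophoto stack resampled to 5 cm/pixel
stack = rioxarray.open_rasterio("ortho_stack.tif")

# Plot polygons (2.8 m x 2.8 m), reprojected to the CRS of the imagery
plots = gpd.read_file("plots.shp").to_crs(stack.rio.crs)

# Clip the stack with each plot polygon to produce one image chip per plot
chips = {
    plot["plot_id"]: stack.rio.clip([plot.geometry], plots.crs)
    for _, plot in plots.iterrows()
}
```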
2.4. CNN Modelling
The core component of the CNN algorithm is the convolution operation, in which a set of trainable kernels is applied to the input image to generate a set of spatial features that describe the target predictor [31]. The model learns basic features in the first layers and more complex feature representations at deeper layers. The final output of a CNN is a set of feature maps; depending on the use case, these can be directly utilized or fed to a fully connected layer for either classification or regression tasks. The 3D-CNN architecture is a variation of the traditional 2D architecture [32]. It utilizes all the standard CNN architecture features but adopts a third dimension, which in this case was the temporal descriptor of the target trait across the stack of images. The 3D-convolution operation was applied with a learnable three-dimensional kernel considering the spatio-temporal characteristics of the image chip as width (number of pixels) × height (number of pixels) × depth (time sequence length).
2.5. CNN Implementation and Metrics
Customized CNN architectures were utilized in this study. After trial and error, for lodging detection the backbone feature generator of the 2D-CNN architecture included two 2D-convolution layers and one flatten layer (Figure 3a), while for lodging severity it included three 2D-convolution layers and one flatten layer (Figure S1a). The rest of the architecture comprised a fully connected head with a sigmoidal (classification) or linear (regression) layer, respectively. The 2D-convolution kernel size was set to 3 pixels × 3 pixels × N features. The 3D-CNN architecture included two 3D-convolution layers (Figure 3a and Figure S1b). The 3D-convolution kernel moves in three dimensions instead of two: not only from left to right and from top to bottom, as with the 2D-convolution filter, but also forward and backward. Its kernel size was therefore set to 3 × 3 × t, where t is the time sequence length (2–9 dates), operating on the sequence of each feature over time. The following settings were common to all CNN architectures: convolution layers with 32 features, zero padding, stride equal to one with no overlapping, and the rectified linear unit as the activation function (Figure 3 and Figure S1). In addition, batch normalization was applied in the first convolution layer; 2D-max-pooling of size two followed each convolution; 3D-max-pooling was set to two over the spatial dimensions and one over the depth dimension, and to two following the last convolution operation of the 3D architecture; and a 50 percent feature dropout was applied prior to the flatten layer.
A 2D-CNN model analysis was conducted on data from individual dates, either before or after the lodging event. This tested the importance of the timing between the lodging event, ground-truthing, and aerial data collection. The following four flight dates were tested:
DAP 75: 7 days prior to lodging, labeled as ‘75 (1)’;
DAP 83: 1 day after lodging, labeled as ‘83 (1)’;
DAP 88: 6 days after lodging, labeled as ‘88 (1)’;
DAP 97: 15 days after lodging, labeled as ‘97 (1)’.
where the values of 1 in parentheses indicate the analysis of data from a single date.
In addition, various configurations of input data from sequences of dates were analyzed using the 3D-CNN architecture to test the sensitivity of predictions to variations in the length of the time sequence and the contribution of dates before and after the lodging event. The following six sequence configurations were built (see the sketch after this list):
DAPs 69,83: 13 days prior plus 1 day after lodging, labeled as ‘69_83 (2)’;
DAPs 75,83: 7 days prior plus 1 day after lodging, labeled as ‘75_83 (2)’;
DAPs 69,75,83: 13 and 7 days prior plus 1 day after lodging, labeled as ‘69_75_83 (3)’;
DAPs 83,88,97: 1, 6 and 15 days after lodging, labeled as ‘83_88_97 (3)’;
DAPs 62,69,75,83: 20, 13 and 7 days prior plus 1 day after lodging, labeled as ‘62_69_75_83 (4)’;
DAPs 31,38,42,49,55,62,69,75,83: all available dates prior to lodging plus 1 day after lodging, labeled as ‘31:83 (9)’.
where the values of 2 to 9 in parentheses indicate the analysis incorporating data from between two and nine dates.
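A minimal sketch of assembling one such sequence configuration is shown below, assuming the per-plot chips are held in an array of shape (n_plots, n_dates, height, width, n_features) with the 11 flights in chronological order; the variable names are hypothetical.

```python
# Assemble the time-sequence inputs for one configuration; `chips` and the
# function name are hypothetical.
import numpy as np

FLIGHT_DAPS = [31, 38, 42, 49, 55, 62, 69, 75, 83, 88, 97]

def make_sequence(chips: np.ndarray, daps: list) -> np.ndarray:
    """Select the flight dates of one configuration, e.g., [75, 83]."""
    idx = [FLIGHT_DAPS.index(d) for d in daps]
    return chips[:, idx]  # (n_plots, len(daps), height, width, n_features)

# Example: the '75_83 (2)' configuration (7 days prior plus 1 day after)
# x = make_sequence(chips, [75, 83])
```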
CNN models were implemented in Python using the Keras API with TensorFlow-GPU version 2.7.0 on an NVIDIA GeForce RTX 3070 (8 GB of GPU memory). The combination of different types of features, CNN architectures, and flight dates or time sequence lengths resulted in the implementation of 60 models (Table S1). Half of these models were used to assess lodging detection, while the other half assessed lodging severity. In each of these tasks, 12 models were based on the 2D-CNN architecture, as a result of combining three types of features (Table 1) and the four dates of aerial data collection. The remaining 18 models were based on the 3D-CNN architecture, as a result of combining three types of features (Table 1) and six configurations based on different time sequence lengths and the inclusion of data from dates before or after the lodging event. Each model fitting was iterated 10 times using random training and testing partitions to ease the convergence of the models’ prediction metrics.
The full set of 1118 image chips was used to assess lodging detection, with 662 plots labeled as non-lodged and 456 plots labeled as lodged during the on-the-ground field survey. The 456 plots labeled as lodged were further considered to assess lodging severity. In both tasks, the dataset was split (70:30) into training and testing datasets, and the training dataset was split further (80:20) into training and validation datasets. The validation dataset was used to optimize the models’ performance and prevent overfitting during training. Binary cross-entropy and mean square error were used as the loss functions in the two tasks, respectively. Models were trained for up to 400 epochs using an early stopping setting with a lambda value of 0.01 to ensure efficient training time. The learning rate and decay parameters were initialized with values of 0.01 and 0.001, respectively. The test dataset was utilized to expose the models to unseen data to evaluate their generalization ability.
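A minimal sketch of this training configuration, reusing the detector defined earlier, is shown below. The optimizer choice, the patience value, and reading the "lambda" of 0.01 as an early-stopping min_delta are all assumptions; `x` and `y` stand for hypothetical chip and label arrays.

```python
# Training configuration sketch; optimizer, patience, and the min_delta
# reading of "lambda" are assumptions.
from sklearn.model_selection import train_test_split
from tensorflow.keras import callbacks, optimizers

# 70:30 train/test split, then 80:20 train/validation split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2)

model = build_2d_detector()
model.compile(
    optimizer=optimizers.SGD(learning_rate=0.01, decay=0.001),
    loss="binary_crossentropy",  # mean squared error for lodging severity
    metrics=["accuracy"],
)
early_stop = callbacks.EarlyStopping(monitor="val_loss", min_delta=0.01,
                                     patience=10, restore_best_weights=True)
model.fit(x_tr, y_tr, validation_data=(x_val, y_val),
          epochs=400, callbacks=[early_stop])
```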
The overall accuracy (OA), Precision, and Recall were utilized as evaluation metrics to determine the performance of the models on lodging detection. The metrics are described in Equations (2)–(4):

OA = (TP + TN) / (TP + TN + FP + FN)    (2)

Precision = TP / (TP + FP)    (3)

Recall = TP / (TP + FN)    (4)

In Equations (2)–(4), TP (true positive) is defined as lodged plots correctly classified as lodged. TN (true negative) is defined as non-lodged plots correctly classified as non-lodged. FP (false positive) is defined as non-lodged plots incorrectly classified as lodged. FN (false negative) is defined as lodged plots incorrectly classified as non-lodged.
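Equations (2)–(4) can be computed directly from the plot-level labels; a minimal numpy sketch with hypothetical variable names:

```python
# Detection metrics from binary plot labels (1 = lodged, 0 = non-lodged).
import numpy as np

def detection_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    tp = np.sum((y_true == 1) & (y_pred == 1))  # lodged classified lodged
    tn = np.sum((y_true == 0) & (y_pred == 0))  # non-lodged classified non-lodged
    fp = np.sum((y_true == 0) & (y_pred == 1))  # non-lodged classified lodged
    fn = np.sum((y_true == 1) & (y_pred == 0))  # lodged classified non-lodged
    oa = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return oa, precision, recall
```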
The mean absolute error (MAE), the relative root mean square error (RMSE), and the coefficient of determination (R²) were utilized as evaluation metrics to determine the performance of the CNNs on the lodging severity score. Each metric is described in Equations (5)–(7):

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|    (5)

RMSE = (100/ȳ) √[(1/n) Σᵢ (yᵢ − ŷᵢ)²]    (6)

R² = 1 − [Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²]    (7)

In Equations (5)–(7), yᵢ, ŷᵢ, and ȳ are the observed, predicted, and observed mean lodging severity score values, with yᵢ and ŷᵢ referring to the ith plot, and n is the total number of samples within the study area. The computation time and GPU memory used were computed using a custom callback function and custom calls to NVIDIA-smi in Python during model implementation, respectively. Time was computed as seconds per epoch during model training, and GPU memory usage was computed as the maximum percentage of GPU memory used during model implementation.
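The study's exact callback is not given, so the following is an illustrative sketch of the timing and GPU-memory tracking; the nvidia-smi query flags are standard, but the reporting format is an assumption.

```python
# Illustrative timing callback and nvidia-smi polling.
import subprocess
import time
from tensorflow.keras import callbacks

class EpochTimer(callbacks.Callback):
    """Record seconds per epoch during model training."""
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch}: {time.time() - self._start:.1f} s")

def gpu_memory_used_mib() -> int:
    """Poll current GPU memory usage (MiB) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"]
    )
    return int(out.decode().split()[0])
```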
Activation mapping visualization [33] was implemented to better understand the ability of the CNNs to use the information contained in the image arrays to deliver the output prediction. Using this technique, the importance of image regions in the prediction can be identified by projecting the weights of the output layer back onto the convolutional feature maps [30]. The following steps were used to generate activation maps in both tasks, as described in [34]. First, the CNN model was utilized to map the input image to the activations of the last convolution layer as well as to the output predictions. Then, the gradient of the predicted class (classification) or value (regression) for the input image was computed with respect to the activations of the last convolution layer. Lastly, each channel in the feature map array was multiplied by a measure of how important that channel was with regard to the predicted class or value, and all the channels were summed to generate the corresponding activation map. In essence, this provides a measure of how strongly portions of the image contribute to the predictions made by the CNN, and it can be expressed and visualized as a false-color scale.
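A minimal sketch of these three steps, adapted from the standard Keras Grad-CAM example, is shown below; the layer name argument and the single-unit output head are assumptions.

```python
# Grad-CAM-style activation map following the three steps above.
import numpy as np
import tensorflow as tf

def activation_map(model, image, last_conv_name):
    # Step 1: map the input to the last convolution activations and the output
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(last_conv_name).output, model.output]
    )
    # Step 2: gradient of the predicted value w.r.t. those activations
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis])
        score = preds[:, 0]  # class probability or regression value
    grads = tape.gradient(score, conv_out)
    # Step 3: weight each channel by its importance and sum the channels
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```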
4. Discussion
This study analyzed two spatial dimensions and a temporal dimension in data from a time course of UAV remote sensing images, using a 3D-CNN architecture, to facilitate improvements in lodging detection and the prediction of lodging severity relative to a standard 2D-CNN architecture (Figure 4 and Figure 5). Modest gains were made in the qualitative task of lodging detection through the use of a 3D-CNN compared to a 2D-CNN, which had an overall accuracy of 0.79–0.88 depending on the number of spectral and geometric features in the input data. In contrast, when estimating quantitative variation in lodging severity, there were significant benefits to using a 3D-CNN over a 2D-CNN. These gains in prediction ability were realized when a 3D-CNN was used on feature-rich data containing both geometric and spectral information; i.e., when the CSM or the red-edge and NIR wavebands were dropped from the input data, the benefits of 3D-CNN over 2D-CNN were limited. These findings highlight how leveraging the extra information captured in remote sensing data that include more spectral information and multiple timepoints can depend on the use of more complex analytical approaches. The successful application of this approach to lodging severity screening in biomass sorghum provides proof-of-concept for a tool to rapidly assess a trait that is of significant interest to breeders because it impacts harvestable yield, but is very challenging to assess on a large scale by traditional methods of visual inspection [15]. Therefore, significant gains in the efficiency of crop improvement could be achieved. In addition, the effective analysis of imagery from UAVs would provide farmers with a tool that is accurate, while also being much faster than manual field inspections of large acreages, for assessing lodging damage and deciding on the relative value of harvesting a crop or generating evidence to support an insurance claim [16]. Otherwise, farmers are often advised to harvest early to avoid lodging, causing a reduction in potential productivity by shortening the growing season. More broadly, the lessons learned here may be relevant to other applications of time series remote sensing imagery, such as the assessment and evaluation of damage from seismic activity [35].
In recent years, agricultural studies have increasingly made use of a variety of imaging methods based on UAV remote sensing data. In the domains of plant breeding [36] and precision agriculture [37], this growing popularity reflects the need for the accurate assessment of crop traits (1) across many genotypes and at spatial scales that are too large for manual measurement to be efficient, yet (2) at spatial scales where the resolution and number of image features (spectral and/or geometric) from piloted aircraft or satellites can be inadequate. The independent analysis of data from individual dates of image collection by UAVs has been the most common approach to date. For example, quantitative thresholding approaches have been applied in this way to aerial images to assess lodging severity in barley (R² = 0.94) [38] and lodging rate in maize (R² = 0.50) [39]. Success has also been achieved in estimating lodging severity by training and applying CNN analysis to a combined set of images from two developmental stages of wheat [22]. These studies demonstrate the potential for the rapid assessment of lodging by UAV, but are limited by the need for manual intervention to select optimal threshold values in the workflow and by not leveraging temporal variation in the analysis of the data. They may also provide an early indication that estimating lodging severity is more challenging in tall, lower-density, and coarsely textured canopies such as maize relative to shorter, more uniformly dense and fine-textured canopies such as rice, wheat, and barley. If so, biomass sorghum is a good subject for testing methods that can cope with a difficult use case.
The results of the 2D-CNN analysis of images from different individual dates during the growing season provided valuable context for the interpretation of the 3D-CNN results. Lodging severity could be predicted with reasonable accuracy when using images collected one day after the lodging event (Figure 5a). Unsurprisingly, there was almost no useful information to be gained from the 2D-CNN analysis of images collected prior to the lodging event. Additionally, the predictive ability of the 2D-CNN analysis declined significantly and progressively when applied to images collected six, and then fifteen, days after the lodging event. This pattern is consistent with the common observation that the continued growth of lodged plants is driven by phototropism to return toward a vertical orientation, making the canopy less easy to distinguish from an undisturbed canopy in aerial imagery, even though much of the stem and biomass remains displaced.
The superior performance of the 3D-CNN over the 2D-CNN for estimating lodging severity appears to depend on leveraging a “before and after” comparison of the event; i.e., prediction accuracy measured in terms of R², MAE, or RMSE was best when data from one or two dates before and after the lodging event were analyzed (Figure 5). This is logical, given that a change in the CSM (i.e., height profile) or spectral reflectance of the canopy before and after the lodging event would provide valuable additional information beyond the variation in geometric and spectral signals across plots on a single date once lodging had occurred. This interpretation is supported by the fact that the 3D-CNN analysis of a time course of images from three dates after the lodging event failed to outperform the best 2D-CNN analysis (Figure 5).
Nevertheless, provided that the “before and after” comparison was made, the performance of the 3D-CNN was relatively insensitive to the number of sampling dates used prior to the lodging event (Figure 5). Even the analysis of data from nine dates between 31 and 83 DAP produced a lodging severity prediction (MAE = 9.99%, RMSE = 12.68%, R² = 0.72) that was almost as accurate as when only the two dates immediately before and after the lodging event were analyzed (MAE = 9.65%, RMSE = 12.40%, R² = 0.75). This suggests that significant error is not introduced into the 3D-CNN analysis when data from early in the season, which have low information value with respect to lodging, are included. As a practical consideration, this allows for the effective analysis of lodging without manual intervention to select the dates of data collection nearest to the lodging event.
The notion that a 3D-CNN analysis would improve UAV-based phenotyping beyond what a 2D-CNN is capable of is based, in part, on recent satellite remote sensing research. Satellites are an ideal source of automatically generated time courses of remote sensing data [40]. As an example, the inclusion of a third convolution dimension in the CNN architecture to integrate the spatial and time dimensions of satellite data enabled improved crop classification at a regional scale [28]. The 3D-CNN analysis of spatio-temporal data has been applied to UAV remote sensing for only a limited number of traits and crops. The applicability of the 3D-CNN analysis of spatio-temporal data to predict yield [18] is consistent with yield being a product of compound growth and dynamic interactions with the environment over the entire growing season. However, the black-box nature of CNN analysis makes it hard to pinpoint which elements of the 3D-CNN and the time course of imagery drive enhancements in performance. By contrast, the abrupt nature of lodging allows the current study to provide a case study of how 3D-CNN analysis can enhance predictive power through the inclusion of only two timepoints of data, “before and after” an event of interest in the life of the crop.
This study reveals the value of combining spectral and geometric information in feature-rich image data, which is necessary to fully leverage the benefits of the 3D-CNN over the 2D-CNN for the estimation of lodging severity (Figure 5). Spectral information has previously proved to be a relevant descriptor of lodging in crops such as maize [38], wheat [25], and barley [9]. The visual and near-infrared regions of the spectrum have been the most common proxies for lodging detection [24] and lodging severity [23]. Geometric features derived from the photogrammetric reconstruction of the canopy have been less commonly used in these tasks [2,4,41]. However, the “before and after” comparison of the geometric information in the CSM was crucial for the enhanced predictive ability of the 3D-CNN over the 2D-CNN. This parallels and extends the successful integration of geometric and spectral information to improve predictions of traits such as biomass [12,42] and plant height [43].
It was also notable that the benefit of integrating geometric and spectral features was greater for lodging severity than for lodging detection [36,39], regardless of the CNN architecture considered. This is consistent with previous findings [44,45], where spectral information alone proved highly efficient in detecting lodging as a pixel-wise classification [22]. Additionally, this is likely explained, in part, by the fact that geometric features are highly sensitive and less prone to signal saturation than spectral features [46], which becomes relevant when considering small differences in the degree of lodging severity between plots under dense canopy conditions. The computational costs of the two tasks cannot be directly compared, given that their sample sizes and 2D architectures differ slightly. The improvement in predictive performance achieved by the 3D-CNN implied a higher computational cost; however, because these models improved lodging severity prediction more than lodging detection, the relative computational cost was lower for the former trait. The best performance in this study was not achieved with the longest time sequences, but the exponential increase in the computational cost of 3D architectures for long time series cannot be overlooked, and this remains a computational challenge and an active area of research [47,48].
According to the activation maps, both CNN architectures utilized similar regions of the images to determine lodging detection and severity. However, the greater predictive power of the 3D-CNN did coincide with larger and more refined activation regions in the images compared to the maps of the 2D models. In this regard, it can be argued that the best 3D model is able to extract additional informative descriptors from the spatio-temporal features during the 3D-convolution operations and, thereby, produce an improved prediction.
Even though we utilized manual ground truths to build the models, we recognize that visual observation and quantification, performed by walking through the alleys and plots after lodging, was subject to errors that may affect the performance of the models. This could be improved by utilizing ground-based imaging methods such as LiDAR for ground-truthing, but this was beyond the scope of our study. Future research might further exploit additional ground truth data of a quantitative nature and use spatio-temporal CNN architectures to better understand the capacity of certain genotypes to recover after lodging damage. Additional analyses are also needed to determine the impact of the spatial resolution of the imagery and how to reduce the processing time of data preprocessing and the implementation of 3D-CNN architectures.