Implementing Spatio-Temporal 3D-Convolution Neural Networks and UAV Time Series Imagery to Better Predict Lodging Damage in Sorghum

: Unmanned aerial vehicle (UAV)-based remote sensing is gaining momentum in a variety of agricultural and environmental applications. Very-high-resolution remote sensing image sets collected repeatedly throughout a crop growing season are becoming increasingly common. Analytical methods able to learn from both spatial and time dimensions of the data may allow for an improved estimation of crop traits, as well as the effects of genetics and the environment on these traits. Multispectral and geometric time series imagery was collected by UAV on 11 dates, along with ground-truth data, in a ﬁeld trial of 866 genetically diverse biomass sorghum accessions. We compared the performance of Convolution Neural Network (CNN) architectures that used image data from single dates (two spatial dimensions, 2D) versus multiple dates (two spatial dimensions + temporal dimension, 3D) to estimate lodging detection and severity. Lodging was detected with 3D-CNN analysis of time series imagery with 0.88 accuracy, 0.92 Precision , and 0.83 Recall . This outperformed the best 2D-CNN on a single date with 0.85 accuracy, 0.84 Precision , and 0.76 Recall . The variation in lodging severity was estimated by the best 3D-CNN analysis with 9.4% mean absolute error ( MAE ), 11.9% root mean square error ( RMSE ), and goodness-of-ﬁt ( R 2 ) of 0.76. This was a signiﬁcant improvement over the best 2D-CNN analysis with 11.84% MAE , 14.91% RMSE , and 0.63 R 2 . The success of the improved 3D-CNN analysis approach depended on the inclusion of “before and after” data, i.e., images collected on dates before and after the lodging event. The integration of geometric and spectral features with 3D-CNN architecture was also key to the improved assessment of lodging severity, which is an important and difﬁcult-to-assess phenomenon in bioenergy feedstocks such as biomass sorghum. This demonstrates that spatio-temporal CNN architectures based on UAV time series imagery have signiﬁcant potential to enhance plant phenotyping capabilities in crop breeding and Precision agriculture applications.


Introduction
Lodging negatively affects yield in many crops [1][2][3]. This damage, defined as the displacement of plant stems from an upright position [4], is mainly induced by the strong winds of extreme weather events during the growing season of the crop. Lodging negatively impacts plant morphology [5], physiological processes [6], and growth [7], as well as impeding harvesting activities, to significantly reduce the final above-ground yield of the crop [8][9][10][11]. Lodging damage can also introduce the error or loss of phenotypic data used for selection in breeding, genetic analysis or management decisions [9,10]. Due to its rapid vertical growth [12] and tall stature of up to 3-4 m, biomass-type Sorghum bicolor (L.) A biomass sorghum experiment with an augmented incomplete block design of 1118 plots was planted on 31 May 2019, at the University of Illinois Energy Farm research facility, Urbana-Champaign (40.065789° N, −88.208477° W). The trial included (866) diverse accessions of biomass sorghum in single, four-row plots. In addition, six additional "check" accessions were planted in 16 four-row plots distributed across each of the 16 blocks, as was previously described in detail [11]. A storm with strong winds produced lodging in plots across the experiment 82 days after planting (DAP), as shown in Figure 1. The ground-truthing survey was implemented immediately after the lodging event between 83-85 DAPs.

Lodging Assessment
We utilized a two-stage visual assessment of lodging to provide the data needed for model training and validation. The steps were to determine: 1) whether each plot was lodged or non-lodged; and 2) to estimate the severity of lodging, in the plots where it occurred, in terms of a lodging severity score initially suggested by [10] and adapted in [25]: Lodging score (1-100%) = Lodged area of plot (%) x Lodging severity/3 (1) Lodging severity was scored on a three-point scale based on the visual evaluation of the average inclination angle from vertical plant stems in the lodged area of the plot, where a value of 1 corresponded to mild lodging (0-30°), 2 corresponded to moderate lodging (30-60°), and 3 corresponded with severe lodging (60-90°), as shown in Figure  1. Lodging scores had a continuous distribution once the categorical lodging severity was multiplied by the fraction (%) of the plot that was estimated to be lodged.

Lodging Assessment
We utilized a two-stage visual assessment of lodging to provide the data needed for model training and validation. The steps were to determine: (1) whether each plot was lodged or non-lodged; and (2) to estimate the severity of lodging, in the plots where it occurred, in terms of a lodging severity score initially suggested by [10] and adapted in [25]: Lodging score (1 − 100%) = Lodged area of plot (%) × Lodging severity/3 (1) Lodging severity was scored on a three-point scale based on the visual evaluation of the average inclination angle from vertical plant stems in the lodged area of the plot, where a value of 1 corresponded to mild lodging (0-30 • ), 2 corresponded to moderate lodging (30-60 • ), and 3 corresponded with severe lodging (60-90 • ), as shown in Figure 1. Lodging scores had a continuous distribution once the categorical lodging severity was multiplied by the fraction (%) of the plot that was estimated to be lodged.

Aerial Data Collection and Preprocessing
A RedEdge-M (Micasense, Seattle, WA, USA) multispectral sensor (MSI) was used for imaging to acquire both geometric and multispectral features. The camera included five spectral bands in the blue (465 to 485 nm), green (550 to 570 nm), red (663 to 673 nm), rededge (712 to 722 nm), and near-infrared (820 to 860 nm) regions of the electromagnetic spectrum. The aerial platform utilized was a Matrice 600 Pro hexacopter (DJI, Shenzhen, China), equipped with a Gremsy T1 gimbal (Gremsy, Ho Chi Minh, Vietnam) used to mount the MSI sensor. Flights were conducted 11 times in the season between DAPs 31 and 97 under clear sky conditions and around solar noon. Flight altitude was set to 40 m above ground level, resulting in a ground sampling distance (GSD) of 2-3 cm/pixel. Flight planning considered a 90% forward and 80% side overlapping during image acquisition to ensure successful image postprocessing. Nine black and white square panels (70 cm × 70 cm) were distributed in the field and utilized as ground control points (GCPs). A RTK (real-time kinematic) survey was implemented using a Trimble R8 global navigation satellite system (GNSS) integrated with CORS-ILUC local station to survey the GCPs for accurate geo-referencing and co-registration of imagery throughout the season. A standard Micasense calibration panel was imaged on the ground before and after each flight for spectral calibration. Images were further imported into Metashape version 1.7.4 (Agisoft, St. Petersburg, Russia) to generate the two types of aerial image features: (1) multispectral and (2) geometric 3D reconstruction of the canopy known as crop surface models (CSMs) orthophotos.
An early season flight before plant emergence was utilized as ground level reference for the extraction of the absolute height in the CSM files on subsequent sampling dates (n = 11). The resulting multispectral and CSM orthophotos were resampled to a common 5 cm/pixel resolution and stacked into one multi-feature and multi-date file object. Image chips for each plot were generated by clipping the orthophoto stack object using plot polygons shapefile corresponding to the experimental plot dimensions (2.8 m × 2.8 m), as shown in Figure 2. Three variations of the image chips were produced: (1) spectral features from RGB images (RGB); (2) spectral features from RGB, red-edge and NIR bands (MS); and (3) spectral and geometric features from RGB, red-edge and NIR bands along with a crop surface model (CSM_MS) ( Table 1). These steps were implemented via geopandas, shapely, and rioxarray libraries in Python 3.7.11.

CNN Modelling
The core component of the CNN algorithm is the convolution operation, where a set of trainable kernels are applied to the input image to generate a set of spatial features

CNN Modelling
The core component of the CNN algorithm is the convolution operation, where a set of trainable kernels are applied to the input image to generate a set of spatial features that describe the target predictor [31]. The model learns basic features in the first layers and more complex feature representations at deeper layers. The final output of a CNN is a set of features maps; depending on the use case, these can be directly utilized or fed to a fully connected layer either in classification or regression tasks. The 3D-CNN architecture is a variation of the traditional 2D architecture [32]. It utilizes all the standard CNN architecture features but adopts a third dimension, which in this case was the temporal descriptor of the target trait across the stack of images. The 3D-convolution layer operation was applied with a learnable three-dimensional kernel considering the spatio-temporal characteristics of the image chip as width (number of pixels) × height (number of pixels) × depth (time sequence length).

CNN Implementation and Metrics
Customized CNN architectures were utilized in this study. After trial and error for lodging detection, the backbone feature generator of the 2D-CNN architecture included two 2Dconvolution layers and one flatten layer ( Figure 3a). Meanwhile, for lodging severity, it considered three 2D-convolution layers and one flatten layer ( Figure S1a). The rest of the architecture included a fully connected sigmoidal (classification) or linear (regression) layers heads, respectively. The 2D-convolution's kernel filter size was set to (3 pixels × 3 pixels × N features). The 3D-CNN architecture considered two 3D-convolution layers (Figure 3a and Figure S1b). The 3D-convolution's kernel filter moved in three dimensions instead of two, as with the 2D-convolution filter, not only from left to right and from top to bottom, but also forward and backward. Thus, its kernel's size was set to (3 × 3 × 2-9-time sequence lengths) operating at the sequence of each feature over time. The following settings were common to all CNN architectures: convolution layers set to 32 features, zero padding, stride equal to one with no overlapping, and rectified linear unit as activation function were considered (Figure 3 and Figure S1). Batch normalization in the first convolution layer, 2D-max-pooling was set to two following each convolution, with 3D-max-pooling equal to two over the spatial dimensions, equal to one over the depth dimension, and equal to two following the last convolution operation of the 3D architecture, and 50 percent features dropout prior to the flatten layer step was also adopted.
A 2D-CNN model analysis was conducted on data from individual dates, either before or after the lodging event. This tested the importance of the timing between the lodging event, ground-truthing, and aerial data collection. The following four dates of flights were tested: where the values of 1 in parentheses indicate the analysis of data from a single date. where the values of 2 to 9 in parentheses indicate the analysis incorporating data from between two and nine dates. CNN models were implemented in Python Keras API, Tensorflow-GPU version 2.7.0 using a NVIDIA GeForce RTX 3070 (8GB of GPU memory). The combination of different types of features, CNN architectures, and dates of flights or time sequence lengths resulted in the implementation of 60 models (Table S1). Half of these models were used to assess lodging detection, while the other half assessed lodging severity. In each of In addition, various configurations of input data from sequences of dates were analyzed using 3D-CNN architecture to test the sensitivity of predictions to variations in the length of the time sequence and the contribution of dates before and after the lodging event.
The following six sequence configurations were built: • DAPs 69,83: 13 days prior plus 1 day after lodging, labeled as '69_83 (2) where the values of 2 to 9 in parentheses indicate the analysis incorporating data from between two and nine dates. CNN models were implemented in Python Keras API, Tensorflow-GPU version 2.7.0 using a NVIDIA GeForce RTX 3070 (8GB of GPU memory). The combination of different types of features, CNN architectures, and dates of flights or time sequence lengths resulted in the implementation of 60 models (Table S1). Half of these models were used to assess lodging detection, while the other half assessed lodging severity. In each of these tasks, 12 models were based on the 2D-CNN architecture, as a result of combining three types of features (Table 1) and the four dates of aerial data collection. The remaining 18 models were based on the 3D-CNN architecture, as a result of combining three types of features (Table 1) and six different configurations based on different lengths of time and the inclusion of data from dates before or after the lodging event. Each model fitting was iterated 10 times using a random training and testing partition to ease the convergence of the models' prediction metrics.
The full set of 1118 image chips were used to assess lodging detection, with 662 plots labeled as non-lodged, and 456 plots labeled as lodged during on-the-ground field surveying. The 456 plots labeled as lodged were further considered to assess lodging severity. In both tasks, training dataset was split (70:30) into training and testing datasets. The training dataset was split further (80:20) into training and validation datasets. The validation dataset was used to optimize the models' performance and prevent overfitting during training. Binary cross-entropy and mean square error were used as loss functions in each of the tasks, respectively. Models were trained up to 400 epochs using the early stop setting with lambda value 0.01 to ensure efficient training time. Learning rate and decay parameters were initialized with 0.01 and 0.001 values, respectively. The test dataset was utilized to expose the models to unseen data to evaluate the generalization ability of the models.
The overall accuracy (OA), Precision, and Recall were utilized as evaluation metrics to determine the performance of the models on lodging detection. The metrics are described in Equations (2)-(4): In Equations (2)-(4), TP (true positive) is defined as lodged plots correctly classified as lodged plots. TN (true negative) is defined as non-lodged plots correctly classified as non-lodged plots. FP (false positive) is defined as non-lodged plots incorrectly classified as lodged plots. FN (false negative) is defined as non-lodged plots incorrectly classified as lodged plots.
The mean absolute error (MAE), the relative root mean square error (RMSE) and coefficient of determination (R 2 ) were utilized as evaluation metrics to determine the performance of the CNNs in lodging severity score. Each metric is described in Equations (5) and (6): In Equations (5)- (7), y,ŷ, and y are the observed, predicted, and observed mean lodging severity scores values of the ith plot, and n is the total number of samples within the study area. The computation time and GPU memory used were computed using a custom callback function and custom calls to NVIDIA-smi in Python during model implementation, respectively. Time was computed as seconds per epoch during model training, and the GPU memory usage was computed as the maximum percentage of GPU used during model implementation.
Activation mapping visualization [33] was implemented to better understand the ability of the CNNs to use the information contained in the image arrays to deliver the output prediction. The importance of the image regions in the prediction can be identified by projecting back the weights of the output layer on to the convolutional feature maps, using this technique [30]. The following steps were used to generate activation maps in both tasks, as described in [34]. First, the CNN model was utilized to map the input image to the activations of the last convolution layer as well as the output predictions. Then, the gradient of the predicted class (classification) or value (regression) for the input image, with respect to the activations of the last convolution layer, was computed. Lastly, each image channel in the feature map array was multiplied by how important this channel was with regard to the predicted class or value, then all the channels were summed to generate the corresponding activation map. In essence, this provides a measure of how strongly portions of the image contribute to the predictions made by the CNN, and it can be expressed and visualized as a false-color scale.

Lodging Detection
The overall accuracy (OA) of 2D-CNN predictions of the lodging that occurred during the storm 82 DAP was very limited when analyzing images collected seven days prior to the lodging event used as a reference baseline (75 DAP; 0.61 to 0.73). After this, it increased substantially for images collected one day after the lodging event (83 DAP; 0.82 to 0.85), and then declined to moderate levels for images collected six to fifteen days after the lodging event (88 and 97 DAP; 0.73 to 0.75; Figure 4a). This variation in OA was associated with parallel changes in both Precision (Figure 4c) and Recall (Figure 4e).

Lodging Severity
When predicting lodging severity, the patterns of variation in the performance of 3D-CNN and 2D-CNN analyses across the various data input scenarios mirrored those observed for lodging detection, but the benefits of 3D-CNN over 2D-CNN were greater (Figures 5a-f). This was specifically the case when the additional information found in the geometric and spectral features of CSM_MS data were leveraged over the progressively There were only modest effects on the OA of the 2D-CNN analysis when the number of image features was increased by adding multispectral or geometric information to RGB, except when analyzing pre-lodging images (75 DAP; Figure 4a). In that specific case, adding CSM data increased OA mainly by increasing Precision (Figure 4c), but also through gains in Recall (Figure 4e).
Applying a 3D-CNN analysis to images from multiple sampling dates produced modest improvement in the OA for lodging detection compared to a 2D-CNN analysis of data from single dates (Figure 4a). OA was greatest (0.86-0.88) when the 3D-CNN only analyzed imagery collected from 2 to 3 dates that straddled the lodging event (Figure 4b). However, performance only declined marginally when images from more dates, both before and after the lodging event, were included, and results were especially robust when geometric features were combined with spectral features (CSM_MS) compared to RGB or multispectral (MS) features alone (Figure 4a). This was consistent with the greatest gains in the performance of 3D-CNN over 2D-CNN occurring when using CSM_MS data compared with MS or RGB. The geometric features included in CSM_MS were particularly important for maintaining high levels of Precision and Recall, which decreased for 3D-CNN analysis of RGB and MS as the number of dates of data collection increased (Figure 4d-f). When considering the computation resources, the time required for 2D-CNN (Figure 4g) was significantly lower than 3D-CNN (Figure 4h). The computation time of 3D-models followed an exponential response when increasing the number of flights, with a maximum of 9 seconds per epoch (Figure 4h). A progressive increase in computation time between 3D-RGB, MS and CSM_MS using 3-9 numbers of flights was also notable (Figure 4g). The GPU usage was 40-50% for 2D-CNN ( Figure S2a) and markedly higher 75-94% for 3D-CNN models with a slow increase in GPU usage when increasing the number of flights in the models ( Figure S2b).

Lodging Severity
When predicting lodging severity, the patterns of variation in the performance of 3D-CNN and 2D-CNN analyses across the various data input scenarios mirrored those observed for lodging detection, but the benefits of 3D-CNN over 2D-CNN were greater (Figure 5a-f). This was specifically the case when the additional information found in the geometric and spectral features of CSM_MS data were leveraged over the progressively simpler image features of MS, and then RGB, data by the 3D-CNN (Figure 5b,d,f). The predictive ability of the 3D-CNN analyses of CSM_MS data was comprehensive (MAE = 9.44-10.06%, RMSE = 11.88-12.62%, R 2 = 0.72-0.76) whenever imagery from dates before and after the lodging event (83 DAP) were used. Predictive ability declined only slightly from its maximum when images from only two dates were used or when images from across nine dates were used (Figure 5b,d,f). By comparison, when CSM_MS data from only the three dates following the lodging event were used, R 2 decreased and error increased (Figure 5a-f).
Despite being significantly less capable than the best 3D-CNN models, for the 2D-CNN analysis of images from one day after the lodging event that occurred, 82 DAP was able to predict lodging severity moderately well (MAE = 11.84%, RMSE = 14.90%, R 2 = 0.63; Figure 5a,c,e). However, the ability to predict lodging severity progressively declined using imagery that was collected six, and then fifteen, days after the lodging event (Figure 5a,c,e). Additionally, almost no variance in lodging severity across the sorghum population could be explained by a 2D-CNN using images collected 7 days prior to the lodging event. As with 3D-CNN models, the extra spectral and geometric features in the CSM_MS dataset, compared to the MS and RGB data, aided improved predictions of lodging severity. The computation time required for 2D-CNN (Figure 5g) was significantly lower than 3D-CNN (Figure 5h). The computation time of 3D models followed an exponential response when increasing the number of flights, with a maximum of 3.5 s per epoch (Figure 5h). A progressive increase in computation time between 3D-RGB, MS and CSM_MS using 3-9 number of flights was also notable (Figure 5g). The GPU usage was 24-43% for 2D-CNN ( Figure S2c) and markedly higher (68-91%) for 3D-CNN models, with a more markedly increased allocation of GPU usage when increasing the number of flights in the models ( Figure S2d).

Interpretation of CNNs Predictions via Activation Mapping Visualization
The superior performance of 3D-CNN analysis over 2D-CNN analysis can be visualized for the scenarios of each type with greatest predictive power (i.e., 2D-83 and 3D-69:83 for CSM_MS) using receiver operating characteristic (ROC) curves, observed versus predicted values for both lodging traits, and by the interpretation of the activation maps (Figure 6a-e).  MAE (a,b), RMSE (c,d) and R 2 (e,f) metrics. Computation time for 2D models (g) and 3D models (h) with a batch size of 32 samples during model training step.

Interpretation of CNNs Predictions via Activation Mapping Visualization
The superior performance of 3D-CNN analysis over 2D-CNN analysis can be visualized for the scenarios of each type with greatest predictive power (i.e., 2D-83 and 3D-69:83 for CSM_MS) using receiver operating characteristic (ROC) curves, observed versus predicted values for both lodging traits, and by the interpretation of the activation maps (Figure 6a-e). The ability of 3D-CNN analysis to classify lodged plots correctly (true positives) without incorrectly considering plots to be lodged (false positives) is clear in the ROC curves (Figure 6a). Similarly, 3D-CNN analysis predicted lodging severity with less error and bias than 2D-CNN models (Figure 6a,b). Both types of models tend to overestimate the predicted values in plots where mild lodging occurred (0-30% of lodging severity), but this pattern was less of an issue for the 3D-CNN models.
Activation maps indicated that 2D-CNN and 3D-CNN models relied on similar regions within images for lodging detection or prediction of lodging severity (Figure 6d,e). Areas of high activation were visibly located in the portions of images where lodging was most severe (Figure 6d). However, it is notable that 3D-CNN models, for both detection and severity estimation, were able to activate a larger region around the area where plants were damaged. Additionally, in the case of severe lodging (100%), the 3D-CNN model tended to fine-tune the activation level in accordance with heterogeneity in lodging within the plot (Figure 6e). The ability of 3D-CNN analysis to classify lodged plots correctly (true positives) without incorrectly considering plots to be lodged (false positives) is clear in the ROC curves (Figure 6a). Similarly, 3D-CNN analysis predicted lodging severity with less error and bias than 2D-CNN models (Figure 6a,b). Both types of models tend to overestimate the predicted values in plots where mild lodging occurred (0-30% of lodging severity), but this pattern was less of an issue for the 3D-CNN models.
Activation maps indicated that 2D-CNN and 3D-CNN models relied on similar regions within images for lodging detection or prediction of lodging severity (Figure 6d,e). Areas of high activation were visibly located in the portions of images where lodging was most severe (Figure 6d). However, it is notable that 3D-CNN models, for both detection and severity estimation, were able to activate a larger region around the area where plants were damaged. Additionally, in the case of severe lodging (100%), the 3D-CNN model tended to fine-tune the activation level in accordance with heterogeneity in lodging within the plot (Figure 6e).

Discussion
This study analyzed two spatial dimensions and a temporal dimension in data from a time course of UAV remote sensing images, with a 3D-CNN architecture to facilitate improvements in lodging detection and the prediction of lodging severity relative to standard 2D-CNN architecture (Figures 4 and 5). Modest gains were made in the qualitative task of lodging detection through the use of a 3D-CNN compared to a 2D-CNN, which had an overall accuracy of 0.79-0.88 depending on the number of spectral and geometric features of the input data. In contrast, when estimating quantitative variation in lodging severity, there could be significant benefits to using a 3D-CNN over a 2D-CNN. These gains in prediction ability were realized when a 3D-CNN was used on feature-rich data containing both geometric and spectral information, i.e., when the CSM or red-edge and NIR wavebands were dropped from the input data, the benefits of 3D-CNN over 2D-CNN were limited. These findings highlight how leveraging the extra information captured in remote sensing data that include more spectral information and multiple timepoints can depend on use of more complex analytical approaches. The successful application of this approach for lodging severity screening in biomass sorghum provides proof-of-concept for a tool to rapidly assess a trait that is of significant interest to breeders because it impacts harvestable yield, but it is very challenging to assess on a large scale by traditional methods of visual inspection [15]. Therefore, significant gains in the efficiency of crop improvement could be achieved. In addition, the effective analyses of imagery from UAVs would provide farmers with a tool that is accurate, while also much faster than manual field inspections of large acreages for assessing lodging damage and deciding on the relative value of harvesting a crop or generating evidence to support an insurance claim [16]. Otherwise, farmers are often advised to harvest early to avoid lodging, causing a reduction in potential productivity by shortening the growing season. More broadly, the lessons learned here may be relevant to other applications of time series remote sensing imagery, such as the assessment and evaluation of damage from seismic activity [35].
In recent years, agricultural studies have increasingly made use of a variety of imaging methods based on UAV-based remote sensing data. In the domains of plant breeding [36] and Precision agriculture [37], this growing popularity reflects the need for the accurate assessment of crop traits (1) across many genotypes and at spatial scales that are too large for manual measurement to be efficient, yet (2) at spatial scales where the resolution and number of image features (spectral and/or geometric) from piloted aircraft or satellites can be inadequate. The independent analysis of data from individual dates of image collection from UAVs has been the most common approach to date. For example, quantitative thresholding approaches have been applied in this way to aerial images to assess lodging severity in barley (R 2 = 0.94) [38] and lodging rate in maize (R 2 = 0.50) [39]. Success has also been achieved in estimating lodging severity by training and applying CNN analysis to a combined set of images from two developmental stages of wheat [22]. These studies demonstrate the potential for the rapid assessment of lodging by UAV, but are limited by the need for manual intervention to select optimal threshold values in workflow and by not leveraging the temporal variation in the analysis of data. They may also provide an early indication that estimating lodging severity is more challenging in tall, lower density, and coarsely textured canopies such as maize relative to shorter, more uniformly dense and fine-textured canopies such as rice, wheat and barley. If so, biomass sorghum is a good subject for testing methods that can cope with a difficult use case.
The results of the 2D-CNN analysis of images from different individual dates during the growing season provided valuable context for the interpretation of the 3D-CNN results. Lodging severity could be predicted with reasonable accuracy when using images collected one day after the lodging event (Figure 5a). Unsurprisingly, there was almost no useful information to be gained from the 2D-CNN analysis of images collected prior to the lodging event. Additionally, the predictive ability of 2D-CNN analysis declined significantly and progressively when applied to images collected six, and then fifteen, days after the lodging event. This pattern is consistent with the common observation that the continued growth of lodged plants is driven by phototropism to be vertical nature and, therefore, less easy to distinguish from an undisturbed canopy in aerial imagery, even if much of the stem and biomass remains disturbed.
The superior performance of a 3D-CNN over a 2D-CNN for estimating lodging severity appears to depend on leveraging a "before and after" comparison of the event, i.e., prediction accuracy measured in terms of R 2 , MAE or RMSE was best when data from one or two dates before and after the lodging event were analyzed ( Figure 5). This is logical given that a change in the CSM (i.e., height profile) or spectral reflectance of the canopy before and after the lodging event would provide valuable additional information beyond the variation in geometric and spectral signals across plots on a single date once lodging had occurred. This interpretation is supported by the 3D-CNN analysis of a time-course of images from three dates after the lodging event, failing to outperform the best 2D-CNN analysis ( Figure 5).
Nevertheless, provided that the "before and after" comparison was made, the performance of the 3D-CNN was relatively insensitive to the number of sampling dates used prior to the lodging event ( Figure 5). Even the analysis of data from nine dates between 31 to 83 DAP found a lodging severity prediction that was almost as accurate (MAE = 9.99%, RMSE = 12.68%, R 2 = 0.72), and the same was true when only the two dates before and after the lodging event were analyzed (MAE = 9.65%, RMSE = 12.40%, R 2 = 0.75). This suggests that significant error is not introduced into the 3D-CNN analysis when data from early in the season that have a low information value with respect to lodging are included. Additionally, as a practical consideration, it allows for the effective analysis of lodging without manual intervention in order to select dates of data collection nearest to the lodging event.
The notion that a 3D-CNN analysis would improve UAV-based phenotyping beyond what 2D-CNN is capable of is based, in part, on recent satellite remote sensing research. Satellites are an ideal source of automatically generated time courses of remote sensing data [40]. Additionally, as an example, the inclusion of a third convolution layer in CNN architecture to integrate spatial and time dimensions in satellite data enabled improved crop classification at a regional scale [28]. The 3D-CNN analysis of spatio-temporal models was applied to UAV remote sensing for only a limited number of traits and crops. The applicability of the 3D-CNN analysis of spatio-temporal data to predict yield [18] is consistent with yield being a product of compound growth and dynamic interactions with the environment over the entire growing season. However, the black-box nature of CNN analysis makes it hard to pinpoint which elements of the 3D-CNN and time course of imagery drive enhancements in performance. By contrast, the abrupt nature of lodging allows the current study to provide a case study of how 3D-CNN analysis can enhance power through the inclusion of only two timepoints of data "before and after" an event of interest in the life of the crop.
This study reveals the value of combining spectral and geometric information in feature-rich image data, which are necessary to better leverage the benefits of 3D-CNN over 2D-CNN for the estimation of lodging severity ( Figure 5). Spectral information previously proved to be a relevant descriptor of lodging in crops such maize [38], wheat [25], and barley [9]. The visual and near-infrared regions of the spectrum have been the most common proxies for lodging detection [24] and lodging severity [23]. Geometric features derived from the photogrammetric reconstruction of the canopy have been less commonly used in these tasks [2,4,41]. However, the "before and after" comparison of geometric information in CSM was crucial for the enhanced predictive ability of 3D-CNN over 2D-CNN. This parallels and extends the successful integration of geometric and spectral information to improve predictions of traits such as biomass [12,42] and plant height [43].
It was also notable that the benefit of integrating geometric and spectral features was greater for lodging severity than for lodging detection [36,39], regardless of the CNN architecture considered. This is consistent with previous findings [44,45], where the spectral information alone proved highly efficient in detecting lodging as a pixel-wise classifica-tion [22]. Additionally, this is likely, in part, explained by the fact that the geometric features are highly sensitive and less prone to signal saturation than spectral features [46], which becomes relevant when considering small differences in the degree of lodging severity between plots under dense canopy conditions. Even though, the computational cost of both traits cannot be directly compared given the fact that the sample sizes and their 2D architectures differ slightly. The improvement in predictive performance by 3D-CNN implied a higher computational cost. Due to the relative higher improvement of these models on lodging severity rather than lodging detection, the relative computation cost was lower for this trait. The higher performance in this study did not fall into the larger time sequence lengths, but the exponential increase in the computation cost of the 3D architectures at long time series cannot be overlooked, and this is still a computational challenge and active area of research [47,48].
According to the activation maps, both CNN architectures utilized similar regions of images to determine lodging detection and severity. However, the greater predictive power of the 3D-CNN did coincide with larger and more refined activation regions in the images compared to the 2D models maps. In this regard, it can be argued that the best 3D-model is able to extract additional informative descriptors from the spatial-temporal features during 3D-convolution operations and, thereby, produce an improved prediction.
Even though we utilized manual ground truths to build the models, we recognize that a visual observation and quantification, by walking through the alleys and plots after lodging, was subject to errors that may affect the performance of the model. These could be improved by utilizing ground-based imaging methods such as LIDAR for groundtruthing, but this was beyond the scope of our study. Future research might further exploit the additional ground truth data of a quantitative nature for the use of spatio-temporal CNN architectures to better understand the capacity of certain genotypes to recover after lodging damage. Additional analyses are also needed to determine the impact of the spatial resolution of imagery and how to reduce the processing time of data preprocessing and modelling implementation of 3D-CNN architectures.

Conclusions
This study aimed to explore the potential improvements in crop phenotyping with UAV remote sensing imaging that could be made with 3D-CNN versus 2D-CNN. Lodging detection and lodging severity in biomass sorghum were chosen as target traits for the case study because their short-term, stochastic nature present a distinct set of challenges and opportunities for interpretation versus traits such as yield, which are determined over an entire growing season. In addition, lodging in highly productive bioenergy feedstock grasses and other cereal crops is agronomically important and inefficient to assess by traditional methods. It has previously been shown that lodging detection prediction with 2D-CNN is feasible and produces results accurate enough to monitor lodging. This study shows that adding an explicit analysis of time as an additional dimension of variation in the spectral and geometric features imagery derived from UAV flights produces (1) a modest improvement in the predictions of lodging detection, but (2) significant gains in the more challenging task of estimating quantitative variation in lodging severity. Using a small number of data collection dates before and after the lodging event provides enough data to outperform a single time-point model prediction in both tasks. However, using images from many dates across the growing season does not substantially impair predictive ability. This work partly addresses the need for advanced modelling techniques that can take advantage of recently improved spatial resolution, the ease of collecting data, cloud storage and processing for UAV remote sensing.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/rs14030733/s1, Figure S1: Diagram of custom CNN architectures utilized to assess lodging severity, 2D-CNN (a) and 3D-CNN (b). Figure S2: Utilization of GPU memory during models training, validation, and testing steps for lodging detection prediction via 2D-models (a) and 3Dmodels (b), and lodging severity prediction via 2D-models (c) and 3D-models (d), respectively. Table S1: Description of the 60 CNN models implemented according to combinations of UAV-based image features, CNN architectures, and target traits.
Author Contributions: S.V. and A.D.B.L. conceived the study, interpreted the data, and wrote the manuscript. T.L.P. established, maintained and harvested the field trial. S.V collected, processed and analyzed data. All authors have read and agreed to the published version of the manuscript.