1. Introduction
Precise and accurate definition of the timing of crop phenology is of vital importance for agriculture. Spectral information collected by remote sensing satellites is readily used to track seasonal patterns and changes in vegetation greenness. Establishing a relationship between this spectral information and crop phenology can allow for assessment of crop development and growth dynamics across agricultural landscapes. The spectral reflectance of the plant canopy varies depending on vegetation characteristics. As crops grow, they undergo physical changes such as increased leaf area, flowering, or higher chlorophyll content, which are detectable in remotely sensed data [
1,
2,
3]. Indices such as the Normalized Difference Vegetation Index (NDVI) are sensitive to these changes in plant characteristics. Analyzing these spectral changes can provide insights into crop growth and condition and thereby support crop monitoring and management.
However, linking these remote sensing observations with crop growth stages in a meaningful way for agricultural practitioners is often challenging [
1,
2,
3]. Many existing phenology detection approaches using thresholds, derivatives or trend information offer only a broad overview of phenology (e.g., green up) or focus on stages with distinct features (e.g., rape flowering) [
4]. Thresholding identifies phenological events by predefined value limits on vegetation indices (e.g., [
5,
6,
7,
8,
9,
10]), while a derivative approach captures rapid changes or the shape of a time series which indicate certain phenological transitions (e.g., [
11,
12,
13]). Trend analysis monitors features like slope to assess crop development over time (e.g., [
14,
15,
16]). Although these approaches can correlate with growth scales (e.g., BBCH), there is a gap between transitions observed remotely (e.g., in time series of vegetation indices such as NDVI) and ground-observed phenological changes identifiable on a growth scale [
17]. The inability of these approaches to capture these finer in situ phenological growth stage changes (micro-stages) limits their practical application for growers [
1,
2]. To enhance the usability of remote sensing for agricultural applications, it is imperative that concurrent growth scale stages are identifiable at the field level or even sub-field level in the case of heterogeneous growing conditions.
In the past, image frequency and cloud cover were often limiting factors in the detection of crop growth stages [
1]. The advent of harmonized analysis-ready data, at daily or near-daily frequencies, has been shown to facilitate phenological growth stage classification with high precision to a ground-measured growth scale useful for agronomic applications [
18,
19,
20]. The combination of sensors to increase temporal frequency and spatial accuracy for detecting key phenological stages is especially promising, as shown by [
6,
11]. High-frequency data provide more continuous monitoring of changes in crop growth also during the season that could be missed with less frequent observations. This high temporal resolution allows for the detection of incremental changes and short-term phenomena, enabling more detailed monitoring of crop growth at a finer scale. However, the majority of these studies work only in the spatial domain, using machine learning techniques to classify single images to a growth stage [
20]. There is potentially further information in the temporal domain that is unused by these approaches. These high cadence time series have yet to be fully exploited for the monitoring of phenological micro-stages.
Obtaining a large volume of high-quality micro-stage labels with high variability (e.g., covering many regions and years) at the field level is a challenge [
21,
22,
23]. Available datasets are often coarse, recorded at the macro level and not for specific fields, or dense but highly localized on a small number of fields. This lack of quality ground truth labels precludes the use of complex deep or machine learning approaches on satellite data. Studies have shown that Dynamic Time Warping (DTW) can achieve high accuracy with limited training data and is robust in using phenological changes in a time series to classify vegetation features (e.g., crop type and forest type) [
24]. DTW uses templates as a reference pattern; unseen sequences are compared to these templates and classified based on similarity [
25]. In agriculture, DTW is predominantly used in crop type classification [
26,
27] but has more recently been used for crop growth stage identification tasks. Zhao et al. used Time-Weighted DTW to identify limited growth stages (e.g., green-up, heading, maturity) with distinct characteristics at a regional scale from MODIS NDVI data for winter wheat in China [
28]. Similarly, Ye et al. used Derivative DTW to detect principal growth stages from a ground-based scale (BBCH macro-stages) on a limited number of corn fields from Sentinel-2 data [
29].
These studies generate a reference template by averaging NDVI time series. This approach could have a smoothing effect on the time series, removing important features. Additionally, vegetation indices other than NDVI may carry useful information for some growth stages. We have noticed that some macro-stages can span several weeks, depending on weather and growing conditions, making it challenging for DTW to accurately match observations at the beginning and end of these stages. The potential of DTW to identify crop growth at micro-stage precision remains unexplored. DTW combined with higher-cadence, gap-filled imagery has the potential to deliver these finer measurements of phenological change. Finally, the robustness of the approach to unseen geographies is also yet to be assessed. Most studies are limited to a very small number of fields [
29].
To address these gaps and advance the field, our aims are as follows:
Compare various template selection strategies and propose the use of multiple field observations during the detection of the phenological stage.
Identify and delineate 70 micro-stages of maize growth using a comprehensive dataset comprising over 200 fields and 3200 observations.
Explore the effectiveness of different vegetation indices and image bands for accurate growth stage identification.
Evaluate the performance of algorithms using near-daily-temporal-resolution harmonized data obtained from two datasets which combine different sensors: Planet Fusion (PF) and Harmonized Landsat Sentinel-2 (HLS).
Assess the generalizability and effectiveness of the proposed method in a second, distinct geographical region, thus expanding the scope of applicability beyond the initial study areas.
3. Methodology
3.1. Preparation of the Time Series
Unlike most studies in the literature, we aimed to understand the potential of different spectral features selected or computed from the imagery for the phenology detection problem. Therefore, this paper focused on the individual spectral bands (green, red, and NIR) that are most sensitive to vegetation dynamics, along with the most commonly used spectral indices, including the Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), 2-band Enhanced Vegetation Index (EVI2), Kernel Normalized Difference Vegetation Index (kNDVI), Modified Chlorophyll Absorption Ratio Index (MCARI), Chlorophyll Vegetation Index (CVI) and Normalized Difference Water Index (NDWI). These indices were chosen for their proven effectiveness in assessing various aspects of vegetation, such as chlorophyll content, photosynthetic activity, structural attributes and water content, providing comprehensive insights into the physiological status of the vegetation.
The formulas of the selected indices are given in
Table 2.
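As an illustration of how such inputs can be derived from the green, red and NIR bands, the sketch below computes several of the listed indices. It uses commonly cited textbook definitions, which are an assumption here and may differ from the exact formulas in Table 2: NDWI is taken in its green/NIR form, kNDVI in its simplified tanh(NDVI²) form, and MCARI is omitted because its usual definition needs a red-edge band. The function name `spectral_indices` is hypothetical.

```python
import numpy as np

def spectral_indices(green, red, nir, blue=None):
    """Compute vegetation indices from surface reflectance (0-1 scale).

    ASSUMPTION: standard literature formulas, not necessarily those of Table 2.
    """
    green, red, nir = (np.asarray(b, dtype=float) for b in (green, red, nir))
    ndvi = (nir - red) / (nir + red)
    out = {
        "NDVI": ndvi,
        "EVI2": 2.5 * (nir - red) / (nir + 2.4 * red + 1.0),
        "kNDVI": np.tanh(ndvi ** 2),            # simplified kernel NDVI
        "NDWI": (green - nir) / (green + nir),  # green/NIR variant
        "CVI": nir * red / green ** 2,
    }
    if blue is not None:
        # EVI additionally needs the blue band.
        blue = np.asarray(blue, dtype=float)
        out["EVI"] = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    return out
```

Each returned array has the same shape as the input bands, so the function can be applied to a per-field median time series or to whole image stacks.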
3.2. Dynamic Time Warping with Weighted Average
In this study, we aimed to develop an algorithm to detect the micro-stages given in Table 1. For a target field with time series t and each micro-stage to be detected, the algorithm finds the available observations and related fields in the training dataset. The time series of these fields are used as our templates, S. To achieve our goal, we leverage DTW for crop phenology detection. DTW is a method used to measure the similarity between two time series, here t and a template from S, which may differ in timing or intensity, as NDVI values do. Our DTW approach employs the Mori step pattern [
47] for managing intricate time relationships, aiding in capturing subtle changes in phenological transitions. Additionally, the Itakura parallelogram [
48] windowing function enables the handling of time warping in signals exhibiting non-linear temporal distortions.
The process commences with the construction of a cost matrix, which encapsulates the pairwise distances between all points in t and the template. Incorporating the specified windowing function confines the possible alignments. The Mori step pattern dictates permissible movements during alignment, thus influencing the alignment path within the cost matrix. Subsequently, we calculate the alignment distance, which represents the overall dissimilarity between t and the template after accounting for temporal variations and intensity disparities.
Following the computation of the cost matrix, the subsequent step entails determining the optimal alignment path. This is accomplished through backtracking, wherein the path with the minimum accumulated cost is traced from the bottom-right corner of the matrix to its top-left corner. Along this path, matched days of year (DOY) between t and the template are identified. It is assumed that a matching function returns the matched DOY on t for a selected point on the template. Thus, for a template observation that belongs to the selected micro-stage, p, we can identify the matched DOY in the target field.
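The cost-matrix and backtracking steps can be sketched as follows. This is a minimal illustration only: it uses the classic symmetric step pattern and an optional fixed-width band, which are simplifying assumptions swapped in for the Mori step pattern and Itakura parallelogram used in the study. The names `dtw_align` and `matched_doy` are hypothetical, and both series are assumed to be daily and to start on the same DOY.

```python
import numpy as np

def dtw_align(target, template, window=None):
    """Classic DTW (ASSUMPTION: symmetric steps, optional band window).

    Returns (distance, path); path lists (target_idx, template_idx) pairs.
    """
    t = np.asarray(target, dtype=float)
    s = np.asarray(template, dtype=float)
    n, m = len(t), len(s)
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = 1, m
        if window is not None:
            lo, hi = max(1, i - window), min(m, i + window)
        for j in range(lo, hi + 1):
            cost = abs(t[i - 1] - s[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    if not np.isfinite(acc[n, m]):
        raise ValueError("window too narrow: no valid alignment path")
    # Backtrack from the bottom-right corner to the top-left corner.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda ij: acc[ij])
        path.append((i - 1, j - 1))
    path.reverse()
    return float(acc[n, m]), path

def matched_doy(path, template_doy, first_doy=1):
    """Target DOY aligned with a template DOY (mean if several points match)."""
    j = template_doy - first_doy
    hits = [ti for ti, sj in path if sj == j]
    return first_doy + int(round(sum(hits) / len(hits)))
```

For a template observation of a given micro-stage recorded on some DOY, `matched_doy` returns the day in the target series that the alignment path pairs with it.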
After measuring the distances and matching the points for each template in S, the algorithm calculates a confidence score for each template from its alignment distance. It then calculates a weight for each template's prediction based on these normalized confidence scores, reflecting their relative importance, and uses the weights to average the matched DOYs into the final prediction.
Thus, for each micro-stage, the algorithm uses the available fields in the training set which have an observation for the selected stage.
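The weighting step can be illustrated with the sketch below. The exact confidence formula is not reproduced in this text, so the inverse-distance form `1 / (1 + d)` used here is purely an assumption chosen for illustration; only the normalization of confidences into weights and the weighted average of matched DOYs follow the description above. The function name `weighted_doy` is hypothetical.

```python
import numpy as np

def weighted_doy(distances, matched_doys):
    """Combine per-template matched DOYs into one prediction.

    ASSUMPTION: confidence = 1 / (1 + distance); the study's own
    confidence definition may differ.
    """
    d = np.asarray(distances, dtype=float)
    doy = np.asarray(matched_doys, dtype=float)
    conf = 1.0 / (1.0 + d)        # smaller distance -> higher confidence
    weights = conf / conf.sum()   # normalized so the weights sum to 1
    return float(np.sum(weights * doy)), weights
```

With this choice, templates that align closely with the target dominate the prediction, while poorly matching templates still contribute a small amount rather than being discarded.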
The overall workflow of the proposed method is summarized in
Figure 6.
4. Experiments
The algorithm was developed and tested on the Kansas dataset. For the evaluation of the method on unseen maize fields in other regions, we used the PIAF dataset. This was the only dataset available for another geographical region and years with the limitation that the number of maize fields was very small.
Given the restricted number of observations for each phenological stage, we implemented leave-one-out cross-validation for all experiments with the Kansas dataset. This involved systematically excluding the target observation along with related observations from the same field to construct the training set. Templates for DTW matching were then derived from the remaining observations in the training set. DTW was subsequently applied to align the target field with each template, facilitating the detection of micro-stages. This rigorous approach ensured a comprehensive assessment of our method’s capability to accurately identify phenological shifts within the target field.
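The leave-one-out construction of the training set can be sketched as below. The record layout (dictionaries with `field_id`, `stage` and `doy` keys) and the function name `leave_one_field_out` are hypothetical; the sketch only shows the exclusion of all observations from the target's field and the restriction to templates that have an observation of the same stage.

```python
def leave_one_field_out(observations):
    """Yield (target, training) pairs for leave-one-out evaluation.

    ASSUMPTION: each observation is a dict with 'field_id', 'stage', 'doy'.
    Observations from the target's own field never enter the training set.
    """
    for target in observations:
        training = [
            o for o in observations
            if o["field_id"] != target["field_id"]  # drop the whole field
            and o["stage"] == target["stage"]        # same micro-stage only
        ]
        yield target, training
```

An empty training list for a target means that no other field has an observation of that micro-stage, so no template is available for it.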
4.1. Performance Metrics
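To evaluate the proposed model, we computed three statistical measures for each micro-stage: mean absolute error (MAE), median absolute error (MedAE) and root mean square error (RMSE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|,$$
$$\mathrm{MedAE} = \operatorname{median}\left(\left|\hat{y}_1 - y_1\right|, \ldots, \left|\hat{y}_n - y_n\right|\right)$$
and
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$
where $n$ is the number of samples, $\hat{y}_i$ is the prediction and $y_i$ is the ground truth (as DOY).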
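These three measures can be computed per micro-stage as in the following sketch, which uses the standard definitions of MAE, MedAE and RMSE. The function name `doy_errors` is hypothetical; inputs are predicted and observed DOYs for one micro-stage.

```python
import numpy as np

def doy_errors(predicted, observed):
    """Return MAE, MedAE and RMSE (in days) for paired DOY values."""
    err = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    abs_err = np.abs(err)
    return {
        "MAE": float(abs_err.mean()),
        "MedAE": float(np.median(abs_err)),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
    }
```

Because RMSE squares the errors, a few large misses raise it more than they raise MAE, while MedAE is largely insensitive to them, which is why the three are reported together.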
As mentioned above, each field underwent approximately five visits during the season. Therefore, a reported observation does not necessarily signify the beginning of that stage. Most likely, the crop in the field was at the stage on the preceding and succeeding days. It is also important to consider that the same micro-stage could develop at different rates within a few days, depending on specific growing conditions and the particular growth stage of the crop. Hence, achieving a 0-day error is not a practical expectation, and predictions with differences of around one to five days are plausibly accurate.
4.2. Comparison of DTW Methods
To assess the efficacy of our proposed weighted average approach, we conducted a comparative study with two alternative methods using DTW on the PF-NDVI time series.
Method 1: Average Time Series. In this method, we computed the average time series of the training fields and employed it as a reference during the DTW matching process. This approach serves as a baseline since it was used in several studies [
28,
49] (
Figure 7).
Method 2: Single Similar Field. Unlike the conventional approach of utilizing all fields for comparison, we experimented with selecting the most similar field from the training dataset based on the distance metric mentioned earlier. Subsequently, DTW matching was performed exclusively using this selected field.
Method 3: Weighted Average (Proposed Method). As described in the Methodology section, we proposed a method that applies DTW between the target field and each individual field in the training dataset. After this, we used a weighted averaging approach to calculate the DOY of the selected micro-stage. The weights were determined based on the confidence scores obtained from the distance values.
For each method, we evaluated its performance in detecting micro-stages using PF-NDVI time series (median per field). Comparative analysis was conducted to discern the strengths and limitations of the weighted average approach in contrast to the simpler alternatives.
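The template choices behind the two simpler methods can be sketched as below. The helper names `average_template` and `most_similar_template` are hypothetical, and the example assumes all template series are daily and of equal length. The sketch only shows how the reference is selected; the DTW matching itself would follow as described in the Methodology section.

```python
import numpy as np

def average_template(templates):
    """Method 1 reference: point-wise mean of all training time series."""
    return np.mean(np.vstack(templates), axis=0)

def most_similar_template(distances):
    """Method 2 reference: index of the training field with the smallest distance."""
    return int(np.argmin(distances))
```

As a toy example of the smoothing effect of averaging, two synthetic series that each peak at 1.0 but 20 days apart give an averaged template whose peak is only about 0.5, so features present in the individual fields are flattened in the reference.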
Figure 8 shows the averages of the mean and median absolute errors for the 70 micro-stages. Mean and median absolute errors are smaller for our proposed method.
In addition to the overall performance metrics calculated earlier, we determined the error for each observation and counted the number of observations detected with errors of up to 1, 5, 10 and 15 days. Note that field observers did not visit the fields daily, so the exact day on which a stage began is unknown. Lacking this information, we computed the error as the difference between the predicted day and the observed day (see
Table 3).
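The tally of observations within each error margin can be sketched as follows. The function name `within_tolerance` is hypothetical; the default margins are the 1, 5, 10 and 15 days used above.

```python
import numpy as np

def within_tolerance(predicted, observed, tolerances=(1, 5, 10, 15)):
    """Count observations whose absolute DOY error is within each margin.

    Returns {margin_in_days: (count, share_of_all_observations)}.
    """
    abs_err = np.abs(np.asarray(predicted, dtype=float)
                     - np.asarray(observed, dtype=float))
    n = len(abs_err)
    return {
        tol: (int(np.sum(abs_err <= tol)), float(np.mean(abs_err <= tol)))
        for tol in tolerances
    }
```

The counts are cumulative, so an observation with a 3-day error is included in the 5-, 10- and 15-day margins.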
Our proposed method successfully identified 90% of the observations within a 10-day error margin. In contrast, method 1 recognized only 79% of the observations within the same error range. Upon examining the worst 3% of observations, where the error exceeded 15 days, we noted that about half of them (49 out of 97) were associated with micro-stages of R5. This higher error rate during the R5 stage can likely be attributed to the subtle physiological changes that occur as the kernels transition from Early Dent to Black Layer, which are not easily distinguishable using remote sensing data. The other 48 were associated with the VE (emergence to seedling) and R4 (dough) micro-stages. The changes during VE and R4 are likewise most likely too subtle to be picked up by the optical signal.
4.3. Performance of Spectral Indices and Bands
In the second experiment, our focus shifted to the input data. We computed various spectral indices derived from PF and used the bands most sensitive to vegetation dynamics (green, red and NIR), as explained in
Section 3.1, as input to our proposed DTW method (method 3). For each input type, we conducted experiments separately and evaluated their performance across different phenological stages.
Figure 9 shows the performance for NDVI, MCARI, CVI and NDWI indices over all micro-stages. Comparative analysis was conducted to assess the effectiveness of each input type in accurately detecting phenological micro-stages within the target field (RMSE values for the selected micro-stages are given in
Appendix A Table A1 and
Table A3, while the MedAE values are given in
Appendix A Table A2 and
Table A4).
No single vegetation index outperformed the others across all micro-stages. However, we observed that NDVI and MCARI are the most reliable input types for these 70 micro-stages. For both, the average median error over all stages was around 4 days. For 58 micro-stages, the difference in median error between the two indices was less than 1 day.
Table 4 shows the performance of the proposed method based on NDVI and MCARI indices for a subset of micro-stages. We observed that MCARI provides better results for the R1 and VT stages. Although CVI has the smallest error from V10 to V16, the RMSE through the reproductive stages and early in the season (V1–V9) is higher than NDVI and MCARI. NDWI always has higher RMSE than at least one other index, except in the Blister—Milk micro-stage. DTW applied to individual spectral bands did not show better performances overall than when applied to spectral indices, and so the spectral bands were discarded from any further analysis. For single-variable methods like DTW, spectral indices seem to be better suited as they combine two or more wavelengths (bands) to enhance the information content of the underlying dataset.
4.4. Comparison of Planet Fusion and HLS
We compared PF to publicly available HLS. For each data source, we evaluated the DTW performance using NDVI and MCARI, as detailed in the preceding section.
Figure 10 illustrates the differences in the HLS and PF NDVI time series and detected micro-stages for two example fields. We compared the RMSE performance of the best PF-based DTW solution with that of the best HLS-based DTW solution for each micro-stage. For PF, MCARI performs the best overall, while for HLS, NDVI performs the best overall. PF’s average RMSE values are 2.5 days lower than HLS’s average RMSE values over all 70 micro-stages (RMSE values for the selected micro-stages are given in
Appendix A Table A5 and
Table A6). We observed the biggest differences for V2, R1 and early R5.
Figure 11 illustrates the comparison between the best PF and HLS DTW solutions for two example macro-stages, V4 and R1. Nitrogen fertilizer is applied during V4–V6 (in small doses and in addition to pre-planting application), with V6 marking the beginning of rapid growth and the uptake of larger amounts of water and nutrients [
50]. Growing conditions around flowering (R1 silking) are important for determining yield, and water stress or shading in this phase negatively affects yield [
51]. The orange color represents the observed and predicted days for each observation in the V4 stage, which includes the 4-leaf and 4–5 leaf micro-stages. The pink color indicates the observed and predicted days for each observation in the R1 flowering stage (
Figure 11b). When we count the number of micro-stages where at least 80% of the samples were detected within a five-day error margin, HLS achieved this accuracy only for the 5–6-leaf stage, with a rate of 80.4%. In contrast, the PF-based method detected at least 80% of the samples within a five-day error margin for all micro-stages from ‘3–4-leaf’ to ‘7–8-leaf’, with the highest sample ratio being 86% for the ‘5–6-leaf’ stage. HLS had its worst performance at the ‘seedling’ stage, where only 59% of the samples were detected within a 10-day error margin. The PF method’s lowest ratio was 67%, observed at the ‘16–17-leaf’ stage within a 10-day error margin.
We observed similar patterns for the rest of the phenological stages.
Figure 12 shows a one-to-one comparison for macro-stages V8, R4 and R5. For both data sources (PF and HLS), we picked the index with the best performance, as mentioned above. As mentioned in
Section 4.2, the R5 macro-stage has a higher error margin compared to other stages.
4.5. Visual Validation for PIAF Fields
It is essential to assess the robustness of the proposed model against temporal and spatial variability. To accomplish this, we utilized the PIAF dataset, as outlined in
Section 2.2.2. All fields were located in Germany, with observation years being either 2018 or 2019. We used all Kansas fields as templates and applied our model to these seven fields. Due to the limited number of fields and observations (22 observations), we could only visually validate the model and conclude only on the overall performance. Additionally, the phenology dataset employed different categorizations of phenology stages following the BBCH scale. Therefore, our aim was to detect macro-stages corresponding to BBCH stages (as shown in
Table 1).
Figure 13 displays the BBCH stage of an observation and the detection results of the corresponding US-scale macro-stages for four different fields. If the observed macro-stage falls within the detected micro-stages (e.g.,
Figure 13d), we consider the macro-stage prediction correct. Otherwise, we calculate the distance between the observation and the nearest prediction of the related micro-stages. In total, 9 out of the 22 observations (41%) were error-free. The average error for all observations was 3.6 days. Similar to our Kansas dataset results, most predictions were within a 10-day buffer compared to the ground truth.
5. Discussion and Conclusions
In this study, we proposed a novel method for maize phenology detection using Dynamic Time Warping (DTW) applied to daily time series of optical imagery. We compared the novel analysis-ready PF data and HLS data, limiting the analysis to the spectral bands both datasets had in common (R, G, B, NIR). Both were processed using daily datasets to evaluate the possibility of capturing subtle changes in crop development. We evaluated the model’s performance across 70 micro-stages, leveraging a dataset comprising more than 200 fields and 3235 observations. Our DTW strategy aligns target fields with templates, calculates distances between time series and derives confidence scores for predictions. Unlike methods relying on a singular template [
28], our approach preserves crucial patterns in the data. We highlight the significance of individual observations in prediction accuracy and reduce the impact of potential labeling issues. In addition to examining individual bands, we investigated eight commonly employed indices as inputs. Our observations revealed that the NDVI and MCARI indices show particular promise. Across all stages, for PF, the average median errors for both indices were approximately 4 days. Notably, we found that MCARI yields comparatively superior results, especially for the VT and R1 stages, at the transition between vegetative and reproductive stages and during the peak of biomass. This could be due to NDVI and MCARI more closely representing the physical characteristics of each maize growth stage. Both VIs have been shown to have a strong correlation with maize canopy cover changes [
52] and NDVI has been shown to strongly relate to biomass accumulation in the V1–V10 macro-stages [
53]. Although the EVI and EVI2 also measure these features, NDVI has been shown to better characterize the start-of-season growth for maize [
54]. In the reproductive growth stages, MCARI has been demonstrated to have a stronger relationship with chlorophyll content than NDVI and CVI [
55], possibly due to the saturation of NDVI at high biomass levels and CVI being largely driven by greenness. CVI also performed better in the V10–V16 stages due to greenness and saturation of NDVI. For HLS, NDVI performed better overall.
Additionally, for the best result with PF-MCARI, we observed that the difference between observation and prediction was equal to or less than 5 days for 2025 of the observations (65%), and 2900 (90%) of the observations had a maximum distance of 10 days. For all micro-stages from ‘3-4-leaf’ to ‘7-8-leaf’, at least 80% of the samples were detected within a five-day error margin, with the highest accuracy reaching 86%. Only 3% of the overall observations had a difference of more than 15 days, with more than half of these observations belonging to the micro-stages of R5, the last macro-stage before maturity. This macro-stage contains the highest number of micro-stages. The performance of the proposed method is particularly promising, considering that field observers did not visit the fields daily. Consequently, reported observations may not accurately represent the exact initiation of each stage. Furthermore, it is impractical and logistically challenging to compile a dataset where phenological stages are recorded daily throughout the growing season across multiple fields. Such an undertaking would require substantial labor and resources, which are often not feasible. Additionally, many micro-stages are nearly imperceptible to the naked eye, complicating accurate field observations. Therefore, expecting zero-day error is unrealistic. Given these constraints, the fact that 97% of the provided predictions fall within acceptable limits underscores the robustness and reliability of the proposed method.
To understand the advantages of Planet Fusion (PF), a daily gap-free, cloud-free, harmonized analysis-ready image dataset, we compared it with the publicly available HLS dataset. PF outperforms HLS, with significant improvements observed in key phenological stages like V4, R1 and late R5. PF is analysis-ready data supplied at a daily resolution, whereas HLS needed to undergo interpolation to produce a daily time series. The selected interpolation approach has an impact on the time series characteristics. As different interpolation approaches were not compared as part of this study, it is possible that the performance of HLS could be improved with a different interpolation algorithm or a more complex gap-filling approach using previous years and proximate pixel values, like PF processing. However, it seems probable that the largest impact on performance is attributable to the difference in raw image volume. Across the year 2017 in the Kansas dataset, for each field, an average of 110.99 PlanetScope observations formed the basis of PF for gap-filling. On the other hand, an average of 44.25 HLS observations were interpolated into a daily time series. Note that the eastern Kansas site had many cloud days and lower data frequency. The average number of days between observations was 8.33 for HLS and 3.25 for PF. As there are more images in PF, it is more likely to have captured changes in spectral reflectance that might indicate changes in the phenological stage. This is exacerbated by the HLS data for 2017 including only one of the Sentinel-2 satellites until July 6th. The increased cadence of PlanetScope data as a source constellation and the gap-filling approach of PF, which also uses signals from other sensors such as Sentinel-2, may be the reason that phenology micro-stages can be better identified in PF.
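To make the interpolation step concrete, the sketch below turns sparse cloud-free observations into a daily series using plain linear interpolation. The interpolation method is not specified in this text, so linear interpolation is an assumption chosen for simplicity, and the function name `to_daily` is hypothetical. Missing values (NaN) are dropped before interpolating.

```python
import numpy as np

def to_daily(obs_doys, values, first_doy=1, last_doy=365):
    """Linearly interpolate sparse observations to a daily series.

    ASSUMPTION: linear interpolation; the study's gap-filling may differ.
    Values outside the observed range are held at the nearest observation.
    """
    obs_doys = np.asarray(obs_doys, dtype=float)
    values = np.asarray(values, dtype=float)
    keep = ~np.isnan(values)                # discard cloudy / missing dates
    order = np.argsort(obs_doys[keep])      # np.interp needs increasing x
    x, y = obs_doys[keep][order], values[keep][order]
    days = np.arange(first_doy, last_doy + 1)
    return days, np.interp(days, x, y)
```

With an average gap of about eight days between observations, a short-lived change that falls entirely between two acquisitions is replaced by a straight line, which is one way the lower raw image volume could limit micro-stage detection.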
Finally, we aimed to demonstrate the model’s transferability to different regions and years. We used the Kansas dataset observations to detect stages in seven fields in Germany during 2019 and 2020. The average error was 3.6 days for the 22 observations. The results show its potential applicability across diverse regions and years. However, to provide statistically proven results, it is crucial to increase the number of fields from Germany or obtain additional data from any other region. Unfortunately, phenology observation datasets at this level of detail are very difficult to access. These detailed datasets are not available for all regions globally, limiting the applicability of this approach outside of North America and Europe. Our study hints at what could be achieved when such detailed phenology datasets are combined with the latest technology in remote sensing and machine learning. Further data are needed to validate these approaches for regional-scale research. Remote sensing data could improve crop growth models which are currently based on weather data only (e.g., [
56,
57]) and show variability in predictive power throughout the season. While our study demonstrates the robustness and efficiency of the DTW algorithm in detecting maize phenology stages, we did not focus on optimizing the computational aspects of the code. For application in very large regions, further optimization may be necessary. Additionally, the DTW method can be used in a knowledge distillation framework, where it serves as a teacher model to train a deep learning-based solution, thereby enhancing scalability and performance.
As a direction for future research, we aim to expand the scope of our study by redefining the problem. Instead of identifying a single day of year (DOY) for each micro-stage, we intend to approach it as a segmentation problem, where the model predicts both the start and end dates of each micro-stage. This adjustment will provide a more comprehensive understanding of the phenological dynamics. Furthermore, we intend to explore the variations in plant growth within individual fields. Rather than relying on a single median or mean value for a given day, we aim to leverage the entire field’s pixel data to capture phenology variability within each field accurately. Lastly, we will explore in-season phenology detection for fields where it is known that the planted crop is maize. Our current methodology focuses on retrospectively analyzing the entire time series to determine the timing of each stage; although understanding historical crop phenology is crucial for understanding long-term trends in agricultural productivity and management, there is a growing need in various application domains to ascertain the current phenological stage for timely decision-making. By addressing these aspects, we aim to enhance the robustness and applicability of our model across diverse agricultural landscapes and temporal contexts.