1. Introduction
Vegetation phenology is one of the most sensitive biological indicators of terrestrial ecosystem responses to global climate change [1] and a key factor regulating the terrestrial carbon cycle [2]. Therefore, high-accuracy monitoring and retrieval of vegetation phenology are essential for elucidating the response and feedback mechanisms of terrestrial ecosystems to climate change and for improving the reliability of carbon cycle process modeling.
Traditional phenology monitoring mainly relies on field plot surveys and manual, periodic observations [3], in which changes in plant developmental stages are recorded to derive key parameters such as the start of the growing season (SOS), end of the growing season (EOS), and length of season (LOS). Although this approach can accurately represent local physiological processes, its application at large spatial and long temporal scales is constrained by labor and time costs and by sparse sampling. In wetlands, long-term inundation, restricted access, and safety concerns further limit field observations, weakening both their operational feasibility and representativeness. To overcome the limitations of traditional surveys in terms of continuity and automation, near-surface observation systems have been developed along several fixed, automatic, and high-frequency pathways. For example, phenological cameras (PhenoCam) continuously record canopy appearance and seasonal transitions using time-lapse imagery from fixed viewpoints [4]; eddy covariance flux networks characterize ecosystem functional rhythms from the perspective of energy and mass exchange [5]; and unmanned aerial vehicle platforms provide high spatial resolution surface information over small areas to complement and refine ground-based samples [6]. Although these near-surface approaches still share the common constraint of limited spatial coverage, they provide high-frequency, calibratable process evidence that plays a crucial role in bridging ground-based observations and satellite-based phenology retrieval.
Satellite remote sensing, with its non-contact, wide-area coverage and long-term continuous observation capability, has become the core means of obtaining phenology information from regional to global scales. With the maturation of Earth observation systems, land surface phenology products with spatial resolutions of 250–1000 m have been widely applied [7]. To better represent strongly heterogeneous surfaces, long time series of vegetation indices or biophysical variables have gradually become mainstream, and by fusing multispectral imagery from sensors such as Landsat 8 and Sentinel-2, it is now possible to generate spatiotemporally continuous phenology products at 10–30 m resolution with revisit intervals of about 3 days, thereby greatly enhancing the ability to resolve complex phenological dynamics at the land surface [8]. However, remotely sensed phenology retrieval is highly sensitive to data quality and model assumptions. Cloud contamination, time-series noise, cross-sensor differences, and scale effects can all introduce uncertainty and reduce the accuracy of phenological transition detection.
In wetland ecosystems, these challenges are even more pronounced. Wetlands are characterized by highly diverse vegetation types and fragmented spatial distribution [9], while rapid changes in land–water ecotones tend to cause mixed and abrupt remote sensing signals, which interfere with the stable extraction of phenological information. In addition, wetlands are often located in cloud-prone, rainy climatic regions, where usable optical observations are substantially reduced, and time series frequently exhibit large gaps and discontinuities, making it difficult to meet the requirements of large-scale, long-term phenology monitoring. Satellite remote sensing remains the primary tool for obtaining large-scale phenology information [10], yet optical imagery in wetland monitoring still suffers from severe data scarcity [11,12,13,14]. In regions with frequent cloud and rain, medium- to high-resolution sensors rarely form stable sequences of valid observations, further weakening the ability to capture rapid vegetation changes [15,16]. Although sensors such as MODIS offer high revisit frequency, their coarse spatial resolution (250 m–1 km) is inadequate for characterizing fragmented and highly heterogeneous wetland landscapes. To alleviate the trade-off between spatial and temporal resolution, spatiotemporal image fusion (STF) techniques have been widely adopted [17,18,19], aiming to combine the advantages of fine spatial resolution and high temporal frequency. Classical models such as STARFM [20], ESTARFM [21], and FSDAF [22] perform well over relatively homogeneous surfaces, but in regions with strong land-cover variability, such as wetland margins and land–water transition zones, they often struggle to reconstruct sharp spectral changes and to adequately account for the uncertainty of cloud-contaminated observations, thereby compromising the realism of the fused series and the stability of phenology identification. In wetland regions with highly heterogeneous water–vegetation mixtures, kNDVI is expected to better capture the vegetation signal because its nonlinear kernel transformation reduces saturation effects and mitigates water background interference, providing a more accurate representation of vegetation dynamics in mixed pixels.
Beyond limitations in spatiotemporal resolution, the suitability of the vegetation index (VI) itself also constrains phenology retrieval accuracy [23,24,25,26]. Traditional indices such as NDVI are prone to saturation in high-biomass regions and are highly sensitive to backgrounds such as water and mudflats [27], a problem that is particularly acute in wetlands where background conditions change rapidly. To overcome the limitations of linear indices, Camps-Valls et al. [28] proposed the kernel Normalized Difference Vegetation Index (kNDVI), which uses a radial basis function (RBF) kernel to map spectral information into a high-dimensional Hilbert space, thereby more effectively capturing the nonlinear relationship between near-infrared and red bands. Previous studies have shown that kNDVI outperforms NDVI, EVI, and NIRv in resistance to saturation, sensitivity to gross primary production (GPP), and robustness to noise [28], and it has been applied to dynamic monitoring and conservation-effectiveness evaluation across multiple ecosystems [29,30,31,32]. However, for highly dynamic wetlands, which are typical of cloud–rainy climates with strong background mixing and rapid land-cover transitions, there is still a lack of systematic technical frameworks and targeted validation for generating high-quality long time series of kNDVI and applying them to phenology retrieval. In particular, in wetland regions with strongly mixed water–vegetation pixels, kNDVI provides superior performance because its nonlinear kernel mapping reduces background interference from water and mudflats and better captures vegetation spectral responses related to canopy structure and photosynthetic activity. Empirical studies in wetlands have confirmed that kNDVI yields more accurate vegetation dynamics and phenology retrieval compared to linear indices in such heterogeneous environments [33,34,35].
Although existing spatiotemporal fusion methods have substantially improved the temporal continuity of medium- and high-resolution optical imagery, a clear methodological gap remains for phenology retrieval in cloud-prone wetland environments. First, residual clouds and shadows are usually treated mainly during preprocessing, whereas their uncertainty is rarely incorporated into the fusion weighting process. Second, fixed or weakly adaptive similar-pixel selection strategies are not well suited to fragmented wetland landscapes, where water, mudflats, and vegetation patches are spatially interlaced and may lead to mixed-pixel errors along land–water boundaries. Third, most existing fusion methods rely on approximately linear reflectance or index conversion assumptions, which may be insufficient for representing nonlinear kNDVI changes during rapid green-up, senescence, or hydrologically driven vegetation transitions. These gaps directly motivate the methodological improvements proposed in this study, including cloud-probability and temporal-distance weighting, adaptive matching windows constrained by land-cover information, and a quadratic correction term for nonlinear kNDVI reconstruction.
To address the above issues, this study proposes an improved IESTARFM-based fusion scheme on the Google Earth Engine (GEE) platform for monitoring highly dynamic wetlands and uses it to construct high-frequency kNDVI time series for phenology retrieval. Methodologically, we extend the ESTARFM framework by introducing cloud-probability and temporal-proximity weighting factors to automatically identify and suppress unreliable pixels in reference images, thereby reducing the influence of residual cloud noise on the predicted series. In response to wetland landscape fragmentation and rapid land–water transitions, we incorporate a classification-driven fusion strategy, in which an adaptive spatial window and land-cover masks constrain the range of similar pixels to enhance spectral consistency and boundary preservation in land–water transition zones. Considering the nonlinear response characteristics of kNDVI, we further introduce a quadratic correction term to more finely describe rapid green-up and senescence processes and to improve the detection of key phenological events such as SOS and EOS. Using the Poyang Lake wetland as the study area, we reconstruct a 10 m, 8-day kNDVI dataset, evaluate the effectiveness of the improved scheme against ESTARFM, STARFM, FSDAF, and original Sentinel-2 observations, and finally combine Harmonic Analysis of Time Series (HANTS) with a dynamic threshold method to generate wetland vegetation phenology maps. This work provides a transferable technical framework for high-precision data reconstruction and phenology monitoring in cloudy–rainy wetlands.
2. Materials and Methods
2.1. Study Area
Taking the retrieval of wetland vegetation phenology in the Poyang Lake region as a case study, this paper investigates how different spatiotemporal fusion algorithms affect phenology retrieval in cloudy–rainy wetlands. Poyang Lake is located in northern Jiangxi Province on the southern bank of the middle reaches of the Yangtze River, with geographic coordinates ranging from 115°49′E to 116°46′E and 28°24′N to 29°46′N (Figure 1). The Poyang Lake wetland is an important stopover and breeding site for migratory birds and is typically classified as a flow-through and river-connected lake, whose natural hydrological linkage with the Yangtze River provides the key background for wetland formation and evolution. The region is situated in a subtropical monsoon climate zone, characterized by a mild, humid climate and abundant sunshine. The multi-year mean air temperature ranges between 16.5 °C and 17.8 °C, displaying an overall pattern of higher temperatures in the south and lower in the north, with a north–south temperature difference of about 1 °C [36]. The long-term mean annual precipitation over the lake area is 1542 mm; driven by the monsoon, precipitation exhibits pronounced spatiotemporal heterogeneity. Intra-annually, rainfall is highly unevenly distributed, with approximately 69.4% of the annual total occurring from April to September, while spatially it decreases gradually from the southeast toward the northwest [36].
2.2. Source of Data
All remote sensing data used in this study were obtained from the Google Earth Engine (GEE) platform. The study period spans from 1 January 2024 to 31 December 2024. High-temporal- and high-spatial-resolution optical data were jointly used to construct a high-frequency kNDVI time series for the Poyang Lake wetland to support phenology retrieval. As the high-temporal-resolution input, we used the MODIS Terra Surface Reflectance 8-Day Composite product MOD09A1 (Collection 6.1), which provides 500 m surface reflectance for MODIS bands 1–7 together with quality layers, thus offering representative observations for each 8-day compositing period and supporting pixel-level screening based on quality information. As the high-spatial-resolution input, we used the Harmonized Sentinel-2 MSI Level-2A surface reflectance dataset, which provides multispectral observations at 10–60 m spatial resolution with a nominal revisit interval of about 5 days, thereby supplying finer spatial structural constraints.
Cloud and snow/ice contamination were removed by jointly masking the Sentinel-2 QA60 band and the Scene Classification Layer (SCL), in order to reduce the impact of residual cloud noise on the index time series under cloudy conditions. To ensure temporal consistency between sensors, Sentinel-2 observations were composited at 8-day intervals to match the MOD09A1 period. The MODIS MOD09A1 images were resampled to the Sentinel-2 10 m grid only for spatial registration and pixel-wise implementation of the spatiotemporal fusion algorithm. This operation should not be interpreted as generating true 10 m spatial details from MODIS observations. Similar preprocessing procedures have been widely adopted in established spatiotemporal fusion models, including STARFM, ESTARFM, and FSDAF, where coarse-resolution images are first aligned with the fine-resolution grid before fusion. In the proposed framework, MODIS mainly provides temporally continuous variation information, whereas Sentinel-2 constrains the fine-scale spatial structure. Bilinear interpolation was selected to reduce blocky discontinuities associated with nearest-neighbor resampling while preserving the smooth temporal signal of coarse-resolution MODIS observations. Vegetation indices were calculated from red and near-infrared reflectance, using bands B4 and B8 for Sentinel-2 and bands 1 and 2 for MODIS. On this basis, kNDVI time series were generated and subsequently used for phenological parameter retrieval.
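The index calculation described above can be illustrated with a minimal Python sketch. It assumes the commonly used simplified form of kNDVI, in which the RBF length scale is set to the mean of the near-infrared and red reflectances so that the index reduces to tanh(NDVI²); the source does not state which length scale was used, so this choice and the function names are assumptions made for illustration.

```python
import math


def ndvi(nir: float, red: float) -> float:
    """Normalized difference of near-infrared and red reflectance."""
    denom = nir + red
    if denom == 0:
        return 0.0  # no signal in either band; treated as zero here (assumption)
    return (nir - red) / denom


def kndvi(nir: float, red: float) -> float:
    """kNDVI with the RBF length scale assumed to be 0.5 * (nir + red).

    Under that assumption the kernel index simplifies to tanh(NDVI ** 2).
    """
    return math.tanh(ndvi(nir, red) ** 2)


# Example: Sentinel-2 B8 (NIR) = 0.40, B4 (red) = 0.10 gives NDVI = 0.6
example = kndvi(0.40, 0.10)
```

Because NDVI is squared, this simplified form always returns a non-negative value, so water pixels with negative NDVI need to be handled by the land-cover mask rather than by the index itself.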
Because the input optical images were pre-screened before fusion, this study mainly evaluates the reconstruction performance under moderate residual cloud contamination rather than under the full range of cloudiness conditions. Sentinel-2 scenes with cloud coverage greater than 25% were excluded, and remaining cloud-affected pixels were further masked using QA60 and SCL. MODIS MOD09A1 observations were also screened using quality information and used as temporally continuous coarse-resolution reference data. Therefore, the proposed IESTARFM framework is primarily designed to reduce the influence of residual cloud noise and short-term observation gaps within pre-filtered optical time series, rather than to reconstruct fine-scale vegetation dynamics under persistent cloud cover where valid Sentinel-2 observations are almost completely absent.
2.3. Proposed Algorithm for Spatio-Temporal Fusion of Remote Sensing Images
To address the coexistence of long-term data gaps in medium- to high-resolution optical imagery and strong landscape fragmentation in cloudy–rainy wetlands—which leads to temporal discontinuities, difficulty in preserving sharp boundaries, and easy propagation of residual cloud noise—this study proposes a spatiotemporal fusion algorithm tailored for wetland vegetation phenology retrieval in such regions. The algorithm fuses high-temporal-resolution, coarse-scale images with low-temporal-resolution, fine-scale images to generate continuous reflectance and index series with both high spatial and temporal resolution.
The core innovations of the proposed IESTARFM scheme are as follows: (i) cloud probability and temporal proximity are explicitly incorporated into the weight construction to systematically down-weight unreliable pixels and mitigate the propagation of residual cloud noise; (ii) an adaptive spatial window combined with land-cover-based constraints is used to improve the homogeneity of similar-pixel selection in land–water transition zones and to enhance boundary preservation; and (iii) a quadratic correction term is introduced into the fusion formulation to account for the nonlinear response of kNDVI, thereby improving shape fidelity during rapidly changing phases of the growing season and stabilizing the detection of key phenological events.
2.3.1. Adaptive Matching Window
In the traditional ESTARFM framework, a fixed-size moving window is used to search for similar pixels. In regions with frequent cloud cover or highly fragmented land-cover types, such a fixed window may be too small to locate a sufficient number of cloud-free similar pixels. To address this, we adopt an adaptive window strategy that starts from a relatively small window. If the number of similar pixels does not reach a predefined threshold N, the window size is gradually enlarged until enough candidate pixels are found. As the window expands, only pixels that are spatially close and exhibit small spectral differences are admitted into the candidate set, in order to avoid introducing heterogeneous land-cover types from an overly large neighborhood. This adaptive design ensures that even under locally extensive cloud cover, clear pixels over similar land cover can still be identified within a larger neighborhood to support prediction.
In this study, the local heterogeneity index H was defined as the coefficient of variation of valid Sentinel-2 kNDVI values within the initial local window. For a target pixel x, H was calculated as:

$$H(x) = \frac{\sigma(x)}{\mu(x) + \varepsilon}$$

where $\sigma(x)$ and $\mu(x)$ represent the standard deviation and mean value of valid kNDVI pixels within the initial local window, respectively, and ε is a small constant used to avoid division by zero. Only pixels that passed the cloud mask and land-cover consistency constraint were included in the calculation. A larger H indicates stronger local spatial heterogeneity, such as fragmented land–water boundaries or mixed vegetation–mudflat pixels, whereas a smaller H indicates a relatively homogeneous surface.
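As a small illustration of the heterogeneity index, the following Python sketch computes the coefficient of variation of the valid kNDVI values in a window. Masked pixels are represented by `None`; the use of the population standard deviation, the value of `eps`, and the function name are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def heterogeneity_index(values: Sequence[Optional[float]],
                        eps: float = 1e-6) -> Optional[float]:
    """Coefficient of variation of valid kNDVI values in the initial window.

    `values` holds the window pixels; None marks pixels rejected by the
    cloud mask or the land-cover consistency constraint.
    """
    valid = [v for v in values if v is not None]
    if len(valid) < 2:
        return None  # too few valid pixels to estimate variability
    return pstdev(valid) / (mean(valid) + eps)


# A mixed land-water window is far more heterogeneous than a uniform one
h_mixed = heterogeneity_index([0.1, 0.7, 0.1, 0.7, None])
h_uniform = heterogeneity_index([0.5, 0.5, 0.5, 0.5])
```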
The values of $r_{\min}$, $r_{\max}$, and α were determined according to the 10 m spatial resolution of Sentinel-2, the fragmented land–water pattern of the Poyang Lake wetland, and the need to balance boundary preservation with the availability of sufficient similar pixels. Specifically, $r_{\min}$ was used to prevent the search window from crossing sharply contrasting land-cover boundaries in highly heterogeneous areas, while $r_{\max}$ was used to ensure that enough candidate pixels could be obtained in relatively homogeneous or locally cloud-affected regions. The parameter α controls the rate at which the window radius decreases with increasing heterogeneity.
The window radius $r$ is defined as a decreasing function of the local heterogeneity index $H$, bounded by $r_{\min}$ and $r_{\max}$, the minimum and maximum candidate window radii; α controls the strength of the adjustment. When local heterogeneity is high, the window converges toward $r_{\min}$ and focuses on more homogeneous areas around the target pixel; when heterogeneity is low, the window approaches $r_{\max}$ and can be enlarged to gather more samples. In this way, the window automatically shrinks in land–water transition zones to avoid spanning sharply contrasting land-cover types, whereas over extensive grassland or mudflat areas it can expand to incorporate a sufficient number of pixels.
For each target pixel, at least N similar pixels are required for prediction. If fewer than N similar pixels are found within the initial window, the window is progressively expanded until the requirement is satisfied; if a large number of similar pixels are retrieved but they include multiple land-cover types, the window is reduced or the similarity threshold is tightened. This procedure effectively adjusts the window on demand, balancing prediction accuracy and statistical stability.
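Let $r_{\max}$ and $r_{\min}$ denote the upper and lower bounds of the window radius, respectively, and let $H_{th}$ be a heterogeneity threshold. When the local heterogeneity index $H \ge H_{th}$, the minimum window $r_{\min}$ is used; when $H$ is very small, the maximum window $r_{\max}$ is adopted; for intermediate values, the window radius is computed by linear interpolation:

$$r = \begin{cases} r_{\min}, & H \ge H_{th} \\ r_{\max} - \left(r_{\max} - r_{\min}\right)\dfrac{H}{H_{th}}, & 0 \le H < H_{th} \end{cases}$$
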
Through this adaptive matching-window strategy, unsuitable similar pixels are substantially excluded, thereby reducing prediction noise. In the Poyang Lake wetland, the dynamic window helps avoid mixing water and land pixels and enhances the robustness of the model in highly heterogeneous environments.
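The window logic described in this subsection can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the study: the linear interpolation between the two radius bounds follows the verbal description, while the function names, the callable `count_similar`, and the numeric values in the example (radii of 3 and 15 pixels, a threshold of 0.6, 20 required pixels) are hypothetical.

```python
from typing import Callable


def window_radius(h: float, r_min: int, r_max: int, h_th: float) -> int:
    """Radius shrinks linearly from r_max (h = 0) to r_min (h >= h_th)."""
    if h >= h_th:
        return r_min
    frac = max(h, 0.0) / h_th
    return round(r_max - (r_max - r_min) * frac)


def expand_until_enough(count_similar: Callable[[int], int], radius: int,
                        r_max: int, n_required: int, step: int = 1) -> int:
    """Enlarge the window until n_required similar pixels are found.

    `count_similar(radius)` is a hypothetical callback returning the number
    of cloud-free, same-class similar pixels within the given radius.
    The search never exceeds r_max.
    """
    while count_similar(radius) < n_required and radius < r_max:
        radius = min(radius + step, r_max)
    return radius


# Heterogeneous boundary pixel: small window; homogeneous meadow: large window
r_boundary = window_radius(0.8, r_min=3, r_max=15, h_th=0.6)
r_meadow = window_radius(0.05, r_min=3, r_max=15, h_th=0.6)
```

Capping the expansion at the maximum radius keeps the search from drifting into unrelated land-cover patches when a neighborhood is almost entirely cloud covered; in that case the caller has to decide whether to predict from fewer pixels or to skip the target pixel.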
2.3.2. Cloud-Probability and Temporal-Distance Weighting Factors
The original ESTARFM weight function mainly considers spectral difference and spatial distance, and lacks explicit treatment of cloud cover and temporal distance. In cloudy regions, residual clouds or shadows may render some neighborhood pixels unreliable, yet the original algorithm may still assign them relatively high weights. In addition, ESTARFM finally fuses the two base date predictions by simple interpolation and does not fully exploit the temporal information contained in multiple observations. Since the presence of clouds introduces observation uncertainty, it should be suppressed through the fusion weights. Therefore, in the IESTARFM scheme, two additional factors are introduced into the weight calculation, namely a temporal distance weight and a cloud probability weight, so that cloudy pixels and pixels that are far in time from the prediction date receive lower weights and the influence of unreliable data on the fused results is reduced.
Using cloud masks or cloud probability information, each candidate similar pixel is assigned a coefficient that reflects its degree of clearness. For a candidate pixel $i$, this coefficient is defined as

$$C_i = P_{\mathrm{clear},i}$$

where $P_{\mathrm{clear},i}$ is the probability of being cloud free or a cloud mask indicator, with 1 for cloud-free pixels and 0 for cloudy pixels. Cloud-free pixels thus obtain a coefficient close to 1, whereas cloudy pixels receive 0 or a very small value. This coefficient is directly multiplied with the weight so that cloudy pixels are down-weighted. If a similar pixel is cloud covered in the base image, its effective weight is greatly reduced and cloud-contaminated values are prevented from interfering with the prediction:

$$W'_i = C_i \, W_i$$

where $W_i$ denotes the original ESTARFM weight function.
To account for the fact that the two base images may have different temporal distances to the prediction time, and that each observation has different representativeness for the prediction time when multiple dates are used, a temporal distance factor is further introduced into the weights. The temporal difference between the prediction date $t_p$ and the base date $t_i$ of candidate pixel $i$ is defined as $\Delta t_i = |t_p - t_i|$, and the corresponding temporal distance weight factor $T_i$ decreases as $\Delta t_i$ increases, at a rate set by the tuning constant τ. When only two base dates are available, this factor effectively acts on the final fusion weights, assigning a larger weight to the date closer to the prediction time. In this study, τ was set to 16 days, corresponding to two 8-day compositing intervals, so that observations close to the prediction date were given higher weights while nearby valid observations could still contribute to the fusion. When multiple images are involved in the fusion, the temporal factor can also be directly multiplied with the weight of each pixel pair, so that the influence of temporally distant data is reduced already at the neighborhood weighting stage.
The original ESTARFM weight function $W_i$ is built from the spectral difference between the similar pixel and the central pixel at the base date, the spatial distance, and two tuning parameters. After introducing the two new factors, the complete improved weight function can be written as follows:

$$W^{*}_i = W_i \, C_i \, T_i$$

and is normalized as follows:

$$\widehat{W}_i = \frac{W^{*}_i}{\sum_{j} W^{*}_j}$$

With this modification, pixels with a high probability of cloud obstruction do not obtain large weights even if they are spectrally similar, while neighboring observations that are close to the prediction date are given higher weights. This optimization improves the noise robustness of the model under cloud affected conditions and makes fuller use of time adjacent observations to increase prediction accuracy.
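The combined weighting can be sketched in Python as below. The sketch takes the original spectral–spatial ESTARFM weights as given inputs and multiplies them by a clear-sky coefficient and a temporal factor before normalizing. The exponential decay `exp(-Δt/τ)` is an assumed functional form chosen for illustration; the text only states that the factor is controlled by a tuning constant τ (16 days in this study) and decreases with temporal distance. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def temporal_factor(t_pred: float, t_base: float, tau: float = 16.0) -> float:
    """Assumed exponential decay with temporal distance (in days)."""
    return math.exp(-abs(t_pred - t_base) / tau)


def combined_weights(base_weights: Sequence[float],
                     p_clear: Sequence[float],
                     t_base: Sequence[float],
                     t_pred: float,
                     tau: float = 16.0) -> List[float]:
    """Multiply base weights by clear-sky and temporal factors, then normalize.

    base_weights: original spectral-spatial weights of the similar pixels
    p_clear:      clear-sky probability (or 0/1 mask) of each similar pixel
    t_base:       acquisition day of the base image each pixel comes from
    """
    raw = [w * c * temporal_factor(t_pred, t, tau)
           for w, c, t in zip(base_weights, p_clear, t_base)]
    total = sum(raw)
    if total == 0.0:
        return [0.0] * len(raw)  # no usable pixel in the neighborhood
    return [r / total for r in raw]


# Three equally similar pixels: one cloudy, one from a distant base date
weights = combined_weights([1.0, 1.0, 1.0], [1.0, 0.0, 1.0],
                           [0.0, 0.0, 32.0], t_pred=8.0)
```

In this example the cloudy pixel receives zero weight, and the pixel taken from the base image 24 days away contributes less than the one taken 8 days away, even though all three are equally similar spectrally.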
2.3.3. Method for Constructing kNDVI Time Series
In this study, the NDVI value of each pixel is first transformed into kNDVI through a kernel function. A linear fusion model is then applied in the kNDVI space to perform prediction, and the predicted results are finally mapped back to NDVI. Pan et al. [37] proposed a quadratic polynomial kernel that represents NDVI variation using NDVI and its squared term. Experimental results have shown that this approach can be effectively integrated into existing models.
To more accurately characterize the nonlinear response relationships among pixels, a quadratic term is introduced into the conversion coefficient model of IESTARFM, extending the original linear conversion formulation to a quadratic polynomial relationship, which can be expressed as follows:

$$\Delta F = a\,\Delta C + b\,\Delta C^{2}$$

where $\Delta F$ and $\Delta C$ are the fine- and coarse-resolution changes, and $a$ and $b$ are coefficients to be determined, with $a$ representing the linear term and $b$ the quadratic term. Since this formulation includes the squared term of the coarse-resolution change, it allows the predicted value to accelerate or decelerate relative to the coarse-scale variation, and can fit curve shapes that cannot be captured by a simple linear model. For NDVI describing the growth of highly dynamic lake wetland vegetation, the quadratic term helps to capture the curvature of its temporal trajectory.
Since only two base images are available, it is difficult to directly determine the two parameters of the quadratic curve. In this study, the sample size is increased in the spatial dimension. Within a local window, if the NDVI of different similar pixels shows a similar quadratic relationship with MODIS NDVI, all pixel pairs and the corresponding coarse-resolution NDVI values at the two base dates are collected, and a quadratic function of the fine-resolution change $\Delta F$ with respect to the coarse-resolution change $\Delta C$ is fitted using least squares regression to estimate the coefficients $a$ and $b$ shared by the entire window. This approach is equivalent to assuming that all pixels within the window follow the same NDVI change curve, while individual pixels may occupy different positions along this curve, and the fitted curve can then be applied to the prediction of all pixels in the window.
The coefficients a and b were estimated locally using least-squares regression within the same adaptive matching window defined in Section 2.3.1. Therefore, the fitting window was not an additional fixed-size window, but followed the adaptive radius r, which was constrained by $r_{\min}$ and $r_{\max}$. For each target pixel, only cloud-free candidate pixels satisfying the land-cover consistency constraint and similar-pixel selection criteria were used for coefficient fitting.
Specifically, for each valid similar pixel i, the fine-resolution kNDVI change and the corresponding coarse-resolution kNDVI change between two base dates were calculated as $\Delta F_i$ and $\Delta C_i$, respectively. The local quadratic relationship was then expressed as $\Delta F_i = a\,\Delta C_i + b\,\Delta C_i^{2}$, where a and b were solved by least-squares fitting using all valid similar pixels in the adaptive window. In matrix form, this can be written as $[a, b]^{\top} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$, where the $i$-th row of $\mathbf{X}$ is $[\Delta C_i, \Delta C_i^{2}]$ and $\mathbf{y} = [\Delta F_1, \ldots, \Delta F_n]^{\top}$. The fitted coefficients were then applied to predict the fine-resolution kNDVI change at the target date.
To avoid unstable fitting, the quadratic correction was applied only when a sufficient number of valid similar pixels were available within the adaptive window and when the fitted relationship was numerically stable. If the number of valid samples was insufficient, if the coarse-resolution change was too small to support stable quadratic fitting, or if the fitted coefficients produced unreasonable kNDVI values, the model reverted to the linear correction form by setting the quadratic term to zero. The predicted kNDVI values were also constrained within the physically meaningful range of the vegetation index. These boundary conditions were used to prevent overcorrection in highly heterogeneous or poorly observed regions.
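The local fitting and its fallback rules can be illustrated with a short Python sketch. It fits the two-parameter relationship (fine-resolution change as a linear plus quadratic function of coarse-resolution change, with no intercept) by solving the 2×2 normal equations directly, and reverts to a purely linear coefficient when samples are too few or the system is ill-conditioned. The thresholds `min_samples` and `min_det`, the clamping range of [0, 1], and the function names are assumptions for illustration; the study does not report the exact values it used.

```python
from typing import Sequence, Tuple


def fit_quadratic(dc: Sequence[float], df: Sequence[float],
                  min_samples: int = 10,
                  min_det: float = 1e-10) -> Tuple[float, float]:
    """Least-squares fit of df = a * dc + b * dc**2 over similar pixels.

    dc: coarse-resolution kNDVI change of each similar pixel
    df: fine-resolution kNDVI change of the same pixels
    Returns (a, b); b is 0.0 when the quadratic fit is not trustworthy.
    """
    s2 = sum(c ** 2 for c in dc)
    s3 = sum(c ** 3 for c in dc)
    s4 = sum(c ** 4 for c in dc)
    sy1 = sum(c * f for c, f in zip(dc, df))
    sy2 = sum(c ** 2 * f for c, f in zip(dc, df))
    if s2 == 0.0:
        return 0.0, 0.0  # no coarse change at all: nothing to convert
    det = s2 * s4 - s3 ** 2  # determinant of the normal-equation matrix
    if len(dc) < min_samples or det < min_det:
        return sy1 / s2, 0.0  # fall back to the linear conversion
    a = (sy1 * s4 - sy2 * s3) / det
    b = (s2 * sy2 - s3 * sy1) / det
    return a, b


def predict_fine(base_fine: float, dc_target: float, a: float, b: float,
                 lo: float = 0.0, hi: float = 1.0) -> float:
    """Apply the fitted curve and clamp to an assumed valid index range."""
    value = base_fine + a * dc_target + b * dc_target ** 2
    return min(hi, max(lo, value))


# Synthetic similar pixels that follow df = 1.2 * dc + 0.5 * dc**2 exactly
dc_samples = [0.1, 0.2, 0.3, 0.4, 0.5]
df_samples = [1.2 * c + 0.5 * c ** 2 for c in dc_samples]
a_hat, b_hat = fit_quadratic(dc_samples, df_samples, min_samples=5)
```

An absolute determinant threshold is scale dependent, so a relative condition check may be preferable in practice; it is kept simple here to mirror the described rule of reverting to the linear form when the coarse-resolution change is too small.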
This method can partially correct the systematic bias of a linear model. When the actual NDVI change is smaller than the linear prediction, a negative quadratic coefficient (b < 0) bends the predicted curve downward (concave) and reduces the predicted values. Conversely, when the actual change exceeds the linear prediction, a positive coefficient (b > 0) bends the curve upward (convex) and increases the predicted values, so that the model better matches the true NDVI dynamics. After introducing the nonlinear term, the correlation between the generated high-resolution NDVI time series and the actual observations is significantly enhanced and the mean squared error is reduced. Therefore, in highly dynamic lake wetland regions, combining the kNDVI framework with quadratic correction effectively captures the nonlinear variation characteristics of vegetation indices, improves the realism and stability of time series reconstruction, and ultimately supports the construction of a kNDVI dataset for the Poyang Lake wetland for 2024 with an 8-day temporal resolution and a 10 m spatial resolution.
2.4. Method for Wetland Vegetation Phenology Retrieval
After obtaining the high spatiotemporal resolution kNDVI time series, key phenological parameters were extracted by combining harmonic analysis with a dynamic threshold method. Given that vegetation in the Poyang Lake wetland is strongly influenced by hydrological conditions and often exhibits complex growth trajectories, traditional smoothing approaches such as Savitzky–Golay filtering are generally insufficient to capture its rapid-growth features. Therefore, a phenology retrieval framework based on HANTS was constructed in this study.
2.4.1. Time Series Reconstruction Based on HANTS
HANTS is a classical signal processing method based on Fourier transform that decomposes a discrete time series into a superposition of sine and cosine waves at different frequencies [38]. Compared with other filtering algorithms, HANTS has clear advantages in filling data gaps and removing nonperiodic noise, and is particularly suitable for fitting wetland vegetation growth curves with pronounced seasonal patterns.
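The basic formulation of the HANTS model is given by:

$$y(t) = a_0 + \sum_{i=1}^{n}\left[a_i \cos\left(2\pi f_i t\right) + b_i \sin\left(2\pi f_i t\right)\right] + \varepsilon(t)$$

where $y(t)$ is the kNDVI value at time $t$, $a_0$ is the mean of the time series, $n$ is the number of harmonics, $a_i$ and $b_i$ are the cosine and sine amplitude coefficients of the $i$-th harmonic with frequency $f_i$, respectively, and $\varepsilon(t)$ is the residual term.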
The reconstruction accuracy is highly sensitive to the choice of the harmonic order $n$. If the order is too low, the model tends to be underfitted and cannot effectively reproduce short-term details of vegetation growth. Conversely, an excessively high order amplifies observational noise and leads to overfitting. To determine the optimal harmonic order, this study adopts the Akaike Information Criterion (AIC) for adaptive selection. By balancing goodness of fit and model complexity, the AIC criterion can automatically determine the best order for each pixelwise time series, and is given by:

$$\mathrm{AIC} = N \ln\left(\frac{\mathrm{RSS}}{N}\right) + 2k$$

where $N$ is the number of observations, $\mathrm{RSS}$ is the sum of squared residuals, and $k = 2n + 1$ is the number of fitted coefficients. By minimizing the AIC value, the model can adaptively adjust the fitted curve for different vegetation types, thereby generating smooth and faithful continuous kNDVI time series.
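The order selection can be illustrated with a simplified Python sketch. It is not a full HANTS implementation: it assumes a gap-free, evenly spaced series covering exactly one period, in which case the least-squares harmonic coefficients reduce to discrete Fourier projections, and it omits the iterative outlier rejection that HANTS performs. The AIC is taken in its common least-squares form with `2 * n_harm + 1` fitted coefficients, which is an assumption about the penalty term; function names are hypothetical.

```python
import math
from typing import List, Sequence


def harmonic_fit(y: Sequence[float], n_harm: int) -> List[float]:
    """Mean plus n_harm harmonics fitted to an evenly spaced, gap-free series.

    Valid for n_harm < len(y) / 2, where the harmonic basis is orthogonal.
    """
    n_obs = len(y)
    fitted = [sum(y) / n_obs] * n_obs
    for i in range(1, n_harm + 1):
        angles = [2.0 * math.pi * i * t / n_obs for t in range(n_obs)]
        a_i = 2.0 / n_obs * sum(v * math.cos(w) for v, w in zip(y, angles))
        b_i = 2.0 / n_obs * sum(v * math.sin(w) for v, w in zip(y, angles))
        fitted = [f + a_i * math.cos(w) + b_i * math.sin(w)
                  for f, w in zip(fitted, angles)]
    return fitted


def aic(y: Sequence[float], fitted: Sequence[float], n_harm: int) -> float:
    """Least-squares AIC with 2 * n_harm + 1 fitted coefficients."""
    n_obs = len(y)
    rss = sum((o - f) ** 2 for o, f in zip(y, fitted))
    rss = max(rss, 1e-12)  # guard against log(0) for a perfect fit
    return n_obs * math.log(rss / n_obs) + 2 * (2 * n_harm + 1)


def select_order(y: Sequence[float], max_order: int = 4) -> int:
    """Harmonic order with the smallest AIC among 1..max_order."""
    return min(range(1, max_order + 1),
               key=lambda n: aic(y, harmonic_fit(y, n), n))
```

For a series of 46 eight-day composites, `select_order(series, 4)` would return the order whose extra harmonics still reduce the residual enough to offset the penalty of two additional coefficients each.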
2.4.2. Extraction of Wetland Vegetation Phenological Parameters
Based on the reconstructed kNDVI time series, this study uses the Dynamic Threshold Method to extract key phenological events. The dynamic threshold approach, also known as the seasonal amplitude (SA) method, was proposed by Song et al. [39]. Because wetland vegetation communities have complex structures and large regional differences in peak biomass, a fixed threshold method is difficult to apply. The dynamic threshold method determines phenological phases using relative amplitude, which effectively reduces the influence of background soil and differences in maximum biomass.
First, the HANTS fitted curve is used to determine the maximum value $kNDVI_{\max}$ and the baseline value $kNDVI_{\min}$ of the annual growth cycle and to construct a relative change rate series. The calculation is given by:

$$kNDVI_{ratio} = \frac{kNDVI_t - kNDVI_{\min}}{kNDVI_{\max} - kNDVI_{\min}}$$

where $kNDVI_t$ is the value at the current time step, and $kNDVI_{\max}$ and $kNDVI_{\min}$ are the maximum and minimum values within the growing season, respectively.
According to the recommendations of the proponents of the Dynamic Threshold Method and previous studies [
40], the decision thresholds for the start of season (SOS) and end of season (EOS) are set to 20% and 50% of the seasonal amplitude, respectively, which yields results consistent with observed conditions. In addition, to eliminate pseudo-phenological signals caused by short-term flooding or residual cloud contamination, a logical constraint based on the length of season (LOS) is introduced. At the pixel level, anomalous pixels with LOS shorter than 30 days are removed, and the procedure described in this section is applied pixel by pixel to extract pixel-scale phenological parameters. Finally, a kNDVI-based phenology map for the Poyang Lake wetland in 2024 is produced, as shown in
Figure 2.
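The threshold rules above can be sketched for a single pixel as follows. This is a simplified illustration under stated assumptions rather than the authors' code: the curve is assumed to be gap-free and to contain one growth cycle, a single minimum is used as the baseline for both limbs, SOS is taken as the first date on the rising limb at or above the 20% ratio, and EOS as the last date on the falling limb at or above the 50% ratio. The function name and arguments are hypothetical.

```python
import numpy as np


def extract_phenology(doy, curve, sos_frac=0.20, eos_frac=0.50, min_los=30):
    """Return (SOS, EOS, LOS) in days, or None if the pixel is rejected."""
    doy = np.asarray(doy)
    curve = np.asarray(curve, dtype=float)
    v_min, v_max = np.min(curve), np.max(curve)
    if not np.isfinite(v_max - v_min) or v_max <= v_min:
        return None  # flat or invalid curve: no seasonal amplitude

    # Relative position of each value between the baseline and the peak.
    ratio = (curve - v_min) / (v_max - v_min)
    peak = int(np.argmax(curve))

    # Rising limb: first date reaching the SOS fraction of the amplitude.
    rising = np.nonzero(ratio[: peak + 1] >= sos_frac)[0]
    # Falling limb: last date still at or above the EOS fraction.
    falling = np.nonzero(ratio[peak:] >= eos_frac)[0]
    sos = int(doy[rising[0]])
    eos = int(doy[peak + falling[-1]])

    los = eos - sos
    if los < min_los:
        return None  # LOS constraint: treat very short seasons as spurious
    return sos, eos, los


# Illustrative curves only: a broad season and a short spike.
days = np.arange(1, 366)
broad = np.exp(-((days - 200) ** 2) / (2 * 40.0 ** 2))
spike = np.exp(-((days - 200) ** 2) / (2 * 5.0 ** 2))
broad_result = extract_phenology(days, broad)
spike_result = extract_phenology(days, spike)
```

The broad curve yields a season of well over 30 days, whereas the spike, which mimics a short flooding or cloud artifact, is rejected by the LOS constraint.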
2.5. In-Situ Phenological Observations and Validation Sample Design
In-situ phenological observations were collected in the Poyang Lake wetland during the 2024 growing season to provide independent reference dates for validating remotely sensed phenological retrievals. The field dataset included 60 validation sites, consisting of 30 sites dominated by
Carex spp. (Cs) and 30 sites dominated by
P. australis (Pa). These sites were distributed across the main wetland vegetation zones and land–water transition areas in the study region. Due to space limitations, only representative field survey sites are presented in the manuscript. Specifically,
Table 1 lists 10 representative validation sites for Cs, and
Table 2 lists 10 representative validation sites for Pa, including their geographic coordinates, vegetation categories, and observation information. These representative sites are provided to illustrate the spatial coverage and field sampling design of the in-situ observations, while the complete set of 60 validation sites was used for accuracy assessment.
Field surveys were conducted during the key phenological periods of the two dominant wetland vegetation types, including the green-up, peak-growth, and senescence stages. For Cs, observations covered both the first and second growing seasons, whereas for Pa, observations focused on its single annual growing cycle. At each site, the dominant species, vegetation coverage, growth stage, and phenological status were recorded following a consistent field observation protocol. The start of season (SOS) was identified as the date when visible green-up and rapid leaf expansion became dominant within the plot, while the end of season (EOS) was identified as the date when widespread senescence or canopy browning was observed. When multiple observers participated in the field campaign, records were cross-checked after each survey to reduce observer-related uncertainty, and inconsistent records were resolved through joint inspection of field notes and photographs.
To compare point-based field observations with pixel-based remote sensing retrievals, a fixed spatial neighborhood with a radius of 15 m around each validation site was used to extract representative remotely sensed phenological values. This radius was selected because the field coordinates were accurately recorded and because a 15 m radius approximately covers the central Sentinel-2 pixel and its immediate neighboring pixels, thereby reducing single-pixel noise while minimizing the inclusion of heterogeneous land-cover types. Within each neighborhood, only pixels belonging to the same vegetation class as the field site were retained using an independent land-cover mask, and the median phenological date of the retained pixels was used for comparison. The land-cover mask used for neighborhood filtering was generated independently from the field phenological validation samples and was not used for model calibration, fusion training, or accuracy adjustment; it was used only to reduce class mismatch during pixel extraction.
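The neighborhood extraction can be sketched on a small raster as follows. The sketch assumes, for illustration only, that the site falls at the center of a 10 m pixel, that phenological dates and land-cover classes are held in aligned 2-D arrays, and that missing retrievals are stored as NaN. The function name and arguments are hypothetical.

```python
import numpy as np


def neighborhood_median(pheno_doy, class_map, row, col, site_class,
                        radius_m=15.0, pixel_m=10.0):
    """Median date of same-class pixels whose centers lie within radius_m of the site pixel."""
    reach = int(np.floor(radius_m / pixel_m))
    n_rows, n_cols = pheno_doy.shape
    values = []
    for dr in range(-reach, reach + 1):
        for dc in range(-reach, reach + 1):
            r, c = row + dr, col + dc
            if not (0 <= r < n_rows and 0 <= c < n_cols):
                continue  # outside the raster
            if np.hypot(dr, dc) * pixel_m > radius_m:
                continue  # pixel center beyond the search radius
            if class_map[r, c] != site_class or np.isnan(pheno_doy[r, c]):
                continue  # different vegetation class or no retrieval
            values.append(pheno_doy[r, c])
    return float(np.median(values)) if values else None


# Illustrative 5 x 5 raster: dates 100..124, with two pixels of another class.
pheno = np.arange(25, dtype=float).reshape(5, 5) + 100.0
classes = np.ones((5, 5), dtype=int)
classes[1, 1] = 2
classes[1, 2] = 2
center_value = neighborhood_median(pheno, classes, 2, 2, site_class=1)
corner_value = neighborhood_median(pheno, classes, 0, 0, site_class=1)
```

With 10 m pixels, a 15 m radius admits the central pixel and its eight immediate neighbors (the diagonal neighbors lie about 14.1 m away), which matches the window described above.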
2.6. Accuracy Assessment
To evaluate the reliability of the remotely sensed phenology retrieval, independent reference phenological dates were used to test the temporal consistency of the retrieved start of season (SOS) and end of season (EOS), and day-level error statistics were adopted as the main accuracy metrics. The reference phenological dates were derived from field observations. Both the reference dates and the remotely sensed results were converted to day of year (DOY) and paired according to sample location. Considering the scale mismatch between point observations and pixel-based estimates, as well as the strong heterogeneity of wetland landscapes, a fixed spatial neighborhood was constructed around each validation sample to extract a representative remote-sensing-based phenology value. Land-cover consistency constraints were further applied within this neighborhood to reduce the influence of mixed pixels on the error statistics.
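Let $T_i^{\mathrm{ref}}$ and $T_i^{\mathrm{RS}}$ denote the reference and remotely sensed phenological dates of the $i$-th sample, respectively, where $T$ can represent either SOS or EOS, and let $n$ be the total number of samples. The day-scale error is defined as follows:

$$e_i = T_i^{\mathrm{RS}} - T_i^{\mathrm{ref}}$$

and the Mean Error or bias (ME), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE) are calculated as follows:

$$\mathrm{ME} = \frac{1}{n}\sum_{i=1}^{n} e_i, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|e_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{2}}$$

In addition, to characterize the level of agreement, the correlation coefficient $r$ and coefficient of determination $R^2$ between the retrieved and reference dates are calculated as:

$$r = \frac{\sum_{i=1}^{n}\left(T_i^{\mathrm{RS}} - \bar{T}^{\mathrm{RS}}\right)\left(T_i^{\mathrm{ref}} - \bar{T}^{\mathrm{ref}}\right)}{\sqrt{\sum_{i=1}^{n}\left(T_i^{\mathrm{RS}} - \bar{T}^{\mathrm{RS}}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(T_i^{\mathrm{ref}} - \bar{T}^{\mathrm{ref}}\right)^{2}}}, \qquad R^2 = r^2$$

where $\bar{T}^{\mathrm{RS}}$ and $\bar{T}^{\mathrm{ref}}$ are the sample means of the retrieved and reference dates, respectively.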
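A minimal Python sketch of these accuracy metrics is given below. Two conventions are assumptions made for illustration: the error is taken as the retrieved date minus the reference date, so a positive ME means the retrieval is late, and the coefficient of determination is taken as the square of the Pearson correlation coefficient. The function name and the sample dates are hypothetical.

```python
import numpy as np


def phenology_errors(ref_doy, rs_doy):
    """Day-level error statistics between reference and retrieved dates (DOY)."""
    ref = np.asarray(ref_doy, dtype=float)
    rs = np.asarray(rs_doy, dtype=float)
    err = rs - ref  # assumed sign convention: positive means retrieved later
    r = float(np.corrcoef(ref, rs)[0, 1])  # Pearson correlation coefficient
    return {
        "ME": float(np.mean(err)),
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "r": r,
        "R2": r ** 2,  # assumed: R^2 as the squared Pearson correlation
    }


# Hypothetical dates for four sites, not values from this study.
stats = phenology_errors([100, 110, 120, 130], [102, 109, 123, 130])
```

Reporting ME alongside MAE and RMSE separates systematic bias from scatter: a retrieval can have a near-zero ME while individual sites are still several days off.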
4. Discussion
4.1. Error Mechanisms and Improvements of Multi Source Remote Sensing Image Fusion in Wetland Environments
Lake wetlands are characterized by pronounced land–water interlacing, fragmented landscapes, and rapid state transitions, which make the same-type pixel assumption difficult to satisfy in space and increase the likelihood that abrupt changes and observation gaps occur simultaneously in time. As a result, weight based spatiotemporal fusion tends to generate blocky artifacts, texture fragmentation, and spectral bias along boundaries and in change hotspots. Recent reviews of spatiotemporal fusion generally point out that, under heterogeneous landscapes and rapidly changing surfaces, the main bottlenecks of traditional methods lie in the reliability of similar pixel matching, the limited representativeness of observations during change periods, and the lack of effective control over error propagation; the absence of unified benchmark datasets and standardized evaluation frameworks further limits method comparability and generalization [
22,
41]. Against this background, the reconstructed series in this study achieve $R^2$ = 0.875 and RMSE = 0.066 on the validation set while maintaining better spatial continuity and detail fidelity in cloud-covered and shadowed areas, indicating that the proposed improvements help suppress error accumulation under strongly heterogeneous and cloudy wetland conditions. Compared with weight-based frameworks represented by STARFM and ESTARFM, which are constrained in handling abrupt changes and mixed pixels, FSDAF enhances adaptability to heterogeneous landscapes and sudden changes through spectral unmixing and spatial interpolation. However, subsequent studies still highlight room for improvement in handling complex changes and in robustness, and methods such as FSDAF2.0 explicitly focus on improving change detection and stability [
42,
43]. These findings suggest that enhancing fusion performance in wetland scenarios generally requires concurrent strengthening of three aspects: sample availability, quality constraints, and change representation. In this study, an adaptive matching window is used to alleviate the shortage of valid samples under cloud cover and fragmented landscapes, and cloud probability and temporal distance are explicitly incorporated into the weight allocation to reduce the impact of unreliable observations. This is consistent with the existing emphasis on mitigating error propagation through quality control and spatiotemporal constraints [
44]. In recent years, deep learning-based fusion and reconstruction methods have developed rapidly, especially optical–SAR fusion frameworks designed to cope with persistent cloud cover, which show considerable potential for vegetation index time-series reconstruction in cloudy regions. Compared with these state-of-the-art deep learning approaches, IESTARFM has the advantages of clearer physical interpretability, lower dependence on large training datasets, simpler parameterization, and easier implementation in operational monitoring workflows. These advantages are particularly relevant in wetland regions where high-quality training samples and multi-year reference datasets are often limited. However, deep learning models are generally more powerful in learning complex nonlinear and cross-sensor relationships, and optical–SAR fusion models can make use of all-weather SAR observations to compensate for missing optical data under persistent cloud cover. Therefore, IESTARFM should be regarded as a robust and controllable fusion framework for pre-filtered optical time series, rather than a replacement for deep learning or optical–SAR fusion methods under all conditions. Incorporating Sentinel-1 SAR observations is a planned direction for future work, because SAR can provide complementary structural and hydrological information under cloudy conditions. Nevertheless, the integration of Sentinel-1 data requires careful treatment of optical–SAR radiometric differences, scattering mechanisms, speckle noise, and vegetation–water interaction signals in wetlands.
From the perspective of computational cost and scalability, IESTARFM was implemented on the Google Earth Engine platform using server-side preprocessing, cloud masking, 8-day compositing, kNDVI calculation, similar-pixel searching, pixel-wise weight calculation, and image export. The main computational cost comes from the adaptive-window search and pixel-wise weight calculation, whereas cloud masking, temporal compositing, and kNDVI calculation are relatively lightweight operations in GEE. In our implementation, reconstructing one target-date image generally required approximately 8–13 min under the current study-area extent and export settings. This runtime should be interpreted as an empirical reference rather than a fixed algorithmic constant, because GEE execution time can be affected by server load, export scale, region size, and task scheduling. In practical applications, the reconstruction can be organized by 8-day time steps and spatial tiles, making the method feasible for regional-scale wetland monitoring without requiring local GPU training or large local storage. For larger areas or multi-year applications, tiled processing and batch export are recommended to maintain scalability.
4.2. Limitations and Future Research
Although the fusion strategy improves temporal continuity and spatial detail, the remaining uncertainties can be broadly attributed to both algorithm-related and data-related limitations. From the algorithmic perspective, the resampling of MODIS observations from 500 m to the Sentinel-2 10 m grid was used only for spatial registration and pixel-wise fusion implementation, but it may still introduce local smoothing effects and mixed-pixel uncertainty, particularly along fragmented land–water boundaries. In addition, the quadratic correction term assumes that similar pixels within a local window share a comparable nonlinear relationship between coarse- and fine-resolution kNDVI changes; this assumption improves the representation of nonlinear vegetation growth but may be less effective under abrupt hydrological disturbances or rapid land-cover transitions. From the data perspective, residual radiometric inconsistencies between Sentinel-2 and MODIS, possible geometric co-registration errors, residual thin clouds or shadows, and the limited availability of valid Sentinel-2 observations may also contribute to reconstruction uncertainty. The overall validation result, with
$R^2$ = 0.875 and RMSE = 0.066, together with the regional comparison results in
Table 3, suggests that these uncertainties were reduced but not completely eliminated. Therefore, the remaining errors should be interpreted as the combined effect of algorithmic assumptions and input data quality. The second growing season of Cs coincides with the dry-season hydrological transition period in the Poyang Lake wetland. During this stage, rapid environmental changes associated with seasonal water-level fluctuations and exposed mudflats may interfere with the kNDVI signal, thereby contributing to the observed differences in EOS retrieval accuracy.
IESTARFM improves robustness by incorporating cloud-probability and temporal-distance weighting, but its performance still depends on the availability of sufficient valid fine-resolution observations. Under extremely persistent cloud cover, where valid Sentinel-2 observations are insufficient to constrain fine-scale spatial structure, reconstruction uncertainty may increase. Future studies using longer multi-year time series or additional sensors could further quantify fusion accuracy across a wider cloudiness gradient.
Although the proposed IESTARFM framework was evaluated using the 2024 Poyang Lake wetland dataset, the methodological framework itself is not limited to a single year or site. The main processing logic, including cloud masking, spatiotemporal fusion, kNDVI reconstruction, and phenological parameter extraction, can be transferred to other years or similar wetland systems. However, some parameter settings may be affected by the quality of the original remote sensing data, such as cloud contamination, the number of valid Sentinel-2 observations, temporal gaps, and local landscape heterogeneity. Therefore, when applying the method to other years or wetland regions, minor parameter adjustment or local validation may still be necessary.
Future work will focus on optimizing the preprocessing steps prior to fusion, including more consistent cloud masking, stricter multi-sensor geometric co-registration, and enhanced radiometric normalization, in order to improve robustness across regions. In addition, the fusion framework will be extended by integrating complementary satellite observations, such as other optical sensors or radar data, to further mitigate the impact of persistent cloud cover and enhance the applicability of the method in cloudy and rainy regions.
5. Conclusions
In response to the challenges posed by scarce valid observations under cloudy and rainy conditions, highly fragmented wetland surfaces, and pronounced nonlinear responses of vegetation indices, this study proposes a spatiotemporal fusion algorithm for remote sensing imagery, IESTARFM, tailored to wetland vegetation phenology retrieval in cloudy and rainy regions. Using the Poyang Lake wetland in 2024 as a case study, the main conclusions are summarized as follows:
(1) At the model level, IESTARFM targets radiometric bias and noise propagation caused by mixed pixels in land–water ecotones and thin cloud contamination. An adaptive matching window and a joint weighting scheme based on temporal distance and cloud probability are constructed to improve the homogeneity of similar pixel selection and the reliability of observations. In addition, a quadratic polynomial correction is introduced into the fusion relationship to represent the nonlinear growth response of kNDVI. These modifications alleviate the systematic prediction bias of traditional linear fusion in highly dynamic periods and enhance both detail fidelity and temporal continuity of the reconstructed time series.
(2) At the application level, the high spatiotemporal continuity kNDVI sequences generated by IESTARFM can stably describe the growth processes of Poyang Lake wetland vegetation and support phenological parameter retrieval. The validation results show that the agreement between the fused images and reference observations is substantially improved, with
$R^2$ = 0.875 and RMSE = 0.066 for the validation set. The retrieved SOS and EOS also exhibited high temporal consistency with in-situ measurements, with
$R^2$ values up to 0.86 and RMSE values reported for each species and phenophase in
Figure 8 and
Figure 9. These findings indicate that the proposed method has good robustness in suppressing thin cloud noise, reducing water background interference, and mitigating index saturation in high biomass areas, and can provide reliable technical support for high accuracy and continuous monitoring of wetland vegetation phenology in cloudy and rainy regions.