1. Introduction
Western Australia (WA) is a globally significant agricultural region, characterised by vast broadacre cropping systems dominated by wheat, barley, canola, and legume and pulse crops. In the 2024 growing season, more than 8.3 million hectares were planted, contributing to one of the largest grain harvests on record (approximately 23 million metric tons) and reflecting continued expansion of the cropping area driven by agronomic, economic, and climatic trends [1]. In such a large and dynamic production system, accurate and timely information on crop distribution is crucial to support regional agricultural intelligence, yield forecasting, biosecurity surveillance, resource allocation for logistics and marketing, and policy development.
Agricultural land monitoring also plays a critical role in addressing global challenges such as rapid population growth, rising food demand, climate variability and shifting consumption patterns [
2,
3,
4]. This pressure highlights the need for reliable spatial and temporal information on cropping systems to support sustainable food production and resource management. Crop distribution data and location-specific crop classification maps enable timely monitoring of cropping patterns and provide key inputs for in-season yield forecasting and agricultural risk assessment. This need for accurate crop-type mapping is particularly pronounced in the Mediterranean-type climate of the southwest agricultural region in WA, where strong interannual rainfall variability, increasing climate extremes, and diverse management practices influence crop phenology and productivity. Despite this uncertainty, most crop-mapping products remain post-seasonal or proprietary, arriving too late to inform in-season decisions.
Advances in digital agriculture under the Agriculture 4.0 paradigm that integrates smart farming technologies, big data analytics, artificial intelligence (AI), the Internet of Things (IoT), and information and communication technology (ICT) are reshaping agricultural monitoring and decision-making [
5,
6,
7,
8,
9]. In parallel, progress in earth observation (EO), cloud computing, and data-driven modelling has enabled scalable crop mapping, demonstrated by operational products such as the USDA Cropland Data Layer (CDL) in the United States [
10,
11,
12], Canada’s Annual Crop Inventory [
13], and Copernicus high-resolution crop type layers across the European Union [
14,
15]. However, despite these advances, WA lacks an open, scalable, and explainable crop mapping framework capable of providing reliable in-season crop classification and transferable performance across multiple years and regions. Addressing this gap requires robust modelling approaches and transparent interpretation methods to support agricultural monitoring and decision-making at regional scales.
Satellite remote sensing has advanced significantly in recent decades, particularly with the launch of the Sentinel-1 and Sentinel-2 missions under the Copernicus programme. Sentinel-2 provides freely accessible optical imagery at 10–20 m spatial resolution with a five-day revisit cycle, enabling the monitoring of crop growth dynamics throughout the growing season [
16,
17,
18]. When combined with long-term Landsat archives, these datasets allow the detection of spectral and structural differences among crops across key phenological stages. However, the 8–16-day revisit cycle of Landsat-7, -8 and -9 limits their suitability for near-real-time monitoring in regions with persistent cloud cover [
19]. In addition, the 30 m spatial resolution may not resolve individual paddocks in heterogeneous agricultural landscapes [
20]. Sentinel-2’s higher temporal and spatial resolution addresses these constraints, though processing time series at scale requires substantial computational resources. Cloud computing platforms such as Google Earth Engine (GEE) have reduced technical barriers, enabling large-scale and near-real-time crop monitoring [
21,
22].
Remote sensing-based crop classification has largely followed two broad strategies. The first relies on spectral features from a single satellite scene sampled on a particular day within the growing season [
23,
24]. While computationally efficient, this approach often struggles to distinguish some crops with similar spectral responses during peak growth, leading to reduced classification accuracy. The second strategy incorporates multi-temporal time-series data, allowing models to capture seasonal dynamics and crop-specific phenological patterns throughout the season [
7,
25,
26,
27]. Time-series approaches are particularly valuable because different crops exhibit distinct growth trajectories, which can reveal when crop discrimination becomes possible and which phenological stages contain the most useful information [
28,
29]. Previous studies have demonstrated the potential of time-series satellite data for crop classification in systems such as rice, maize, soybean and major summer crops [
30,
31,
32]. However, much of this evidence is based on single-year experiments, within-year train–test splits, or simplified crop systems, including binary rotations [
21,
32,
33], leaving the year-to-year transferability of in-season classifiers less well resolved for diverse broadacre environments such as those found in WA.
Within this context, Extreme Gradient Boosting (XGBoost) and Long Short-Term Memory (LSTM) networks represent two complementary strategies for learning from time-series Earth observation data. XGBoost is well suited to tabular remote-sensing predictors because it can model non-linear relationships among spectral and temporal variables, accommodate heterogeneous feature importance, and perform strongly in operational agricultural applications [
27,
34,
35]. LSTM networks were developed to learn temporal dependencies in sequential data [
36,
37] and have shown strong performance in crop classification, in some cases exceeding traditional machine-learning approaches when crop separability depends on dense phenological trajectories [
31,
38,
39,
40]. More recent developments, including BiLSTM, hybrid temporal masking, and multimodal memory-based architectures, further highlight the value of recurrent and memory-based models for extracting temporal, spatial, and spectral information from remote-sensing data [
41,
42,
43,
44]. These models provide a useful comparison between two ways of learning from the same seasonal vegetation index (VI) information: XGBoost as a tabular learner over date-specific predictors, and LSTM as a sequential learner over ordered VI trajectories.
Yet two issues continue to constrain operational uptake. First, many studies use limited temporal coverage, which reduces confidence that in-season performance will remain stable across seasons with different rainfall, temperature, and management conditions. Second, high-performing classifiers often operate as black boxes, providing little insight into which phenological stages or VIs drive crop discrimination. Both issues are especially important in the Mediterranean cropping systems of south-western WA, where strong inter-annual variability can shift emergence, canopy development, flowering, and senescence, and thereby alter the timing of spectral separability.
Here, we use five growing seasons of Sentinel-2 observations (2020–2024) and paddock-level reference labels to evaluate in-season crop classification for six dominant crop classes in the southwest agricultural region of WA. We compare XGBoost and LSTM under a leave-one-year-out cross-validation (LOYOCV) design, assess independent external performance using field-based observations for canola, wheat, and barley, and progressively truncate the seasonal time series to quantify how classification skill changes as the season progresses. We then apply SHapley Additive exPlanations (SHAP) to identify which VIs and observation windows drive model predictions. Our objectives are to determine the earliest decision-ready mapping window, assess year-to-year transferability, and relate model performance to phenological timing. The contribution of our study lies in combining multi-year temporal transfer testing, progressive in-season truncation, and phenology-linked feature attribution to identify when in-season crop mapping becomes reliable and why.
2. Materials and Methods
2.1. Study Area
This study focuses on the southwest broadacre agricultural region of WA (Figure 1). The study area extends from Geraldton in the north to Esperance in the south, spanning approximately 20 million hectares in 2022 [45]. This region exhibits a Mediterranean-type climate characterised by hot, dry summers and mild to cool, wet winters, shaped by the seasonal shift in the tropical ridge and the passage of winter westerly storm systems that deliver cold fronts and low-pressure events to the southwest. Mean maximum temperatures peak in January–February (~31–32 °C) and are lowest in July (~18 °C), with typical diurnal ranges of 8–18 °C. Rainfall is strongly seasonal, averaging ~700 mm annually with around 80 rain days (>1 mm) per year [46]. Agricultural soils in WA are inherently low in nitrogen and phosphorus availability and are predominantly sandy to duplex in profile [
47,
48].
Importantly, the region displays pronounced spatial climatic gradients that influence both agricultural production and crop phenology. Rainfall generally declines with increasing distance from the coast, creating a marked west–east gradient in moisture availability. Higher-rainfall zones in the western and southern coastal districts support earlier crop establishment and longer growing seasons, whereas the drier inland eastern grainbelt is characterised by shorter seasons and increased susceptibility to moisture stress. A north–south temperature gradient also exists: northern areas experience warmer conditions and an earlier growing-season onset, while cooler southern zones delay emergence, elongate vegetative growth, and shift phenological stages later into the season. These interacting gradients in rainfall, temperature, and season length contribute to substantial variability in the timing of crop development across the study area. Such temporal shifts in phenology highlight the importance of time-sensitive approaches for in-season crop classification, particularly when applying fixed calendar-based observation windows across multiple years and locations.
The main grains cultivated in the southwest farming system include wheat, barley, canola, and lupins [
49]. In recent years, particularly since the late 2000s, the area under canola cultivation has expanded significantly, especially in the southern regions where average rainfall exceeds 400 mm [
49]. Annual pastures constitute the predominant land cover type, reflecting the extensive areas dedicated to livestock grazing in the WA rangelands and agricultural region [50]. Broadacre crops in WA follow a typical seasonal pattern: autumn sowing with the onset of consistent seasonal rainfall (April–June), winter vegetative growth (June–August), spring flowering (August–October) and harvest (October–December).
Table 1 presents published sowing, flowering, maturity, and harvest ranges for the target crops from WA-based and closely comparable Australian agronomic sources. These ranges were harmonised into approximate regional phenology windows for the southwest agricultural region to support interpretation of temporal separability in the classification experiments. Because phenological timing varies with cultivar, sowing date, rainfall zone, and seasonal conditions, these windows should be interpreted as indicative rather than as fixed stage boundaries for all paddocks or years.
2.2. Labelled Reference Data Sources
Crop type labels used for supervised training and validation were obtained from Digital Agriculture Services (DAS), unpublished data, under a research licence. These labels were derived from an independently developed operational crop-mapping system and were used for model calibration and benchmarking. Details regarding the DAS training data and machine learning model are described in Lawes et al. [
45] and Fowler et al. [
57]. While these data do not represent field-verified ground truth, they provide a consistent and spatially comprehensive reference dataset suitable for large-scale modelling applications, particularly where field observations are limited or unavailable. The use of high-quality, model-derived labels as reference data is increasingly adopted in remote sensing studies for regional-scale classification and benchmarking, particularly when operational products are available and internally validated. In this study, DAS labels are treated as a reference dataset rather than absolute truth, and model evaluation is complemented by an independent external test dataset of field-based observations to assess robustness and generalisability.
In addition to the training dataset, an independent field-based test dataset was collected between 2020 and 2024 through crop disease surveillance activities conducted by the Department of Primary Industries and Regional Development (DPIRD) across Western Australian farming systems. Field surveys were undertaken annually during the growing season, with crop type and associated observations recorded at paddock level across a broad geographic transect spanning from Geraldton to Esperance.
After restricting the DAS reference dataset to the six target classes (barley, canola, fallow, lupins, pasture, and wheat) and excluding low-confidence DAS records, 1,247,690 paddock-level samples were retained for model development. Under the outer LOYOCV design, the held-out sample counts were 242,630 (2020), 247,239 (2021), 253,926 (2022), 254,771 (2023), and 249,124 (2024), with corresponding training counts of 1,005,060, 1,000,451, 993,764, 992,919, and 998,566. The independent external field-based test dataset comprised 340 paddocks, was reserved for final evaluation, and included canola, wheat, and barley only.
2.3. Satellite Data Preparation
Figure 2 shows the overall methodology used for data preparation and classification model development. Sentinel-2 Level-2A surface reflectance imagery at 10 m spatial resolution (COPERNICUS/S2_SR_HARMONIZED) was processed using the Google Earth Engine (GEE) platform to generate multi-temporal inputs across the study region. To reduce cloud and shadow contamination, cloud masking was applied using a combination of the Sentinel-2 Scene Classification Layer (SCL) and the cloud probability band (MSK_CLDPRB). Pixels classified as cloud shadow (SCL = 3) and cirrus (SCL = 10) were removed, and an additional threshold was applied to exclude pixels with cloud probability greater than 5%. This combined approach provides more conservative and robust filtering of atmospheric contamination than a single quality layer. To balance temporal resolution with data completeness, a 10-day median composite product [
58] was produced for each growing season between April and October for the years 2020–2024. Despite this preprocessing, optical time-series data remain susceptible to temporal gaps, particularly during periods of persistent cloud cover. To ensure continuous time-series inputs for subsequent modelling, linear interpolation was applied to fill missing observations within the composite sequences, preserving seasonal trajectories while minimising artefacts associated with irregular acquisition intervals. Time-series spectral bands, including blue, green, red, near-infrared (NIR), red-edge 1, and red-edge 2, were extracted for the growing season from April to October for each year between 2020 and 2024. Shortwave infrared (SWIR-1 and SWIR-2) bands were not included in this analysis to maintain a consistent 10 m spatial resolution across all input features and avoid potential artefacts introduced by resampling 20 m bands to finer resolution. Although 10-day median compositing and linear interpolation may smooth short-term phenological variation, this approach was adopted to reduce residual noise and cloud-related gaps while preserving the broader seasonal vegetation trajectories required for in-season crop classification.
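The masking, compositing, and gap-filling steps described above can be illustrated outside GEE with a minimal pure-Python sketch for a single paddock and a single band or index. The tuple layout (day of year, value, SCL class, cloud probability), the function names, and the sample values are hypothetical, and the handling of gaps at the start or end of the series (copying the nearest value) is an assumption, since the text only specifies linear interpolation.

```python
from statistics import median

MASKED_SCL = {3, 10}   # cloud shadow and cirrus, as listed in the text
MAX_CLOUD_PROB = 5     # percent; observations above this are dropped


def is_clear(scl, cloud_prob):
    """Combined rule: SCL class not masked and cloud probability <= 5%."""
    return scl not in MASKED_SCL and cloud_prob <= MAX_CLOUD_PROB


def ten_day_median(observations, start_doy, n_periods):
    """Median of clear observations per 10-day window; None when empty."""
    composites = []
    for k in range(n_periods):
        lo = start_doy + 10 * k
        window = [value for doy, value, scl, cloud_prob in observations
                  if lo <= doy < lo + 10 and is_clear(scl, cloud_prob)]
        composites.append(median(window) if window else None)
    return composites


def fill_linear(series):
    """Linearly interpolate interior gaps.

    Assumption: leading or trailing gaps copy the nearest known value.
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return list(series)
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = series[left] + frac * (series[right] - series[left])
    return filled


# Hypothetical observations: (day of year, index value, SCL, cloud prob %)
obs = [
    (91, 0.20, 4, 0),
    (95, 0.24, 4, 2),
    (97, 0.90, 10, 1),   # cirrus, dropped
    (112, 0.40, 4, 60),  # cloud probability too high, dropped
    (123, 0.50, 4, 0),
]
composites = ten_day_median(obs, start_doy=91, n_periods=4)
filled = fill_linear(composites)
```

In this toy series the second and third windows contain no clear observation, so they are filled along the straight line between the first and fourth composites.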
To derive paddock-level inputs for model development, zonal statistics were applied using paddock boundaries as spatial units. For each composite date and spectral band, the mean value of all cloud-free Sentinel-2 pixels within each paddock was calculated, resulting in a paddock-scale time series of spectral features used for crop classification. Each paddock was treated as a single observational unit in the modelling framework, consistent with object-based (parcel-level) classification approaches. Consequently, all subsequent model training and validation procedures were based on paddock-level samples rather than individual pixels or paddock area, ensuring that each paddock contributed equally to the analysis and reducing potential bias associated with varying paddock sizes, although smaller paddocks may exhibit greater variability due to fewer contributing pixels.
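As a minimal sketch of the paddock-level aggregation, the following assumes that the pixels of one paddock on one composite date are available as a flat list with a matching clear-sky mask. The function name and the sample values are hypothetical; in practice this step would be run as a zonal reduction over the paddock polygons.

```python
def paddock_mean(pixel_values, clear_mask):
    """Mean of cloud-free pixels in one paddock; None if no pixel is clear."""
    clear = [v for v, ok in zip(pixel_values, clear_mask) if ok]
    return sum(clear) / len(clear) if clear else None


# Hypothetical paddock with four pixels, one of which is masked
values = [0.30, 0.34, 0.90, 0.32]
mask = [True, True, False, True]
mean_value = paddock_mean(values, mask)
```

Returning None for a fully masked paddock leaves a gap that a later interpolation step can fill, and each paddock yields exactly one value per date regardless of its size.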
Vegetation Indices (VIs) and Feature Engineering
A diverse set of VIs derived from Sentinel-2 spectral data was used as model features to capture crop growth dynamics and canopy biochemical properties throughout the growing season. These indices were selected to represent key spectral–biophysical relationships, such as chlorophyll concentration, vegetation density, and canopy structure, providing both temporal and structural context for LSTM and XGBoost models.
Traditional indices such as the Normalised Difference Vegetation Index (NDVI) [
59] and the Enhanced Vegetation Index 2 (EVI2) [
60,
61] were included to represent photosynthetic activity and overall vegetation vigour across the season. The Soil-Adjusted Vegetation Index (SAVI) [
62] was used to reduce soil brightness effects, particularly in early phenological stages, while the Normalised Difference Red-Edge Index (NDRE2) [
63] enhanced sensitivity to chlorophyll variation within dense crop canopies. In addition, the Visible Difference Vegetation Index (VDVI) [
64] was calculated from visible reflectance bands (blue, green, red); it was originally developed to provide a vegetation signal independent of near-infrared data for applications where NIR information is unavailable or unreliable. The Vegetation Coverage Index (VCI) proposed by He et al. [
65] was also incorporated; unlike earlier drought-monitoring VCIs, this new index directly estimates fractional vegetation cover (FVC) using spectral contrast between vegetation and soil. The VCI demonstrates improved linearity with canopy closure and minimal sensitivity to soil background, making it a robust input for models that infer crop coverage and vigour at fine spatial scales [
65]. Furthermore, two recently developed indices, Vi2 and Vi3, proposed by Ashourloo et al. [66], were integrated to enhance separability among cereal crops, particularly wheat and barley. These indices exploit optimised combinations of Sentinel-2 visible and red-edge bands identified through Relief-F feature selection, emphasising phenological differences during heading and flowering stages. When tested in combination with SVM and Random Forest classifiers, Vi2 and Vi3 achieved classification accuracies above 88% and Kappa > 0.80, outperforming NDVI and EVI2 for similar datasets [66]. The equations used to calculate the VIs are listed in Table S2.
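For orientation, four of these indices can be sketched using their commonly published forms. This is an illustration under stated assumptions rather than the study's implementation: Table S2 remains the authoritative source for the equations actually used, the SAVI soil factor of 0.5 is the conventional default and is assumed here, and NDRE2, VCI, Vi2, and Vi3 are omitted because their exact band combinations are not given in this section. The reflectance values are hypothetical.

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def evi2(nir, red):
    """Two-band Enhanced Vegetation Index."""
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)


def savi(nir, red, soil_factor=0.5):
    """Soil-Adjusted Vegetation Index; soil_factor=0.5 is an assumed default."""
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)


def vdvi(blue, green, red):
    """Visible-band Difference Vegetation Index (no NIR required)."""
    return (2.0 * green - red - blue) / (2.0 * green + red + blue)


# Hypothetical paddock-mean surface reflectances (0-1 scale)
blue, green, red, nir = 0.05, 0.10, 0.08, 0.40
indices = {
    "NDVI": ndvi(nir, red),
    "EVI2": evi2(nir, red),
    "SAVI": savi(nir, red),
    "VDVI": vdvi(blue, green, red),
}
```

Applying these functions to each composite date yields one VI time series per paddock, which is the form of input used by both classifiers.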
2.4. Classification Model
Two complementary machine learning algorithms were implemented to evaluate in-season crop classification performance using Sentinel-2 derived VI features: XGBoost and LSTM networks. These algorithms were selected to represent contrasting but widely adopted approaches for modelling multi-temporal remote-sensing data. XGBoost was used as a tabular tree-based learner, in which multi-temporal vegetation-index observations were represented as fixed predictors. In contrast, LSTM was used as a sequential learner, in which the same vegetation-index observations were arranged as an ordered time series. Therefore, the comparison between XGBoost and LSTM should be interpreted as an evaluation of tabular and sequential representations of the same underlying spectral-temporal information.
2.4.1. Extreme Gradient Boosting (XGBoost)
The XGBoost algorithm is an advanced ensemble learning method based on gradient-boosted decision trees, optimised for speed and predictive performance [
67]. It sequentially builds an ensemble of weak learners, typically decision trees, where each new tree corrects the residual errors of the previous ones using gradient descent optimisation [
68]. XGBoost incorporates regularisation (L1 and L2) to prevent overfitting and employs techniques such as parallelised tree construction and weighted quantile sketching for handling large, sparse datasets efficiently [
67]. The algorithm’s key hyperparameters govern its learning process and model complexity. The learning rate (η) controls the contribution of each tree, while max_depth and min_child_weight regulate tree size and overfitting [69]. The subsample and colsample_bytree parameters introduce stochasticity by sampling subsets of the data and features, enhancing generalisation. Regularisation parameters (λ, L2; α, L1) penalise overly complex trees, while gamma (γ) controls the minimum loss reduction required for further partitioning [
67]. During training, XGBoost uses second-order gradient-based optimisation to efficiently compute both gradients and curvatures, thereby accelerating convergence and improving stability [
67]. To maintain strict separation between model development and evaluation, XGBoost hyperparameter optimisation was performed only within the training years of each LOYO fold. Candidate hyperparameter combinations were evaluated using three-fold cross-validation applied exclusively to the four training years, with macro-F1 score used as the optimisation metric to give balanced weight to all crop classes. The held-out year was not used during hyperparameter tuning or model selection and was reserved only for fold-level evaluation. The evaluated hyperparameter space included n_estimators: 100, 200, and 300; max_depth: 3, 5, 7, and 10;
learning rate (η): 0.01, 0.05, 0.1, and 0.2; subsample: 0.7 and 1.0; colsample_bytree: 0.7 and 1.0;
γ: 0 and 1; and min_child_weight: 1 and 3 [
69]. Randomised grid search combined with targeted grid refinement around promising regions was used to identify the best-performing parameter set within each fold. After completing all five LOYO folds, the best hyperparameter sets identified from the training data were aggregated, and the most frequently occurring parameter combination was selected as the final XGBoost configuration.
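The search space and the final aggregation step can be sketched as follows. This is a minimal illustration, not the authors' code: the parameter names follow XGBoost's usual keyword names, the labels `learning_rate` and `gamma` are inferred because the corresponding symbols are not legible in the source, and the per-fold winners are hypothetical placeholders.

```python
from collections import Counter
from itertools import product

SEARCH_SPACE = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7, 10],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],  # inferred label for these values
    "subsample": [0.7, 1.0],
    "colsample_bytree": [0.7, 1.0],
    "gamma": [0, 1],                           # assumption: label not legible
    "min_child_weight": [1, 3],
}


def enumerate_grid(space):
    """Return every parameter combination in the search space."""
    keys = list(space)
    return [dict(zip(keys, combo))
            for combo in product(*(space[k] for k in keys))]


def most_frequent_config(best_per_fold):
    """Pick the configuration selected most often across folds.

    Ties are resolved in favour of the configuration seen first.
    """
    counts = Counter(tuple(sorted(cfg.items())) for cfg in best_per_fold)
    winner, _ = counts.most_common(1)[0]
    return dict(winner)


grid = enumerate_grid(SEARCH_SPACE)

# Hypothetical per-fold winners, for illustration only
config_a = grid[0]
config_b = grid[-1]
final_config = most_frequent_config(
    [config_a, config_b, config_a, config_a, config_b]
)
```

Under these assumptions the full grid contains 768 combinations, which is why a randomised search followed by local refinement is a practical alternative to exhaustive evaluation in every fold.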
2.4.2. Long Short-Term Memory (LSTM)
For crop classification, time-series features derived from optical or radar imagery are commonly structured as sequential input vectors and passed through one or more stacked LSTM layers, followed by a fully connected dense layer that outputs the predicted crop label [
39,
43]. To guide hyperparameter selection and model configuration, we reviewed previous studies that evaluated recurrent deep-learning models for remote-sensing-based crop classification. Sher et al. [
70] systematically tested more than 1000 LSTM hyperparameter combinations, including optimiser type, activation function, batch size, and number of LSTM layers, and showed that LSTM performance is sensitive to hyperparameter choice. Zhao et al. [
41] evaluated several deep-learning models, including LSTM, GRU, LSTM-CNN, and GRU-CNN, for Sentinel-2 crop-type mapping and used empirical and grid-search-based tuning of parameters such as dropout rate and recurrent cell number. Similarly, Durrani et al. [
44] demonstrated that layer number, batch size, and filter configuration can affect recurrent deep-learning performance in crop classification.
Based on these studies, a bidirectional LSTM model was implemented in this study to represent multi-temporal vegetation-index trajectories. The LSTM input tensor had dimensions T × F, where T represents the number of temporal observations within the seasonal window and F represents the number of VI features at each time step. For the April–October growing-season configuration, T = 22 time steps were available at 10-day intervals, spanning from d10 to d31. This sequence-based approach is consistent with previous crop-mapping studies that have demonstrated the value of multi-temporal Sentinel-2 observations for capturing crop growth dynamics [
43].
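The difference between the sequential and tabular representations of the same data can be made concrete with a short sketch. It assumes the eight VIs named in the in-season experiment and the 10-day steps d10 to d31; the column naming scheme (for example `NDVI_d10`) and the dummy record are hypothetical.

```python
VI_NAMES = ["NDVI", "EVI2", "SAVI", "VDVI", "NDRE2", "Vi2", "Vi3", "VCI"]
STEPS = [f"d{k}" for k in range(10, 32)]  # d10 ... d31


def to_sequence(record):
    """Ordered (time step x VI) matrix, as consumed by a sequential learner."""
    return [[record[f"{vi}_{step}"] for vi in VI_NAMES] for step in STEPS]


def to_tabular(record):
    """Flat vector of date-specific predictors, as consumed by a tree learner."""
    return [record[f"{vi}_{step}"] for step in STEPS for vi in VI_NAMES]


# Hypothetical paddock record with a dummy value in every column
record = {f"{vi}_{step}": 0.0 for vi in VI_NAMES for step in STEPS}
sequence = to_sequence(record)
tabular = to_tabular(record)
```

Both functions read exactly the same values; only the arrangement differs, which is why the model comparison isolates the effect of representation rather than of input information.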
The standard sigmoid and hyperbolic tangent activation functions were used internally within the LSTM gates, while ReLU activation was applied in the dense layer. Model training used a weighted categorical cross-entropy loss function with the AdamW optimiser, with a learning rate of and weight decay of . Class weights were calculated from the inverse frequency of each crop class in the training data to reduce the influence of class imbalance and ensure that under-represented classes contributed more strongly to the loss during optimisation. Gradient norms were clipped to 1.0 to stabilise training. Early stopping was applied with a patience of 10 epochs based on macro-averaged F1-score from an internal validation split within the training years, up to a maximum of 100 epochs.
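The inverse-frequency weighting can be illustrated with a small sketch. The normalisation used here (total samples divided by the number of classes times the class count) is one common convention and is an assumption, as is averaging the loss by the sum of sample weights; the labels and probabilities are hypothetical.

```python
import math


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def weighted_cross_entropy(probabilities, labels, weights):
    """Weighted mean of -log p(true class) over all samples."""
    total = sum(weights[y] * -math.log(p[y])
                for p, y in zip(probabilities, labels))
    return total / sum(weights[y] for y in labels)


# Hypothetical imbalanced training labels
labels = ["wheat"] * 6 + ["pasture"] * 3 + ["lupins"] * 1
weights = inverse_frequency_weights(labels)

# Hypothetical uniform predictions for every sample
uniform = {"wheat": 1 / 3, "pasture": 1 / 3, "lupins": 1 / 3}
loss = weighted_cross_entropy([uniform] * len(labels), labels, weights)
```

With this convention the rarest class receives the largest weight, so a misclassified lupin paddock contributes more to the loss than a misclassified wheat paddock.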
Hyperparameter optimisation was performed within the LOYOCV framework using only the training years available in each fold. For each LOYO fold, the held-out year was excluded from hyperparameter selection, early stopping, scaler fitting, and model training, and was used only for final fold-level evaluation. Candidate LSTM configurations were evaluated through grid search using only the four training years in each fold. The evaluated grid search space included hidden dimensions , number of LSTM layers , and batch size , with a fixed dropout rate of 0.3. For each fold, the best-performing hyperparameter configuration (highest macro-F1 on training years) was selected and evaluated on the held-out test year. The final LSTM model used the best-performing hyperparameter configuration identified through the within-fold grid-search optimisation.
2.5. Model Validation and Accuracy Assessment
To ensure model robustness and temporal generalisation, a LOYOCV framework was implemented using growing season data from 2020 to 2024. This temporal validation strategy avoids optimistic bias arising from random within-season splits and is particularly appropriate for this study because spatial and temporal autocorrelation are prevalent in large-scale agricultural datasets. LOYOCV ensures that models are evaluated on temporally independent held-out data, providing a realistic assessment of predictive performance on unseen crop years. For both XGBoost and LSTM, the held-out year was not used for model fitting, hyperparameter optimisation, scaler fitting, early stopping, or model selection. Hyperparameter tuning was conducted only within the training years available in each LOYO fold, using the model-specific procedures described in Sections 2.4.1 and 2.4.2.
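The fold structure can be sketched in a few lines. The per-year sample counts are those reported in Section 2.2; the function and variable names are hypothetical.

```python
# Paddock-level sample counts per year, as reported in Section 2.2
SAMPLES_PER_YEAR = {
    2020: 242_630,
    2021: 247_239,
    2022: 253_926,
    2023: 254_771,
    2024: 249_124,
}


def loyo_folds(years):
    """Yield (training_years, held_out_year) for each year in turn."""
    ordered = sorted(years)
    for held_out in ordered:
        yield [y for y in ordered if y != held_out], held_out


folds = list(loyo_folds(SAMPLES_PER_YEAR))
train_sizes = {
    held_out: sum(SAMPLES_PER_YEAR[y] for y in train_years)
    for train_years, held_out in folds
}
total_samples = sum(SAMPLES_PER_YEAR.values())
```

Each fold trains on four years and tests on the fifth, and the resulting training sizes reproduce the counts listed in Section 2.2.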
Model performance was quantified using overall accuracy, precision, recall, class-specific F1-score, commission error, omission error, and confusion matrices (Table 2) to assess within-class and between-class classification patterns. Because the dataset was imbalanced, with pasture and wheat more prevalent than minor classes such as lupins and fallow, class-specific metrics were emphasised alongside overall accuracy. This allowed model performance to be assessed more appropriately under imbalanced class distributions. After LOYOCV evaluation, final models were retrained using the full multi-year labelled dataset from 2020 to 2024 with the selected hyperparameter configurations. These final models were then evaluated using the independent external field-based test dataset.
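These metrics can all be derived from a confusion matrix, as in the following sketch. It assumes rows hold reference classes and columns hold predicted classes (the orientation in Table 2 may differ), and the three-class matrix is a hypothetical example rather than a result from this study.

```python
def class_metrics(confusion, classes):
    """Per-class metrics; confusion[i][j] = reference class i predicted as j."""
    metrics = {}
    for k, name in enumerate(classes):
        true_pos = confusion[k][k]
        reference_total = sum(confusion[k])
        predicted_total = sum(row[k] for row in confusion)
        recall = true_pos / reference_total if reference_total else 0.0
        precision = true_pos / predicted_total if predicted_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "commission_error": 1.0 - precision,
            "omission_error": 1.0 - recall,
        }
    return metrics


def overall_accuracy(confusion):
    """Share of all samples on the diagonal."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


def macro_f1(metrics):
    """Unweighted mean of class F1-scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


# Hypothetical confusion matrix, for illustration only
classes = ["wheat", "barley", "canola"]
confusion = [
    [80, 15, 5],
    [20, 70, 10],
    [2, 3, 95],
]
per_class = class_metrics(confusion, classes)
accuracy = overall_accuracy(confusion)
macro = macro_f1(per_class)
```

Because macro-F1 averages classes without regard to their size, it penalises poor performance on minor classes that overall accuracy would largely hide.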
In addition to the full-season evaluation, an in-season classification experiment was conducted to assess the feasibility of early crop-type identification. For this experiment, time series of VIs (NDVI, EVI2, SAVI, VDVI, NDRE2, Vi2, Vi3, and VCI) were incrementally truncated at multiple temporal cut-off points, corresponding to key phenological stages between April and October (e.g., early growth, flowering, grain fill). Models were retrained and evaluated at each truncation using the same LOYOCV framework, so that classification accuracy could be tracked through the growing season and the earliest period of reliable crop discrimination could be quantified, while maintaining consistency in model structure, validation strategy, and performance metrics. This validation framework provides an operationally relevant assessment of model performance against the reference dataset, addressing temporal generalisation, class imbalance, and early-season uncertainty.
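The truncation experiment can be sketched as follows. The step labels d10 to d31 follow the 10-day indexing used for the LSTM input; the reliability threshold and the scores are placeholder numbers chosen for illustration and are not results from this study.

```python
STEPS = [f"d{k}" for k in range(10, 32)]  # d10 ... d31


def truncate(sequence, steps, cutoff):
    """Keep observations up to and including the cut-off step."""
    end = steps.index(cutoff) + 1
    return sequence[:end]


def earliest_reliable_cutoff(score_by_cutoff, steps, threshold):
    """First cut-off, in seasonal order, whose score meets the threshold."""
    for step in steps:
        if step in score_by_cutoff and score_by_cutoff[step] >= threshold:
            return step
    return None


# Hypothetical full-season sequence: one row of eight VI values per step
full_season = [[0.0] * 8 for _ in STEPS]
early_season = truncate(full_season, STEPS, "d16")

# Placeholder macro-F1 scores per cut-off, not results from the study
placeholder_scores = {"d16": 0.55, "d19": 0.70, "d22": 0.82, "d25": 0.88}
first_reliable = earliest_reliable_cutoff(placeholder_scores, STEPS, 0.80)
```

Retraining on each truncated series, rather than masking a full-season model, mirrors the operational situation in which later observations do not yet exist.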
3. Results
3.1. Crop Distribution
The distribution of crop types followed consistent patterns throughout the study period (2020–2024). Pasture was the dominant class in all seasons, accounting for 38% of the total paddock area each year (
Table 3). This reflects the prevalence of mixed farming systems across the southwest agricultural region, where pasture plays a key role in crop-livestock rotations and long-term soil management.
The second most prevalent class was wheat, consistently accounting for 27–29% of the annual paddock composition. Its relatively stable spatial extent across seasons makes wheat a stable contributor to overall classification performance. Canola exhibited higher inter-annual variability than the pasture and wheat classes, with its proportional area increasing from approximately 9% in 2020 to a peak of 17% in 2022, before stabilising around 14% in subsequent years. This expansion is consistent with longer-term trends toward increased canola adoption in higher rainfall zones. Barley maintained representation across all seasons, accounting for 10–14% of the paddock area. Lupins and fallow were consistently minor classes, each contributing a small proportion of the dataset (6.2%). The limited representation of lupins and fallow compared with the dominant wheat and pasture classes highlights the class imbalance inherent in large-scale agricultural landscapes and presents a more challenging classification problem than balanced experimental datasets.
The observed class distribution provides important context for interpreting classification results presented in subsequent sections. Dominant classes, such as pasture and wheat, benefit from larger training samples, whereas minority classes, e.g., lupins and fallow, are more sensitive to inter-annual variability and to spectral overlap with other crop types. The use of LOYOCV ensures that model performance reflects temporal generalisation rather than year-specific class distributions.
3.2. Seasonal Vegetation Index (VI) Signature and Phenological Behaviour
Seasonal trajectories of VIs revealed distinct phenological patterns among major crop classes, providing a clear basis for temporal crop discrimination.
Figure 3,
Figure 4 and
Figure 5 show representative seasonal profiles for key VIs, highlighting both inter-crop separability and inter-annual variability across the 2020–2024 growing seasons.
Figure 3 shows the EVI2 seasonal overlay of barley, wheat and canola. Both cereals exhibit a classic bell-shaped growth curve, with the onset of vegetative growth, the ascending phase, beginning in June, followed by a peak around late August (approximately August 20–30), and a decline thereafter toward harvest in late October. Peak EVI2 values for wheat and barley typically ranged between 0.4 and 0.5, which corresponds to partial canopy closure, that is, an intermediate canopy density during heading and early grain fill. Notably, EVI2 values rarely approach 1 in agricultural systems; even dense vegetation usually peaks below ~0.7 [
60]. Both cereal crops have similar curve characteristics, reflecting rapid early growth followed by a symmetric senescence period, corresponding to the tillering, flag leaf emergence, and grain fill stages that strongly influence canopy reflectance.
Figure 4 and
Figure 5 show that the start of the growing season, marked by the initial rise in NDVI and VCI, occurred around mid-May for both wheat and barley, although barley often showed slightly earlier canopy development and marginally higher peak values.
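For readers less familiar with the indices, the sketch below computes NDVI and the widely used two-band EVI2 from red and near-infrared surface reflectance scaled to 0–1. It assumes the standard formulations; the exact definitions used in this study are those given in the methods, and the reflectance values are hypothetical.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalised difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def evi2(nir: float, red: float) -> float:
    """Two-band enhanced vegetation index (standard formulation, assumed here)."""
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)


# Hypothetical reflectances for a moderately dense cereal canopy.
nir_reflectance, red_reflectance = 0.40, 0.08
print(round(ndvi(nir_reflectance, red_reflectance), 3))
print(round(evi2(nir_reflectance, red_reflectance), 3))
```

The constant term in the EVI2 denominator keeps the index well below 1 for realistic reflectances, which is consistent with the cereal peaks of 0.4–0.5 noted above.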
In contrast to the cereals, canola exhibited a broader and earlier reflectance profile. Canopy development began earlier, with peak EVI2 values reaching approximately 0.6 between 20 July and 10 August across seasons. Unlike the cereals, which show a sharp and symmetrical rise and decline, canola displayed an extended plateau associated with prolonged flowering and pod development. Importantly, canola also senesced earlier, with the decline in EVI2 beginning soon after peak flowering, well before senescence was apparent in wheat and barley. This broader peak and earlier onset of senescence produced a distinct temporal signature that differentiated canola from cereal crops, even under varying seasonal conditions (
Figure 3). The NDVI and VCI curves for canola show the start of the season in early May, and the point of maximum curvature before the peak lies between early and mid-July (
Figure 4 and
Figure 5).
Seasonal climate variability between 2020 and 2024 strongly influenced the magnitude and timing of these VI trajectories. For example, cooler and wetter winter conditions in 2021 promoted prolonged canopy greenness, elevating mid-season EVI2 values for cereals (
Figure 3a,b). In contrast, 2023 had below-average winter rainfall across much of the grainbelt, resulting in compressed growing-season curves with lower peak VI values and an earlier onset of senescence. Break-of-season rainfall also played a major role in determining the timing of initial canopy development: years with early, consistent April–May rainfall exhibited an earlier rise in EVI2, NDVI and VCI, whereas late rainfall delayed canopy establishment regardless of crop type.
Inter-annual climatic variability influenced the magnitude and timing of VI trajectories but did not fundamentally alter their relative structure. Wetter seasons, such as 2021, were associated with elevated peak VI values and prolonged periods of high greenness, particularly for wheat, barley and canola (
Figure 3). Drier or more variable seasons showed compressed growth cycles, reduced peak values, and earlier onset of senescence. Despite these shifts, the relative ordering and shape of crop-specific VI profiles remained consistent across years, indicating stable phenological signatures that are robust to moderate seasonal variability (
Figure 3).
In contrast to the major grain crops (the cereals and canola), pasture, lupins, and fallow displayed markedly different VI trajectories. Annual pasture showed an early rise in greenness following autumn rainfall, reaching moderate VI values (0.3–0.4) by May–June and peaking around August with values approaching 0.7. This pattern reflects rapid biomass accumulation in mixed grass–legume systems under favourable soil moisture, followed by gradual senescence during spring. The seasonal VI trajectories of lupins were similar in shape to those of the cereal crops, showing a rapid increase in VI magnitude and a well-defined peak, but with a delayed onset and greater seasonal amplitude (
Figure 4 and
Figure 5). This indicates slower establishment but denser canopies later in winter for lupins, as demonstrated by peak VI values that frequently exceeded those of cereals, reaching 0.75–0.85 in mid to late August before declining toward maturity. Fallow paddocks maintained consistently low VI values throughout the season, typically below 0.3, reflecting limited vegetation growth.
3.3. In-Season Classification Performance and Temporal Sensitivity
Figure 6 and
Table 4 report LOYOCV accuracy metrics, representing agreement with the reference dataset, for the LSTM and XGBoost models. Classification performance increased consistently as longer segments of the growing season were incorporated into the LSTM model, demonstrating strong temporal sensitivity as shown in
Figure 6. Across most years, F1-scores and overall accuracy (
Table 4) improved from early-season (April–July) to full-season (April–October) inputs, reflecting the increasing availability of phenological information as crops progressed toward maturity. While the general trend indicates improved performance with longer observation windows, some years exhibited plateauing or slight declines at the end of the season, highlighting inter-annual variability in temporal signal quality.
Using the full growing season (April–October), both XGBoost and LSTM models achieved high classification accuracy, with overall accuracies exceeding 90% across the LOYOCV experiments (
Table 4). XGBoost consistently achieved higher accuracy than LSTM, particularly under truncated early-season conditions (
Figure 6,
Table 4), indicating robust predictions when limited temporal information was available. The performance of XGBoost relies on the summary features derived from the VI time series, which tend to be more robust to noise, missing observations, and irregular temporal sampling. In contrast, LSTM models depend on learning temporal patterns directly from sequential inputs and may be more sensitive to architectural choices, limited hyperparameter tuning, class imbalance or noise in training data. LSTM performance improved as additional time steps were included, reflecting its capacity to exploit longer sequential phenological patterns. Additionally, the use of paddock-level averaging may reduce within-field temporal variability, thereby diminishing the advantage of sequence-based models [
71].
The temporal truncation experiments showed a clear inflection point in classification performance during late winter. When VI time series were truncated at the end of July, both models maintained reasonable accuracy, with overall accuracy exceeding 80%; however, substantial improvements were observed when data through August were included, with accuracy increasing to approximately 90% for XGBoost and 88% for LSTM. Beyond this point, gains from additional late-season data diminished, suggesting that key information was already captured by late August to early September.
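A truncation experiment of this kind can be sketched as a simple feature filter. The example assumes features follow the VI_dXX naming used in the SHAP outputs; the helper `truncate_features`, the example feature names, and the cutoff index are hypothetical, and the window that corresponds to a given calendar date should be taken from Table S1.

```python
import re
from typing import List

# Matches names such as "ndvi_d10": a vegetation index and a window number.
FEATURE_PATTERN = re.compile(r"^(?P<vi>[A-Za-z0-9]+)_d(?P<window>\d+)$")


def truncate_features(names: List[str], last_window: int) -> List[str]:
    """Keep only features whose composite window is at or before last_window."""
    kept = []
    for name in names:
        match = FEATURE_PATTERN.match(name)
        if match and int(match.group("window")) <= last_window:
            kept.append(name)
    return kept


# Hypothetical feature list and cutoff (illustrative only).
all_features = ["ndvi_d10", "evi2_d21", "evi2_d25", "vci_d30"]
early_features = truncate_features(all_features, last_window=21)
print(early_features)
```

Retraining the same classifier on each truncated feature set, within the same cross-validation folds, isolates the effect of observation length from other sources of variation.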
Inter-annual variability affected early-season model performance, particularly when limited observations and reduced spectral separability constrained crop discrimination. Lower early-season accuracy was observed across multiple years (e.g., 2020, 2021, and 2024;
Figure 6), indicating that this pattern is not solely attributable to a single season but reflects a broader limitation of early phenological stages, where crops exhibit similar spectral characteristics. In some seasons, such as 2021, persistent winter cloud cover likely further reduced temporal data density and contributed to lower performance during early to mid-season periods. However, once observations from August onwards were included, when cloud frequency typically decreases and phenological divergence between crop types increases, classification performance improved and became more consistent across years.
Overall, the observed error patterns are consistent with known limitations of multispectral VIs for separating crop types with similar growth habits. The dominance of wheat–barley confusion highlights the challenge of separating cereal crops using optical indices alone, particularly during early to mid-season stages. These results underscore the importance of incorporating phenologically informative temporal windows and motivate the use of explainable analyses to identify when crop separability is maximised, as explored in the subsequent section.
3.3.1. Crop Specific Errors and Confusion Patterns
Class-specific accuracy metrics and confusion matrices based on the reference-labelled dataset suggested systematic patterns of misclassification that are consistent with the observed phenological overlap among crop types (
Figure 7 and
Figure 8). While overall classification performance was high, errors were not evenly distributed across classes, with the highest uncertainty observed among spectrally and phenologically similar crops.
Canola consistently exhibited the lowest commission and omission errors across all temporal windows (
Figure 7). When the full growing season was used, both models achieved commission and omission errors below 5% for canola. Even under a truncated observation period (April–July), canola errors remained below 13%, indicating strong separability from other crop types early in the season. These results suggest that canola’s distinct phenological pattern allows for early-season classification with minimal confusion. This correlates with its characteristic early and broad VI peaks, which produce a clear spectral signature (i.e., earlier canopy closure, higher peak VI values, and an extended flowering period) that sets it apart from cereals even before full maturity.
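Commission and omission errors follow directly from a confusion matrix. The sketch below assumes rows hold reference classes and columns hold predicted classes; the counts are invented for illustration and are not the study's results.

```python
from typing import Dict, List


def commission_omission(matrix: List[List[int]], classes: List[str]) -> Dict[str, Dict[str, float]]:
    """Per-class omission (1 - recall) and commission (1 - precision) errors."""
    n = len(classes)
    errors = {}
    for k, name in enumerate(classes):
        correct = matrix[k][k]
        reference_total = sum(matrix[k])                        # row sum
        predicted_total = sum(matrix[r][k] for r in range(n))   # column sum
        omission = 1 - correct / reference_total if reference_total else 0.0
        commission = 1 - correct / predicted_total if predicted_total else 0.0
        errors[name] = {"omission": omission, "commission": commission}
    return errors


# Hypothetical counts (rows = reference, columns = predicted).
crop_classes = ["canola", "wheat", "barley"]
counts = [
    [95, 3, 2],
    [2, 85, 13],
    [3, 30, 67],
]
crop_errors = commission_omission(counts, crop_classes)
for crop, err in crop_errors.items():
    print(crop, round(err["omission"], 3), round(err["commission"], 3))
```

In this invented example, barley paddocks labelled as wheat raise barley's omission error and wheat's commission error at the same time, which is the pattern described for the cereal classes.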
In contrast, the errors in wheat classification show a stronger dependence on the length of the temporal input data. For the LSTM model, total error rates increased from approximately 10% using full season data to nearly 19% when only data up to July were included. Omission errors dominated this increase, indicating that wheat paddocks were more frequently misclassified as other crops under early-season conditions. XGBoost exhibited higher temporal stability, maintaining total error rates of approximately 10% across most observation periods, although confusion with barley remained evident. These results suggest that wheat identification relies heavily on late-season phenological features associated with grain filling and early senescence.
Figure 8 shows the confusion matrices from the XGBoost model for different observation periods. Because the overall accuracy of XGBoost was higher than that of LSTM, the confusion matrix results are presented for this best-performing model. The barley class showed the highest classification uncertainty across both models. The confusion matrices indicate that approximately 30–40% of barley paddocks were misclassified as wheat, particularly when training data were truncated before August (
Figure 8). The confusion persisted at reduced levels (i.e., 0.90 precision) even when full season data were available. The high degree of misclassification reflects the overlap in canopy structure, chlorophyll dynamics, and phenology between wheat and barley, as demonstrated by their closely aligned VI trajectories (
Section 3.2/
Figure 8). Additional confusion was observed between barley and annual pasture during mid-season, when pasture reached peak reflectance and exhibited VI values comparable to sparse or early-senescing cereal canopies.
Minor classes displayed contrasting classification outcomes. Lupins were generally well classified once sufficient mid- to late-season data were available, achieving precision values exceeding 90% by August. This performance reflects lupins’ delayed onset and higher peak VI values relative to cereals. Pasture exhibited high precision and recall throughout most of the season (0.95 and 0.97), although limited confusion with cereals occurred during early establishment and late senescence phases (
Figure 4 and
Figure 5). Fallow paddocks did not exhibit strong seasonal VI peaks and showed lower amplitude across the season, making them largely separable from other classes. Misclassification of fallow mainly occurred when pasture or cereal fields were sparsely vegetated, reducing spectral contrast between classes. This is supported by observed confusion patterns (
Figure 8) and is consistent with known limitations of VI-based classification, where low fractional vegetation cover and poor crop establishment can lead to spectral similarity between bare soil, stressed crops, and fallow conditions [
72].
Overall, the observed error patterns are consistent with known limitations of multispectral VIs for discriminating crop types with similar growth habits. The dominance of wheat–barley confusion highlights the challenge of separating cereal crops using optical indices alone, particularly during early to mid-season stages. These results point to the importance of incorporating phenologically informative temporal windows and motivate the use of explainable analyses to identify when crop separability is maximised.
The confusion matrix from the model further illustrates how temporal input length affects class separability. In October, when the full signal was available, canola achieved a high precision of 98%, demonstrating the model’s ability to identify canola with minimal false positives. Even with limited data available up to July, canola precision remained above 93%, confirming that canola can be accurately classified as early as mid-season. Wheat and barley showed greater inter-class confusion early on; misclassification rates decreased only after post-anthesis and senescence features were included in the data. A similar study was conducted over a region in Victoria, Australia, by Nguyen et al. [
73], who reported the performance of an LSTM model trained over early season (April–June), mid-season (April–October) and full season (April–December), with overall accuracies of 80%, 91% and 93%, respectively. Compared with our model, the accuracy of canola, wheat, and barley reported by Nguyen et al. [
73] for the April–October window was lower (88%, 86%, and 83%, respectively), and declined further for early-season classification (79%, 69%, and 65%). These differences may reflect variations in class distribution, environmental conditions, and crop phenology between the study regions, as well as differences in data sources and preprocessing approaches.
Lupins were also well classified in August, with a precision of 93%. In both models, pasture and fallow exhibited contrasting classification behaviours, with a precision of 95% and 85% in August, respectively. Minor confusion mainly involved wheat and pasture, especially early in the season when pasture greenness overlapped with emerging cereals, and later when pasture’s VI values resembled those of sparsely vegetated surfaces. Fallow had lower precision at 85% but consistently showed a low VI signal (below 0.3, according to
Figure 4 and
Figure 5) throughout the season, clearly distinguishing it from active land uses. Overall, pasture can be reliably detected throughout most of the season, while fallow, despite its spectral distinctness, can occasionally be confused with degraded or senescing vegetation under variable growing conditions.
3.3.2. Explainable AI Analysis of Temporal Importance
To interpret the drivers of model performance and identify phenologically critical periods for crop discrimination, SHAP analysis was conducted on the XGBoost model trained across different temporal windows. The SHAP score measures the contribution of each feature, VI, and time interval to the model’s prediction, allowing direct interpretation of when and why particular features influence crop classification outcomes. A positive SHAP value indicates that the feature increases the model’s likelihood of predicting the target crop class, thereby supporting that class. Conversely, a negative SHAP value indicates that the feature decreases the likelihood of that crop class and may favour a different class. The colour represents the feature value, that is, whether a paddock exhibits a high or low value for the specific VI (
Figure 9). Feature names reported in SHAP outputs follow the form VI_dXX, where VI denotes the vegetation index and dXX denotes the sequential 10-day composite window within the annual series retained for analysis (e.g., ndvi_d10 = NDVI for the tenth 10-day window, corresponding approximately to early April after exclusion of January–March and November–December windows) (
Table S1).
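The window index can be converted to approximate calendar dates with a short sketch. It assumes that windows are consecutive 10-day blocks counted from 1 January, which matches the early-April example above; Table S1 remains the authoritative mapping, and the function name `window_dates` and the default year are hypothetical.

```python
from datetime import date, timedelta
from typing import Tuple


def window_dates(window: int, year: int = 2023, length_days: int = 10) -> Tuple[date, date]:
    """Return the first and last day of a 1-based composite window."""
    # Assumption: window 1 starts on 1 January and windows do not overlap.
    start = date(year, 1, 1) + timedelta(days=(window - 1) * length_days)
    end = start + timedelta(days=length_days - 1)
    return start, end


d10_start, d10_end = window_dates(10)
d25_start, d25_end = window_dates(25)
print("d10:", d10_start.isoformat(), "to", d10_end.isoformat())
print("d25:", d25_start.isoformat(), "to", d25_end.isoformat())
```

Under this assumption, d25 spans the end of August and the first week of September in a non-leap year.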
The SHAP analysis showed a strong concentration of feature importance in late August to early September (d25;
Table S1). Across all temporal configurations that included this window, VIs centred on late August ranked as the most influential predictors, with EVI2, Vi3 and VCI at d25 consistently ranked as the top three features (
Figure 9). When models were trained using data truncated before August (i.e., up to July), the contribution of these features disappeared, and overall classification accuracy declined substantially (0.88 to 0.82,
Table 3). SHAP values for earlier-season indices were more diffuse and of lower magnitude, indicating weaker and less consistent discriminatory power during early vegetative stages. Conversely, features derived from late September and October exhibited negligible SHAP values, suggesting that post-maturity vegetation signals contribute little additional information for crop differentiation once senescence was well underway.
The late-August and early September window coincides with a phenologically important stage in the southwest agricultural region. Winter cereals typically reach peak canopy development and early senescence begins, while canola and lupins exhibit divergent canopy trajectories. During this period, wheat and barley maintain high but subtly declining greenness, while canola displays partial senescence associated with flowering completion and pod development, and lupins sustain high canopy vigour. These contrasting dynamics create maximum inter-class spectral and structural divergence, which the model exploits for classification (
Figure 3,
Figure 4 and
Figure 5). Overall, the SHAP analysis establishes late August to early September as the most informative period for in-season crop classification.
3.4. External Test Data
To distinguish reference-product agreement from field-verified crop identification accuracy, external validation was conducted using an independent field-based test dataset derived from crop disease-surveillance records collected between 2020 and 2024. This dataset was not used during model training and provides an independent evaluation of model performance under real-world conditions, including variation in climate, management practices, and spatial distribution of crop types. The classification model was validated for only three major crop types (wheat, barley and canola) because of data limitations for the other crop types.
Using VI data truncated at the end of August, the XGBoost classification model achieved consistently strong agreement with the external test data. Canola was classified with high accuracy, achieving an F1-score of 0.91, with balanced precision (0.93) and recall (0.90). Wheat also demonstrated robust performance, with an F1-score of 0.85, reflecting reliable detection across seasons despite residual confusion with barley. Barley exhibited lower performance relative to other crops, with an F1-score of 0.65, driven primarily by misclassification as wheat. These results are consistent with the error patterns observed in the cross-validation experiments and reflect the inherent spectral and phenological similarity between the two cereal crops.
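The F1-score is the harmonic mean of precision and recall, so the reported canola values can be checked directly. This is a minimal sketch; the function name `f1_score` is illustrative and does not refer to any particular library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Canola precision and recall as reported for the external test data.
canola_f1 = f1_score(0.93, 0.90)
print(round(canola_f1, 2))
```

Because the harmonic mean is pulled toward the lower of the two values, a class with a large gap between precision and recall scores well below the arithmetic mean of the two.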
The confusion matrix for the external test set highlights that most classification errors occurred among cereal classes, while confusion between broadleaf crops and cereals was minimal (
Table 5). Canola paddocks were rarely misclassified as wheat or barley, reinforcing the stability of its phenological signature across seasons. The persistence of wheat–barley confusion in the independent test set indicates that this limitation is systematic rather than specific to the dataset in this study, further supporting the interpretation that optical VIs alone provide limited separability for these crops during certain growth stages.
Overall, the external validation showed strong agreement between the model-derived classifications and the observed field records across multiple seasons and variable climatic conditions. Canola achieved the highest and most stable accuracy, reflecting its distinct phenological trajectory and clear separability across years. Wheat also performed well, although its accuracy was slightly lower than that of canola because of residual confusion with barley, consistent with the overlapping spectral and phenological characteristics observed during mid-season. Barley showed comparatively lower agreement with field records, again reflecting systematic confusion with wheat rather than year-specific effects. The results support the temporal patterns observed in the LOYOCV analysis for the three major crop types.
4. Discussion
This study demonstrates that accurate, in-season crop classification across a large and heterogeneous agricultural region is achievable using open-access Sentinel-2 data, multi-temporal VIs, and transparent machine learning methods. Integrating phenology-aware modelling with explainable AI delivers high classification accuracy and provides clear insight into when and why crop separability is maximised. The results show that late winter, specifically late August to early September, represents the most informative temporal window for reliable in-season crop discrimination in the southwest agricultural region of WA.
4.1. Phenological Timing as the Primary Driver of In-Season Accuracy
The strong dependence of classification performance on temporal coverage highlights the central role of crop phenology in remote sensing-based crop mapping. Both classification approaches exhibited steady improvements in accuracy as additional time windows of the growing season were included, with a marked inflection point once August data were incorporated. This finding aligns with established literature indicating that mid- to late-season phenological stages contain the greatest spectral separability among crops, as differences in canopy structure, chlorophyll dynamics, and senescence processes become more pronounced [
74,
75].
The SHAP analysis provides a mechanistic explanation for this temporal sensitivity, demonstrating that VIs centred on late August dominate model predictions. This period corresponds to peak canopy development for winter cereals and the onset of divergent senescence trajectories among wheat, barley, canola, and lupins. The results also point to the limited contribution of post-maturity signals, indicating that extending classification into late spring yields diminishing returns and reinforcing the value of targeting phenologically informative windows rather than maximising data volume.
The high accuracies achieved by the October models (90–93%) indicate strong agreement between the predicted crop classes and the reference crop labels under the LOYOCV framework, supporting the temporal transferability of the classification models across different growing seasons. The performance decline observed in the 2021 season highlights the sensitivity of spectral signatures to extreme climatic conditions, where excessive rainfall likely altered the expected phenological reflectance or impacted satellite data availability due to cloud cover [
76]. Despite this, the ability of both models to achieve approximately 90% accuracy by August suggests that in-season classification is viable for most crop types well before harvest.
Together, these results identify late August to early September as the most stable and phenologically informative period for in-season crop discrimination in the southwest agricultural region of WA. This window captures maximum divergence in crop growth trajectories while still providing sufficient lead time for operational applications such as yield forecasting, disease surveillance, and regional production assessment.
4.2. Crop-Specific Performance and Sources of Uncertainty
Classification performance varies systematically among crop types, reflecting known agronomic and phenological similarities. Canola was consistently well classified across all temporal windows, including early- to mid-season periods, due to its distinctive growth pattern, earlier canopy closure, and extended flowering phase [
7,
77]. This robustness suggests strong potential for early in-season detection of canola, which is particularly valuable for yield forecasting, disease surveillance, and market intelligence [
73,
78,
79].
In contrast, wheat and barley exhibited persistent confusion across models and validation scenarios. This limitation reflects the substantial overlap in their canopy architecture, growth timing, and spectral signatures, especially during the vegetative and early reproductive stages. The persistence of wheat–barley confusion under independent external validation confirms that this challenge is structural rather than bound to the study dataset. Similar limitations have been widely reported in optical remote sensing studies and highlight the constraints of relying solely on multispectral VIs for cereal discrimination [
4,
80]. Addressing this challenge will likely require integrating additional information sources, such as shortwave infrared bands, radar backscatter, or ancillary agronomic data.
Minor classes such as lupins, pasture, and fallow displayed more variable but generally interpretable performance. Lupins benefited from their delayed phenology and higher peak VI values, enabling reliable classification once sufficient mid-season data were available. Pasture and fallow exhibited distinct temporal signatures but occasionally overlapped with cropped paddocks under conditions of sparse canopy cover or stress, underscoring the influence of seasonal variability on class separability. Overall, crop-specific results indicate that the main residual errors were associated with spectrally and phenologically similar cereal crops and with seasonal variability affecting minor or more heterogeneous classes.
4.3. Implications for Operational Crop Mapping
The identification of a stable, phenologically meaningful in-season window has direct implications for operational agricultural monitoring. The ability to generate reliable crop maps by late August provides a practical lead time for applications such as yield forecasting, biosecurity monitoring, and regional production assessments. Importantly, the consistency of this window across contrasting seasonal conditions suggests that the framework is resilient to inter-annual climate variability, a critical requirement for deployment in Mediterranean cropping systems such as the southwest of WA.
Beyond immediate seasonal applications, reliable crop type mapping enables the development of longer-term temporal records of cropping patterns [
35]. Multi-year crop maps can support monitoring of crop rotations, changes in cropping intensity, and the expansion or contraction of certain crop types across regions [
15]. The availability of consistent crop type information across multiple seasons further supports risk management and policy development, including monitoring shifts in land use, identifying emerging production zones, and assessing regional exposure to pests, diseases, or extreme climate events. For example, multi-year crop distribution datasets can help track host crop availability for disease modelling, support biosecurity surveillance, and improve the targeting of extension and advisory services.
The use of explainable AI further supports the operational relevance of the framework. The approach supports transparency, validation, and stakeholder confidence by explicitly linking model predictions to specific VIs and time periods. This interpretability is particularly important in public-sector and research contexts, where black-box models can limit trust and uptake despite strong predictive performance [
81,
82].
4.4. Limitations and Future Directions
Several limitations should be acknowledged, and each points to a clear direction for future work. First, the reliance on optical VIs may constrain the separability of crops with similar phenological behaviour, most notably wheat and barley. Furthermore, the use of 10-day median composites with linear interpolation may smooth short-term phenological variation in the reconstructed time series. Future work could address these issues by incorporating complementary data sources such as Sentinel-1 synthetic aperture radar (SAR), shortwave infrared bands, soil type, and thermal time accumulation. Sentinel-1 SAR would be particularly useful because it is less affected by cloud cover than optical imagery and can provide additional information on crop structure and moisture conditions, thereby reducing reliance on optical gap-filling while improving discrimination among crops with similar spectral and phenological behaviour. Future studies should also assess the sensitivity of classification performance to alternative gap-filling and time-series reconstruction approaches, including different compositing windows, non-interpolated time series, Savitzky–Golay filtering [
83], and harmonic smoothing [
84].
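The smoothing effect of linear gap-filling can be seen in a minimal sketch. It assumes that missing composites are marked as `None` and that only interior gaps are filled; the function name `fill_gaps_linear` and the VI values are hypothetical and do not reproduce the study's preprocessing code.

```python
from typing import List, Optional


def fill_gaps_linear(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps by linear interpolation between known neighbours."""
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    # Leading and trailing gaps have no second neighbour and stay as None.
    return filled


# Hypothetical 10-day composite values with cloud-related gaps.
composites = [0.2, None, None, 0.5, None]
reconstructed = fill_gaps_linear(composites)
print(reconstructed)
```

Any short-lived feature that falls inside a gap, such as a brief flowering signal, is replaced by a straight line, which is why the sensitivity to alternative reconstruction methods deserves testing.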
Second, supervised training and validation were conducted using an operational crop-mapping product rather than field-level ground observations, introducing potential uncertainty in class labels. While this approach is pragmatic for large-scale studies and was mitigated through multi-year cross-validation and external accuracy testing, it remains a source of uncertainty. Future work should therefore prioritise expanded field-based validation datasets across additional crops, seasons, and regions to better quantify label uncertainty and strengthen confidence in crop-specific accuracy estimates. Importantly, although the training labels are licensed, the satellite data, feature-generation workflow, and modelling framework are based entirely on open-access data and tools.
Future work could also explore newer representation-learning products such as Geospatial Embedding Models (AlphaEarth and BetaEarth), which provide a unified representation of the terrestrial land surface by integrating diverse data sources, including optical imagery, radar, terrain, and climate information [
85,
86]. In practical terms, these embeddings could be extracted and aggregated to paddock scale and combined with the existing vegetation-index features before classification. Their expected benefit would be to provide additional landscape, structural, and environmental context that may improve discrimination among spectrally similar crops, particularly wheat and barley, and support more robust mapping of under-represented classes. However, their use should be evaluated carefully through staged experiments, such as VIs only and VIs plus embeddings, because embedding dimensions are not directly interpretable and annual embedding products may need to be assessed for compatibility with in-season prediction.
From a modelling perspective, future work could improve performance for under-represented crops such as lupins and fallow by combining class-imbalance mitigation strategies with more advanced temporal modelling approaches. For example, targeted sampling, class weighting, or additional field observations could be used to improve representation of minority classes, while recurrent and sequence-learning architectures such as BiLSTM, attention-based LSTM, temporal masking, spatial–temporal integration, and broader tuning of hidden units, layer depth, dropout, and learning rate could improve the model’s ability to capture complex crop phenological trajectories under variable seasonal conditions. In this study, the LSTM model was used as a literature-guided temporal deep-learning baseline, and these extensions provide a clear pathway for further model development.
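As one illustration of class weighting, the sketch below computes inverse-frequency weights so that minority classes contribute more to the training loss. The "balanced" heuristic shown here (total samples divided by the number of classes times the class count) is one common choice and is an assumption, not the study's method; the label counts are invented.

```python
from collections import Counter
from typing import Dict, List


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {name: n_samples / (n_classes * count) for name, count in counts.items()}


# Hypothetical paddock labels with a dominant and a minority class.
paddock_labels = ["pasture"] * 6 + ["wheat"] * 3 + ["lupins"] * 1
weights = balanced_class_weights(paddock_labels)
for crop, weight in sorted(weights.items()):
    print(crop, round(weight, 3))
```

Weights of this kind can be passed as per-sample weights to most classifiers, although their effect on minority-class recall and majority-class precision would need to be evaluated within the same LOYOCV framework.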
Extending the framework across additional growing seasons and regions would further strengthen generalisability and support the development of nationally consistent in-season crop-mapping products. Incorporating spatial stratification or climate-zone-specific parameterisation by calibrating models separately for the northern, central, and southern grainbelt regions of WA could also enable prediction windows to adjust across climatic gradients, improving regional applicability. While this study focuses on the southwest agricultural region of WA, the findings are relevant to other Mediterranean and temperate cropping systems characterised by winter-dominant rainfall and broadacre cereal production. The emphasis on phenology-aware modelling, temporal validation, and explainability provides a transferable template for developing transparent, in-season crop monitoring systems using open-access satellite data.
Looking ahead, the insights from this study suggest clear opportunities for developing an in-season crop monitoring pipeline across a larger extent of southwest WA. Since canola accuracy remains high even in early growth stages, future applications could focus on translating these models into real-time, near-real-time, or early-season mapping tools for operational decision making. Such systems could support disease surveillance, nutrient management, and harvest planning, where early and accurate detection of crop type enables timely intervention. By coupling explainable AI models with continuous Earth Observation data streams, future research can advance toward dynamic in-season crop mapping and improve regional-scale agronomic decision-support systems for public use.