Machine Learning in Climate Downscaling: A Critical Review of Methodologies, Physical Consistency, and Operational Applications
Abstract
1. Introduction: The Imperative for High-Resolution Climate Projections and the Rise of Machine Learning
1.1. Positioning This Review in the Literature
1. Creating a novel taxonomy mapping model families (CNNs, GANs, Transformers, Diffusion Models) to the downscaling challenges they address.
2. Conducting a critical analysis of the “performance paradox” and its implications for robustness under non-stationary climate change.
3. Proposing a practical evaluation protocol and research priorities to guide the community toward physically consistent, trustworthy, and operationally viable models.
1.2. Overview of the Review’s Scope and Objectives
- RQ1 (Evolution of Methodologies): How have ML approaches for climate downscaling evolved from classical algorithms to the current deep learning architectures, and what are the primary capabilities and intended applications of each major model class?
- RQ2 (Persistent Challenges): What are the critical, cross-cutting challenges that limit the operational reliability of contemporary ML downscaling models, particularly regarding their physical consistency, generalization under non-stationary climate conditions, and overall trustworthiness?
- RQ3 (Emerging Solutions and Future Trajectories): Which methodological frontiers, including physics-informed learning (PIML), robust uncertainty quantification (UQ), and explainable AI (XAI), show the most promise for addressing these challenges and guiding future research?
2. Review Methodology
2.1. Search Strategy and Data Sources
2.2. Inclusion and Exclusion Criteria
3. Background: The Downscaling Problem
3.1. The Scale Gap in Climate Modeling and the Need for Downscaling
3.2. Sectoral Implications of the Scale Gap
3.3. Limitations of Traditional Downscaling Methods
The Fidelity–Cost Trilemma
3.4. Emergence and Promise of ML in Transforming Statistical Downscaling
4. The Evolution of Machine Learning Approaches in Climate Downscaling
4.1. Early Applications and Classical ML Benchmarks
4.2. The Deep Learning Paradigm Shift
4.2.1. Pioneering Work with Convolutional Neural Networks (CNNs)
4.2.2. Architectural Innovations
U-Nets
Residual Networks (ResNets)
Generative Adversarial Networks (GANs)
Diffusion Models
Transformers
Operational Constraints and Deployability

5. The Physical Frontier: Hybrid and Physics-Informed Downscaling
5.1. The Imperative for Physical Consistency
5.2. Architectural Integration of Physical Laws: PIML
- Soft Constraints: This is the most common approach, in which the standard data-fidelity loss term ($\mathcal{L}_{\text{data}}$) is augmented with a physics-based penalty term ($\mathcal{L}_{\text{phys}}$) [66]. The total loss becomes $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \mathcal{L}_{\text{phys}}$, where $\lambda$ is a weighting hyperparameter. $\mathcal{L}_{\text{phys}}$ is formulated as the residual of a governing differential equation (e.g., the continuity equation for mass conservation). By minimizing this residual across the domain, the network is encouraged, but not guaranteed, to find a physically consistent solution. This method is flexible and has been used to penalize violations of conservation laws [25] and to solve complex PDEs [26]. A common example is enforcing mass conservation in precipitation downscaling. If $x$ is the value of a single coarse-resolution input pixel and $y_1, \dots, y_n$ are the $n$ corresponding high-resolution output pixels from the neural network, a soft constraint can be added to the loss function to penalize deviations from the conservation of mass: the high-resolution pixels, aggregated over the patch, should reproduce the value of the corresponding coarse pixel. With the data-fidelity term taken as the Mean Squared Error ($\mathcal{L}_{\text{MSE}}$), the total loss becomes a weighted sum of the two terms:
  $$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}} + \lambda \left( \frac{1}{n} \sum_{i=1}^{n} y_i - x \right)^{2}$$
  where $\lambda$ controls the strength of the physical penalty. Minimizing this loss encourages, but does not guarantee, that the mean of the high-resolution patch matches the coarse-resolution value.
- Hard Constraints (Constrained Architectures): This approach modifies the neural network architecture itself so that selected physical laws hold by design, often through a specialized, non-trainable output layer. For example, Harder et al. [27] introduced output layers that guarantee mass conservation by ensuring that the high-resolution output pixels aggregate exactly to the value of the coarse-resolution input pixel. Such methods provide an absolute guarantee of physical consistency for the constrained property, which can improve both performance and generalization. While more difficult to design and potentially less flexible than soft constraints, they represent a more robust method for embedding inviolable physical principles [27]. Continuing the mass conservation example, let $x$ be the coarse-pixel value and $\tilde{y}_1, \dots, \tilde{y}_n$ the raw, unconstrained outputs from the final hidden layer of the network. A multiplicative constraint layer can be designed to produce the final, constrained outputs $y_i$ that are guaranteed to conserve mass:
  $$y_i = \tilde{y}_i \cdot \frac{x}{\frac{1}{n} \sum_{j=1}^{n} \tilde{y}_j}$$
  This layer rescales the raw outputs so that their sum is precisely equal to $n x$ (equivalently, their mean equals $x$), thereby strictly enforcing the conservation law at every forward pass, without the need for a penalty term in the loss function.
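The contrast between the two constraint types can be illustrated with a minimal sketch in plain Python. It assumes the convention in which the coarse value is the mean of its high-resolution patch; the function names, the default weight `lam`, the guard `eps`, and the sample numbers are all hypothetical and are not taken from the cited works. A real implementation would express the same arithmetic as differentiable tensor operations inside a deep learning framework.

```python
def soft_conservation_loss(y_pred, y_true, x_coarse, lam=0.1):
    """MSE plus a penalty on the gap between the patch mean and the coarse value."""
    n = len(y_pred)
    mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n
    penalty = (sum(y_pred) / n - x_coarse) ** 2
    return mse + lam * penalty


def multiplicative_constraint(y_raw, x_coarse, eps=1e-12):
    """Rescale raw outputs so that their mean equals x_coarse.

    Assumes non-negative raw outputs with a positive mean (e.g., after a
    ReLU or softplus); an all-zero patch cannot be rescaled to a wet one.
    """
    mean_raw = sum(y_raw) / len(y_raw)
    scale = x_coarse / max(mean_raw, eps)
    return [v * scale for v in y_raw]


raw = [1.0, 3.0, 0.0, 4.0]   # hypothetical raw outputs for one 2x2 patch
x = 3.0                      # hypothetical coarse-pixel value (patch mean)

constrained = multiplicative_constraint(raw, x)
penalised = soft_conservation_loss(raw, raw, x)  # data term is zero here
```

In this example the raw patch has a mean of 2.0, so the soft penalty is non-zero even though the data term vanishes, whereas the rescaled patch has a mean of exactly 3.0 regardless of how the network was trained.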
5.3. Hybrid Frameworks: Merging Dynamical and Statistical Strengths
1. An initial, computationally inexpensive dynamical downscaling step using an RCM to bring coarse ESM output to an intermediate resolution (e.g., from 100 km to 45 km). This step grounds the output in a physically consistent dynamical state.
2. A subsequent generative ML step, using a conditional diffusion model, to perform the final super-resolution to the target scale (e.g., from 45 km to 9 km). The diffusion model learns to add realistic, high-frequency spatial details.
5.4. Enforcing Physical Realism in Practice
5.4.1. The Frontier of Physics-Informed Machine Learning (PIML)
The Promise of Physics–ML Integration
Implementation Approaches for PIML
- Hard constraints: Modify the architecture or add constraint layers to strictly satisfy selected laws [27], e.g., enforcing that total downscaled precipitation matches the coarse-grid input (water mass conservation). Pros: guarantees consistency for enforced laws. Cons: harder to design and can reduce flexibility if constraints are overly restrictive or misspecified.
- Soft constraints via loss functions: Add penalty terms for physical-law violations to the training objective [26]. Pros: flexible and can incorporate multiple principles, including complex non-linear PDEs. Cons: encourages but does not guarantee satisfaction; performance depends on tuning the penalty weight $\lambda$.
- Hybrid statistical–dynamical models: Combine ML with components of dynamical models [8], using ML to emulate expensive parameterizations within an RCM or to learn corrective terms for RCM biases, thereby leveraging the physical basis of dynamical components.

Case Studies and Results
6. Data, Variables, and Preprocessing Strategies in ML-Based Downscaling
6.1. Common Predictor Datasets (Low-Resolution Inputs)
- ERA5 Reanalysis: The fifth-generation ECMWF atmospheric reanalysis, ERA5, is extensively used as a source of predictor variables, particularly for training models in a “perfect-prognosis” framework [71,72]. ERA5 provides a globally complete and consistent, high-resolution (relative to GCMs, typically 31 km or 0.25°) gridded dataset of many atmospheric, land-surface, and oceanic variables from 1940 onwards, assimilating a vast number of historical observations. Its physical consistency and observational constraint make it an ideal training ground for ML models to learn relationships between large-scale atmospheric states and local climate variables. Often, models trained on ERA5 are subsequently applied to downscale GCM projections.
- CMIP5/CMIP6 GCM Outputs: Outputs from the Coupled Model Intercomparison Project Phase 5 (CMIP5) and Phase 6 (CMIP6) GCMs are indispensable when the objective is to downscale future climate projections under various emission scenarios (e.g., Representative Concentration Pathways—RCPs, or Shared Socioeconomic Pathways—SSPs). These GCMs provide the large-scale atmospheric forcing necessary for projecting future climate change. However, their coarse resolution and inherent biases necessitate downscaling and often bias correction before their outputs can be used for regional impact studies [10,72].
- CORDEX RCM Outputs: Data from the Coordinated Regional Climate Downscaling Experiment (CORDEX) are also used, particularly when ML techniques are employed for further statistical refinement of RCM outputs, as RCM emulators, or in hybrid downscaling approaches. CORDEX provides dynamically downscaled climate projections over various global domains, offering higher resolution than GCMs and incorporating regional climate dynamics. However, these outputs may still require further downscaling for very local applications or may possess biases that ML can help correct.
6.2. High-Resolution Reference Datasets (Target Data)
- Gridded Observational Datasets: Products like PRISM (Parameter-elevation Regressions on Independent Slopes Model) for North America [8,73], Iberia01 for the Iberian Peninsula [74], E-OBS for Europe [75], and regional datasets like REKIS [76] are commonly used [8]. PRISM, for example, provides high-resolution (e.g., 800 m or 4 km) daily temperature and precipitation data across the conterminous United States, incorporating physiographic influences like elevation and coastal proximity into its interpolation [73]. These datasets are invaluable for training models in a perfect-prognosis setup, where historical observations are used as the target.
- Satellite-Derived Products: Satellite observations offer global or near-global coverage and are increasingly used as reference data. Notable examples include the Global Precipitation Measurement (GPM) mission’s Integrated Multi-satellitE Retrievals for GPM (IMERG) products for precipitation [77] and the Soil Moisture Active Passive (SMAP) mission for soil moisture [78]. GPM IMERG, for instance, provides precipitation estimates at resolutions like 0.1° and 30 min intervals, with various products (Early, Late, and Final Run) catering to different latency and accuracy requirements [77].
- Regional Reanalyses or High-Resolution Simulations: In some cases, outputs from high-resolution regional reanalyses or dedicated RCM simulations (sometimes run specifically for the purpose of generating training data) are used as the “truth” data, especially when high-quality gridded observations are scarce [18].
- FluxNet: For variables related to land surface processes and evapotranspiration, data from the FluxNet network of eddy covariance towers provide valuable site-level observational data for model validation [79]. These towers measure exchanges of carbon dioxide, water vapor, and energy between ecosystems and the atmosphere.
6.3. Key Downscaled Variables
- Daily Precipitation and 2-m Temperature: These are the most commonly downscaled variables because of their direct relevance for impact studies (e.g., agriculture, hydrology, health). Temperature targets include daily mean, minimum, and maximum values.
- Multivariate Downscaling: There is a growing trend towards downscaling multiple climate variables simultaneously (e.g., temperature, precipitation, wind speed, solar radiation, humidity). This is important for ensuring physical consistency among the downscaled variables.
- Spatial/Temporal Scales: Typical downscaling efforts aim to increase resolution from GCM/Reanalysis scales of 25–100 km to target resolutions of 1–10 km, predominantly at a daily temporal resolution.
6.4. Feature Engineering and Selection
- Static Predictors: High-resolution static geographical features such as topography (including elevation, slope, and aspect), land cover type, soil properties, and climatological averages are frequently incorporated as additional predictor variables. These features provide crucial local context that is often unresolved in coarse-scale GCM or reanalysis outputs. For instance, orography heavily influences local precipitation patterns and temperature lapse rates, while land cover affects surface energy balance and evapotranspiration [37,73]. The inclusion of these static predictors allows ML models to learn how large-scale atmospheric conditions interact with local surface characteristics to produce fine-scale climate variations.
- Dynamic Predictors: For specific variables like soil moisture, dynamic predictors such as Land Surface Temperature (LST) and Vegetation Indices (e.g., NDVI, EVI) derived from satellite remote sensing are often used, as these variables capture short-term fluctuations related to surface energy and water balance [80].
- Dimensionality Reduction and Collinearity: When dealing with many potential predictors, dimensionality reduction techniques like Principal Component Analysis (PCA) are sometimes employed to reduce the number of input features while retaining most of the variance. This can help to mitigate issues related to collinearity among predictors and reduce computational load. Regularization techniques (e.g., L1 or L2 regularization) embedded within many ML models also implicitly handle collinearity by penalizing large model weights.
6.5. Data Preprocessing Challenges
- Data-Scarce Areas: A significant hurdle is the availability of sufficient high-quality, high-resolution reference data for training and validation, especially in many parts of the developing world or in regions with complex terrain where observational networks are sparse [22]. In data-scarce regimes, deep learning models are prone to overfitting. However, this limitation is increasingly mitigated via Transfer Learning. Recent studies demonstrate that models pre-trained on data-rich domains (e.g., North America) or global reanalysis datasets (ERA5) learn universal atmospheric features—such as adiabatic lapse rates and frontal boundary structures—that are physically valid globally. By fine-tuning these pre-trained backbones with limited local data, robust performance can be achieved even in regions with sparse observational networks, effectively leveraging the “learned physics” from data-abundant regions.
- Imbalanced Data for Extreme Events: Extreme climatic events (e.g., heavy precipitation, heatwaves) are, by definition, rare. This leads to imbalanced datasets where extreme values are underrepresented, potentially biasing ML models (trained with standard loss functions like MSE) to perform well on common conditions but poorly on these critical, high-impact events. This issue often hinders models from learning the specific characteristics of extremes.
- Ensuring Domain Consistency: Predictor variables derived from GCM simulations often exhibit different statistical properties (e.g., means, variances, distributions) and systematic biases compared with the reanalysis data (such as ERA5) commonly used for model training. This mismatch, known as domain or covariate shift, is present even in historical periods and violates the standard ML assumption that training and application data are independent and identically distributed (IID), which degrades model performance. It is therefore a critical preprocessing consideration. Mitigation techniques include bias correction of GCM predictors, working with anomalies (removing climatological means from both predictor and predictand data to focus on changes), and more advanced domain adaptation methods [81].
- Quality Control and Gap-Filling: Observational and satellite-derived datasets frequently require substantial preprocessing steps, including quality control to remove erroneous data, and gap-filling techniques (e.g., interpolation) to handle missing values because of sensor malfunction or environmental conditions (like cloud cover for satellite imagery) [82].
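The anomaly approach mentioned under domain consistency can be sketched in a few lines. This is an illustrative example only: `to_anomalies`, the variable names, and the temperature values are hypothetical, and each dataset is assumed to be referenced to its own climatology over a common baseline period.

```python
from statistics import mean


def to_anomalies(series, baseline):
    """Subtract the baseline-period climatological mean from every value."""
    clim = mean(baseline)
    return [v - clim for v in series]


era5_t2m = [14.0, 15.0, 16.0]   # hypothetical reanalysis values (degC)
gcm_t2m = [16.5, 17.5, 18.5]    # hypothetical GCM values with a +2.5 degC offset

# Each source is referenced to its own climatology over the same period.
era5_anom = to_anomalies(era5_t2m, era5_t2m)
gcm_anom = to_anomalies(gcm_t2m, gcm_t2m)
```

After the transformation the constant offset between the two sources disappears. Differences in variance or distribution shape are not removed by this step and would still call for bias correction or domain adaptation.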
6.6. Quantitative Benchmarks and Methodological Uncertainties in Preprocessing
6.6.1. Normalization Sensitivity and Extremes
6.6.2. Regridding Artifacts and Representativeness
6.6.3. The Bias Correction Paradox
7. A Prescriptive Protocol for Model Evaluation
7.1. Variable-Specific Minimum Suites
7.1.1. Protocol for Precipitation Downscaling
- RMSE (baseline): Report for average error and legacy comparison, while acknowledging that it can penalize realistic high-frequency variability.
- FSS (primary spatial metric): Report the Fraction Skill Score (FSS) across multiple intensity thresholds and neighborhood sizes [86]. We recommend thresholds relevant to hydrological impacts (depending on local severity/return period), e.g., 1, 5, and 20 mm/day, and reporting FSS as a function of neighborhood size; 10, 20, 40, and 80 km can serve as illustrative scales, but should be chosen to identify the scale at which forecasts become skillful.
- Extreme-value fidelity: Report bias or absolute error at a high quantile (e.g., the 99th or 99.5th percentile) to directly assess rare, intense event magnitude. Complementary indices such as Rx1day and R99p help confirm tail behavior.
- PSD (spatial realism): Plot the 1D radially averaged power spectrum versus reference data. An overly steep slope indicates excessive smoothing; a shallow slope or high-frequency bumps can indicate unrealistic noise or GAN-induced artifacts.
- Probabilistic calibration (generative models): Report CRPS and RPSS to assess whether predictive distributions encompass observed outcomes; for probabilistic ensembles (e.g., GAN/Diffusion), CRPS is the primary overall skill score [87].
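As a concrete reference for the FSS item above, the following is a minimal pure-Python sketch of the score for a single threshold and neighborhood size. The function names and the toy fields are hypothetical, and windows are simply truncated at the domain edge, which is one of several possible boundary treatments. An operational implementation would use array libraries and convolution for speed.

```python
def neighborhood_fractions(field, threshold, half_width):
    """Fraction of cells >= threshold in a square window around each cell.

    Windows are truncated at the domain edge (an assumption of this sketch;
    zero-padding is another common choice).
    """
    rows, cols = len(field), len(field[0])
    fractions = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            hits = count = 0
            for a in range(max(0, i - half_width), min(rows, i + half_width + 1)):
                for b in range(max(0, j - half_width), min(cols, j + half_width + 1)):
                    hits += field[a][b] >= threshold
                    count += 1
            fractions[i][j] = hits / count
    return fractions


def fss(pred, ref, threshold, half_width):
    """Fraction Skill Score: 1 - sum((Pf - Po)^2) / (sum(Pf^2) + sum(Po^2))."""
    pf = neighborhood_fractions(pred, threshold, half_width)
    po = neighborhood_fractions(ref, threshold, half_width)
    num = sum((f - o) ** 2 for rf, ro in zip(pf, po) for f, o in zip(rf, ro))
    den = sum(f * f + o * o for rf, ro in zip(pf, po) for f, o in zip(rf, ro))
    return 1.0 - num / den if den > 0 else float("nan")


# Hypothetical 4x4 daily precipitation fields (mm/day): the predicted
# rain cell is displaced by one grid cell relative to the reference.
ref = [[0, 0, 0, 0], [0, 12, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
pred = [[0, 0, 0, 0], [0, 0, 12, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

fss_gridscale = fss(pred, ref, threshold=5.0, half_width=0)
fss_neighborhood = fss(pred, ref, threshold=5.0, half_width=1)
```

At grid scale the displaced cell receives no credit, while a 3x3 neighborhood yields a positive score, which is the behaviour that motivates reporting FSS as a function of neighborhood size.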
7.1.2. Protocol for Temperature Downscaling
- RMSE and Bias: Report RMSE and mean bias (downscaled minus reference) as standard accuracy/systematic-error metrics.
- PSD: Verify realistic spatial variability and detect over-smoothing.
- Distributional metrics (e.g., Wasserstein distance): Compare full distributions to capture shifts in shape and tails beyond mean/variance.
- Reliability diagram (probabilistic models): Assess calibration against the 1:1 diagonal.
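For the distributional metric above, a minimal sketch of the one-dimensional Wasserstein-1 distance is given below. It assumes two empirical samples of equal size, in which case the distance reduces to the mean absolute difference between sorted values; `wasserstein_1d` and the sample temperatures are hypothetical. For unequal sample sizes a library routine such as SciPy's `scipy.stats.wasserstein_distance` would be the more natural choice.

```python
def wasserstein_1d(sample_a, sample_b):
    """W1 distance between two equal-size empirical samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    a, b = sorted(sample_a), sorted(sample_b)
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


reference = [10.0, 12.0, 14.0, 16.0]    # hypothetical reference temperatures
warm_shift = [11.0, 13.0, 15.0, 17.0]   # same shape, shifted by +1 degC
compressed = [12.0, 13.0, 13.0, 14.0]   # same mean, variance suppressed

d_shift = wasserstein_1d(reference, warm_shift)
d_compressed = wasserstein_1d(reference, compressed)
```

The compressed sample has zero mean bias but a larger distance than the shifted one, which illustrates why a distributional metric is reported alongside RMSE and bias.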
7.2. Comparative Analysis and State of the Art
- Spatial structure and deterministic accuracy: U-Net and ResNet-based CNNs remain strong contenders, especially for smoother variables (e.g., temperature), because of inductive bias for local patterns and topographically induced variations [8].
- Probabilistic outputs and UQ: Diffusion models are emerging as state-of-the-art because of stable training and high-fidelity, diverse ensembles [9,31], often outperforming GANs on distributional metrics. As a simple, strong epistemic-UQ baseline, report deep ensembles [89] with CRPS and reliability diagnostics.
- Transferability and zero-shot generalization: Transformer-based foundation models represent the cutting edge, enabling generalization to new resolutions/regions with minimal fine-tuning and improved operational scalability [15].
7.3. Validation Under Non-Stationarity
7.3.1. Pseudo-Global Warming (PGW) Experiments
7.3.2. Transfer Learning and Domain Adaptation
- Pre-train on large, diverse datasets (e.g., multiple GCMs, long records) to learn general/invariant atmospheric features.
- Fine-tune on smaller target datasets (e.g., a region, future period, or new GCM) [17], improving generalization and reducing target-data needs; careful validation is required to avoid importing source-domain biases. Prasad et al. [17] show that pre-training can enhance zero-shot transferability for some tasks, though fine-tuning often remains necessary for optimal performance on distinct targets such as different GCM outputs.
7.3.3. Process-Informed Architectures and Predictor Selection
- Encoding known physical relationships into the architecture (e.g., layers/connections that mimic processes or constraints).
- Using physically motivated predictors (e.g., potential temperature, specific humidity, circulation indices) instead of large, collinear, or causally weak predictor sets.
7.3.4. Validation Strategies for Non-Stationary Conditions
- Perfect Model Framework (Pseudo-Reality): Treat high-resolution GCM/RCM output as “truth” [7]; train on its coarsened version and reconstruct the original truth. This enables testing across different climate states (historical vs. future periods) with known truth, directly probing extrapolation.
- Cross-GCM Validation: Train on a subset of GCMs and test on withheld GCMs to assess generalization across structural differences and biases.
- Temporal Extrapolation (Out-of-sample): Train on earlier periods and test on the most recent record or distinct climatic periods (e.g., warmest historical years as future proxies) [8].
- Process-Based Evaluation: Verify physically plausible inter-variable relationships (e.g., temperature–precipitation scaling, wind–pressure) and key processes (diurnal cycles, seasonal transitions, extremes) under different conditions; XAI can help assess whether mechanisms are physically sound.
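The temporal-extrapolation strategy listed above (holding out the warmest historical years as a proxy for future conditions) can be sketched as a simple split routine. The function name, the `holdout_fraction` parameter, and the annual means are hypothetical; the sketch only illustrates the idea and is not a prescribed procedure.

```python
def warmest_year_split(annual_mean_temp, holdout_fraction=0.2):
    """Hold out the warmest years as a proxy for future conditions.

    annual_mean_temp maps year -> annual mean temperature.
    Returns (train_years, test_years), each sorted chronologically.
    """
    ranked = sorted(annual_mean_temp, key=annual_mean_temp.get)  # coolest first
    n_test = max(1, round(len(ranked) * holdout_fraction))
    return sorted(ranked[:-n_test]), sorted(ranked[-n_test:])


# Hypothetical regional annual means (degC).
annual = {2001: 14.1, 2002: 14.9, 2003: 14.3, 2004: 15.2, 2005: 14.0}
train_years, test_years = warmest_year_split(annual, holdout_fraction=0.4)
```

Unlike a random split, the test years here lie at the warm edge of the training distribution, so the evaluation probes extrapolation and not interpolation.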
7.4. A Multi-Faceted Toolkit for Model Evaluation
Uncertainty Baselines
7.5. Tier 1: Mandatory Baseline Diagnostics
- Baseline error (RMSE/MAE) and bias: Necessary for legacy comparison and diagnosing systematic wet/dry drifts, despite the “double penalty” effect.
- Texture and spectral realism (PSD): Mandatory to detect “spectral drop-off” (blurring); models whose spectral slope departs markedly from that of the reference data are physically deficient.
- Distributional sanity checks (QQ plot): Required to detect “distributional collapse” (regression to the mean), especially in tails.
7.6. Tier 2: Essential Operational Standards
- Spatial verification (FSS): FSS decouples displacement from intensity error and prevents selecting smooth, unrealistic models [86].
- Extreme-value fidelity: Report high-quantile bias (e.g., at the 99th percentile) or tail-dependence metrics to ensure rare events are captured.
| Category | Metric | Description and Use Case | When to Use | Key Refs |
|---|---|---|---|---|
| Pixel-wise Accuracy | RMSE/MAE | Root Mean Squared Error/Mean Absolute Error. Standard metrics for average error, but can be misleading for skewed distributions (e.g., precipitation) and penalize realistic high-frequency variations. | Standard baseline, but use with caution; supplement with other metrics. | [11] |
| Spatial Structure | Structural Similarity (SSIM) | Measures perceptual similarity between images based on luminance, contrast, and structure. Better than RMSE for assessing preservation of spatial patterns. | To evaluate preservation of spatial patterns and textures. | [46] |
| Spatial Structure | Power Spectral Density (PSD) | Compares the variance at different spatial frequencies. Crucial for diagnosing overly smooth outputs (loss of high-frequency power) or GAN-induced artifacts (spurious power). | To diagnose smoothing or unrealistic high-frequency noise. | [88,90] |
| Spatial Structure | Variogram Analysis | Geostatistical tool that quantifies spatial correlation as a function of distance. Comparing nugget, sill, and range diagnoses noise, variance suppression, and incorrect spatial correlation length. | To quantitatively assess spatial dependency structure and diagnose over-smoothing. | [91] |
| Spatial Structure | Method for Object-based Diagnostic Evaluation (MODE) | Identifies and compares attributes (e.g., area, location, orientation, intensity) of distinct objects (e.g., storms). Provides diagnostic information on specific spatial biases beyond grid-point errors. | For detailed diagnostic evaluation of precipitation fields, avoiding the “double penalty” issue. | [92,93] |
| Temporal Coherence | Temporal Autocorrelation | Measures the correlation of a time series with itself at a given lag (e.g., lag-1 for daily data). Assesses the model’s ability to reproduce temporal persistence or “memory”. | To diagnose unrealistic temporal “flickering” or lack of persistence in time series. | [94,95] |
| Temporal Coherence | Wet/Dry Spell Characteristics | Quantifies the statistics of consecutive days above/below a threshold (e.g., 1 mm/day for precipitation). Key metrics include mean/max spell duration, frequency, and cumulative intensity. | Essential for impact studies related to droughts and floods; evaluates temporal clustering of events. | [96,97] |
| Extreme Events | Fraction Skill Score (FSS) | A neighborhood-based verification metric that assesses the skill of forecasting events exceeding a certain threshold across different spatial scales. Mitigates the “double penalty” issue. | Essential for verifying precipitation fields at specific thresholds. | [86,90] |
| Extreme Events | Quantile-based scores (e.g., 99th percentile error) | Directly evaluates the accuracy of specific quantiles (e.g., p95, p99), focusing on performance in the tails of the distribution. | To specifically quantify performance on rare, high-impact events. | [36] |
| Extreme Events | Return Level/Period Consistency | Compares the magnitude of extreme events for given return periods (e.g., the 1-in-100-year event) between the downscaled output and observations, often using Extreme Value Theory. | For climate impact studies where long-term risk from extremes is key. | [98] |
| Distributional Similarity | Wasserstein Distance (Earth Mover’s Distance) | Measures the “work” required to transform one probability distribution into another. A robust measure of similarity between the full distributions of the downscaled and reference data. | For a rigorous comparison of the entire statistical distribution. | [33,99] |
| Distributional Similarity | CRPS (Continuous Ranked Probability Score) | For probabilistic forecasts, measures the integrated squared difference between the predicted cumulative distribution function (CDF) and the observed value. A proper scoring rule that generalizes MAE. | Gold standard for evaluating probabilistic/ensemble forecast skill. | [87,90] |
| Distributional Similarity | Perkins Skill Score (PSS) | Measures the common overlapping area between two probability density functions (PDFs). An intuitive, distribution-agnostic metric of overall distributional similarity. | To provide a robust, integrated score of distributional overlap, common in climate model evaluation. | [100] |
| Uncertainty Quantification (UQ) | Reliability Diagram | Plots observed frequencies against forecast probabilities for binned events to assess calibration. A perfectly calibrated model lies on the diagonal. | To assess if forecast probabilities are statistically reliable. | [90] |
| Uncertainty Quantification (UQ) | PIT Histogram | Probability Integral Transform. For a calibrated ensemble, the PIT values of the observations should be uniformly distributed. Deviations indicate biases or incorrect spread. | To diagnose issues with ensemble spread and bias. | [90] |
| Physical Consistency | Conservation Error | Directly measures the violation of a conservation law (e.g., mass, energy) by comparing the aggregated high-resolution output to the coarse-resolution input value. | When conservation of a quantity is a critical physical constraint. | [25] |
| Physical Consistency | Multivariate Correlations | Assesses whether the physical relationships and correlations between different downscaled variables (e.g., temperature and humidity) are preserved realistically. | Essential for multi-variable downscaling to ensure physical coherence. | [9] |
| Physical Consistency | Clausius–Clapeyron Scaling | Verifies if the intensity of extreme precipitation scales with temperature at the physically expected rate (approximately 7%/°C). Tests if the model has learned a fundamental thermodynamic relationship. | Critical for assessing the credibility of future projections of extremes under warming. | [12] |
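To make one of the distributional entries in the table concrete, the Perkins Skill Score can be sketched as the sum, over common bins, of the smaller of the two relative frequencies. The function name, the `bin_width` parameter, and the samples are hypothetical; the choice of bin width affects the score and should be reported with it.

```python
import math


def perkins_skill_score(pred, ref, bin_width=1.0):
    """Overlap of two empirical PDFs: sum over bins of min(freq_pred, freq_ref)."""
    def binned_freq(values):
        freq = {}
        for v in values:
            k = math.floor(v / bin_width)          # index of the bin holding v
            freq[k] = freq.get(k, 0.0) + 1.0 / len(values)
        return freq

    fp, fr = binned_freq(pred), binned_freq(ref)
    return sum(min(fp.get(k, 0.0), fr[k]) for k in fr)


reference = [0.5, 1.5, 2.5, 3.5]     # hypothetical reference values
downscaled = [0.5, 1.5, 1.6, 2.5]    # hypothetical output missing the upper tail

pss = perkins_skill_score(downscaled, reference)
```

A score of 1 indicates identical binned distributions. In this example the downscaled sample never reaches the highest bin, so a quarter of the reference probability mass is left unmatched.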
7.7. Tier 3: Advanced and Probabilistic Standards
- Probabilistic calibration (CRPS): For generative models (Diffusion, GANs), deterministic evaluation is ill-posed; CRPS is a strictly proper scoring rule and gold standard for ensemble assessment [87].
- Non-stationarity stress tests (PGW): PGW injects thermodynamic signals into historical dynamics to stress-test extrapolation (e.g., Clausius–Clapeyron scaling) prior to deployment.
- Physical consistency: Require conservation checks (mass/energy budget closure) to detect fundamental physical violations.
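For the CRPS item above, a commonly used sample-based estimator for a finite ensemble is the mean absolute error of the members minus half the mean absolute difference between all member pairs. The sketch below implements that estimator for a single grid point; the function name and the sample values are hypothetical.

```python
def crps_ensemble(members, obs):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    m = len(members)
    error_term = sum(abs(x - obs) for x in members) / m
    spread_term = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return error_term - spread_term


obs = 5.0                                   # hypothetical observed value
crps_single = crps_ensemble([3.0], obs)     # one member: plain absolute error
crps_sharp = crps_ensemble([4.0, 6.0], obs)
crps_wide = crps_ensemble([0.0, 10.0], obs)
```

With a single member the score reduces to the absolute error, consistent with CRPS generalizing MAE. Of the two ensembles centred on the observation, the tighter one scores better (lower), so the score rewards sharpness as well as accuracy.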
7.8. Diagnostic Visualization Suite
- Parity (predicted vs. reference) plots: Reveal additive bias (intercept), multiplicative bias (slope), heteroscedasticity (error increasing with intensity), and tail distortion (systematic under-/over-estimation of extremes). When stratified by season, elevation/coast distance, or intensity bins, they also expose regime-dependent failures that are otherwise hidden by a single MAE/RMSE. Figure 6 demonstrates this approach, showing both in-domain and out-of-domain (transfer) performance to assess model generalization.
- Confusion matrices for thresholded events: For impact-relevant exceedances (e.g., above P95/P99 or fixed hydrologic thresholds), a confusion matrix makes the trade-off between misses and false alarms explicit. Reporting derived skill measures (e.g., POD/recall, FAR, CSI, precision, and bias) alongside the matrix clarifies whether the model is conservative (many misses) or over-active (many false alarms).
- Correlation heatmaps and dependence checks: Heatmaps of inter-variable correlations (and, when relevant, lag/lead correlations) help verify whether the model preserves physical coherence rather than matching marginal distributions only. Comparing correlation structure between the reference and predictions can reveal “physically implausible” dependence patterns even when pixel-wise errors are small.
- Spatial error maps and structural diagnostics: Maps of mean bias, MAE/RMSE, and (for precipitation) event-based structure metrics (e.g., neighborhood/FSS-style summaries) localize systematic errors (orographic/coastal bands, convective hotspots, land–sea boundaries). This directly supports model development by tying failures to geography and process regimes. Figure 7 exemplifies this diagnostic approach, comparing spatial outputs and difference maps to reveal how deep learning methods recover the fine-scale topographic detail that traditional methods smooth out.
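The thresholded-event diagnostics listed above can be derived from four contingency counts. The following sketch computes the counts and the standard derived scores for paired predicted and reference values; the function name, the dictionary keys, and the sample series are hypothetical, and FAR denotes the false alarm ratio.

```python
def event_scores(pred, ref, threshold):
    """Contingency counts and derived scores for exceedances of a threshold."""
    hits = misses = false_alarms = correct_negatives = 0
    for p, r in zip(pred, ref):
        p_event, r_event = p >= threshold, r >= threshold
        if p_event and r_event:
            hits += 1
        elif r_event:
            misses += 1
        elif p_event:
            false_alarms += 1
        else:
            correct_negatives += 1

    def ratio(num, den):
        # Undefined scores (no events of the relevant kind) are returned as NaN.
        return num / den if den else float("nan")

    return {
        "hits": hits,
        "misses": misses,
        "false_alarms": false_alarms,
        "correct_negatives": correct_negatives,
        "POD": ratio(hits, hits + misses),
        "FAR": ratio(false_alarms, hits + false_alarms),
        "CSI": ratio(hits, hits + misses + false_alarms),
        "bias": ratio(hits + false_alarms, hits + misses),
    }


# Hypothetical daily precipitation (mm/day) at six points, 20 mm/day threshold.
reference = [25, 3, 30, 0, 22, 1]
predicted = [21, 24, 10, 0, 28, 2]
scores = event_scores(predicted, reference, threshold=20)
```

Here the frequency bias is 1.0 even though one event is missed and one false alarm is raised, which is why the bias is reported together with POD, FAR, and CSI and not on its own.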
7.9. Operational Relevance: Beyond Statistical Skill
- Computational Cost: Dynamical downscaling is exceptionally expensive, limiting its use for large ensembles. ML offers a computationally cheaper alternative by orders of magnitude [9,63]. However, costs vary within ML: inference with CNNs is fast, while the iterative sampling of diffusion models is slower. Training large foundation models requires massive computational resources, but once trained, fine-tuning and inference can be efficient [16]. The hybrid dynamical–generative approach offers a compelling trade-off, drastically cutting the cost of the most expensive part of the physical simulation pipeline [9].
- Interpretability: As discussed in Section 9.2.2, the “black-box” nature of deep learning is a major barrier to operational trust. The ability to use XAI tools to verify that a model is learning physically meaningful relationships, rather than spurious “shortcuts”, is crucial for deployment in high-stakes applications.
- Robustness and Generalization: The single most important factor for operational relevance is a model’s ability to generalize to out-of-distribution (OOD) data, namely future climate scenarios. As detailed in Section 9.1, models that fail under covariate or concept drift are not operationally viable for climate projection. Therefore, rigorous OOD evaluation using techniques like cross-GCM validation and Pseudo-Global Warming (PGW) experiments is a prerequisite for deployment.
- Baselines: Always include strong classical comparators (e.g., BCSD/quantile-mapping and LOCA) as default references alongside modern DL models; these remain common operational choices in hydrologic and climate-service pipelines [102,103]. Formal assessments and national products continue to operationalize statistical interfaces between GCMs and impacts—bias adjustment and empirical/statistical downscaling (e.g., LOCA2, STAR-ESDM)—as default pathways, which underscores why ML downscalers must demonstrate clear, application-relevant added value [104,105].
8. Critical Investigation of Model Performance and Rationale
8.1. Rationale for Model Choices
- CNNs/U-Nets for Spatial Patterns: These architectures are predominantly chosen for their proficiency in learning hierarchical spatial features from gridded data. Convolutional layers are adept at identifying local patterns, while pooling layers capture broader contextual information. U-Nets, with their encoder–decoder structure and skip connections, are particularly favored for tasks requiring precise spatial localization and preservation of fine details, making them well-suited for downscaling variables like temperature and precipitation where spatial structure is paramount [8].
- LSTMs/ConvLSTMs for Temporal Dependencies: When the temporal evolution of climate variables and their sequential dependencies are critical (e.g., for daily precipitation sequences or hydrological runoff forecasting), LSTMs and ConvLSTMs are preferred because of their recurrent nature and ability to capture long-range temporal patterns.
- GANs/Diffusion Models for Realistic Outputs and Extremes: These generative models are selected when the objective is to produce downscaled fields that are not only statistically accurate but also perceptually realistic, with sharp gradients and a better representation of the full statistical distribution, including extreme events [8].
- Transformers for Long-Range Dependencies: The increasing adoption of Transformer architectures is driven by their powerful self-attention mechanisms, which allow them to model global context and long-range dependencies in both spatial and temporal dimensions, a capability that can be beneficial for complex climate system dynamics [16,57].
8.2. Strategic Framework for Architecture Selection
- 1.
- Resource-Constrained/Mean-State Applications: If inference latency must be minimal, CNNs (U-Net, ResNet) are optimal. Their deterministic, single-pass inference ensures speed, at the cost of potentially smoothing high-frequency variability relative to generative baselines.
- 2.
- Risk Assessment/Extremes: For applications requiring accurate heavy tails, Diffusion Models are the state of the art. Tomasi et al. [31] demonstrate that latent diffusion models can mimic kilometer-scale dynamics with high fidelity, recovering the small-scale variance lost by standard CNNs.
- 3.
- Texture Synthesis/Energy Assessment: For renewable-energy applications, where spectral realism is paramount, GANs offer a compromise: they generate sharp textures quickly but carry a risk of mode collapse.
- 4.
- Continental Domains/Teleconnections: For domains driven by remote teleconnections, spectral or operator-based architectures with global receptive fields, such as FourCastNet, capture global context better than local CNNs.
8.3. The Coherent Pipeline: Linking Loss, Architecture, and Validation
- Loss–Architecture Alignment: Using a generative architecture (GAN) with a dominant MSE loss negates the generative benefit, collapsing the output back to the mean. Conversely, utilizing a probabilistic metric like CRPS requires a stochastic architecture (Diffusion or Dropout-Ensemble) to be mathematically meaningful.
- Predictor–Physics Alignment: The architecture must be supplied with predictors that carry the relevant physical signal. For example, a Vision Transformer designed to capture teleconnections will fail if the input domain is too small to contain the large-scale driver (predictor mismatch).
- Validation–Goal Alignment: A pipeline designed for extremes (using Weighted Loss) will appear to fail if evaluated strictly on RMSE, which penalizes realistic but slightly displaced fine-scale variability (the double-penalty effect). It must instead be validated using FSS or Tail-Dependence metrics.
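The point about CRPS requiring a stochastic architecture can be made concrete with the standard ensemble estimator, CRPS = E|X − y| − ½ E|X − X′|. The following is a minimal NumPy sketch; the function name `crps_ensemble` and the toy numbers are illustrative assumptions rather than part of any cited method.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS for one scalar observation and an ensemble of forecasts.

    Uses the estimator E|X - y| - 0.5 * E|X - X'|, where X and X' are
    drawn independently from the ensemble members.
    """
    members = np.asarray(members, dtype=float)
    term_obs = np.mean(np.abs(members - obs))
    term_spread = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term_obs - term_spread

# A deterministic model is a one-member ensemble: the spread term vanishes
# and CRPS reduces to the absolute error.
deterministic = crps_ensemble([2.0], obs=5.0)

# A stochastic model whose members bracket the observation scores lower.
stochastic = crps_ensemble([3.0, 5.0, 7.0], obs=5.0)
```

With a single member the score carries no more information than the mean absolute error, which is why pairing CRPS with a purely deterministic network adds little.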
8.4. Factors Contributing to Model Success
- Appropriate Architectural Design: Matching the model architecture to the inherent characteristics of the data and the downscaling task is paramount. For instance, CNNs are well-suited for gridded spatial data, while LSTMs excel with time series. The incorporation of architectural enhancements, such as residual connections and the skip connections characteristic of U-Nets, has proven crucial for training deeper models and preserving fine-grained spatial detail.
- Effective Feature Engineering: The performance of ML models is significantly boosted by the inclusion of relevant predictor variables. In particular, incorporating high-resolution static geographical features such as topography, land cover, and soil type provides essential local context that coarse-resolution GCMs or reanalysis products inherently lack. This allows the model to learn how large-scale atmospheric conditions are modulated by local surface characteristics.
- Quality and Representativeness of Training Data: The availability of sufficient, high-quality, and representative training data is fundamental. Data augmentation techniques, such as rotation or flipping of input fields, can expand the training set and improve model generalization, especially for underrepresented phenomena like extreme events [21,106].
- Appropriate Loss Functions: The choice of loss function used during model training significantly influences the characteristics of the downscaled output. While standard loss functions like MSE are common, they can lead to overly smooth predictions and poor representation of extremes. Tailoring loss functions to the specific task—for example, using quantile loss, Bernoulli-Gamma loss for precipitation (which models occurrence and intensity separately), Dice loss for imbalanced data, or the adversarial loss in GANs for perceptual quality—can lead to substantial improvements in capturing critical aspects of the climate variable’s distribution [8]. Studies show that L1 and L2 loss functions perform differently depending on data balance, with L2 often being better for imbalanced data like precipitation [83].
- Rigorous Validation Frameworks: The use of robust validation strategies, including out-of-sample testing and standardized evaluation metrics beyond simple error scores (e.g., the VALUE framework [107]), is crucial for assessing true model skill and generalizability.
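As an illustration of the task-specific losses mentioned above, a Bernoulli-Gamma negative log-likelihood can be sketched as follows. It assumes the network emits three values per grid cell: a wet-day probability and a Gamma shape and scale for the wet-day amount. The parameterization, the function name, and the toy inputs are assumptions made for this sketch, not a specific published implementation.

```python
import numpy as np
from math import lgamma

_log_gamma = np.vectorize(lgamma)  # element-wise log Gamma function

def bernoulli_gamma_nll(y, p_wet, shape, scale, eps=1e-8):
    """Mean negative log-likelihood of a Bernoulli-Gamma mixture.

    Dry cells (y == 0) contribute -log(1 - p_wet); wet cells contribute
    -[log(p_wet) + log Gamma(y; shape, scale)].
    """
    y = np.asarray(y, dtype=float)
    p_wet = np.clip(np.asarray(p_wet, dtype=float), eps, 1.0 - eps)
    shape = np.asarray(shape, dtype=float)
    scale = np.asarray(scale, dtype=float)

    wet = y > 0
    y_safe = np.where(wet, y, 1.0)  # avoid log(0) on dry cells
    log_gamma_pdf = ((shape - 1.0) * np.log(y_safe) - y_safe / scale
                     - shape * np.log(scale) - _log_gamma(shape))
    nll = np.where(wet, -(np.log(p_wet) + log_gamma_pdf), -np.log1p(-p_wet))
    return float(np.mean(nll))

# One dry and one wet observation, with illustrative parameter values.
obs = np.array([0.0, 2.0])
loss = bernoulli_gamma_nll(obs, p_wet=0.5, shape=1.0, scale=1.0)
```

Because occurrence and intensity enter as separate terms, the loss does not reward the drizzle-everywhere compromise that MSE tends to produce.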
8.5. Factors Hindering Model Learning
Comparative Susceptibility to Physical Inconsistency
- CNNs (Spectral Smoothing): Purely statistical CNNs minimize pixel-wise error, encouraging averaging. Physically, this manifests as a violation of energy conservation at small scales—the Power Spectral Density (PSD) drops off rapidly at high wavenumbers.
- GANs (Structural Hallucination): While GANs restore PSD, they are prone to structural hallucinations. Annau et al. [88] explicitly document how GANs can generate realistic-looking wind features in physically inconsistent locations to satisfy the discriminator, leading to artifacts in geophysical fields.
- Transformers (Boundary Artifacts): Models that process data in patches, such as certain Vision Transformers, face the risk of boundary artifacts. As analyzed by Pérez et al. [108], tiling approaches can lead to performance degradation or discontinuities at patch borders if not managed with careful overlap strategies.
- Overfitting: Models may learn noise, dataset-specific artifacts, or spurious correlations that appear predictive in-sample but fail under domain shift (e.g., new regions, new GCMs, or future climates), especially with highly flexible DL models and limited or non-diverse training data. Current ML practice addresses this risk through the following measures:
- (i) regularization and capacity control (weight decay, dropout, spectral/weight normalization where appropriate, and conservative architecture sizing), (ii) early stopping and robust training protocols (learning-rate schedules, checkpoint selection on out-of-sample validation, and monitoring of extreme-focused diagnostics rather than loss alone), (iii) data strategies (augmentation, regime-balanced sampling for rare extremes, and careful handling of leakage in spatio-temporal splits), (iv) validation designed to expose overfit (spatial blocking, temporal blocking, cross-region tests, and cross-GCM/PGW-style tests when the model is intended for future projection), and (v) uncertainty-aware stabilization (ensembles or MC-dropout-style approximations) to reduce variance and improve calibration. In combination, these practices respond directly to the standard overfitting critique in high-dimensional climate settings by making generalization a first-class design constraint rather than an afterthought.
- Poor Generalization (The “Transferability Crisis”), Covariate Shift, Concept Drift, and Shortcut Learning: A major and persistent challenge is the failure of models to extrapolate reliably to conditions significantly different from those encountered during training. This “transferability crisis” is the core of the “performance paradox” and is rooted in the violation of the stationarity assumption. It can be rigorously framed using established machine learning concepts:
- −
- Covariate Shift: This occurs when the distribution of input data, $P(X)$, changes between training and deployment, while the underlying relationship $P(Y \mid X)$ remains the same [109]. In downscaling, this is guaranteed when applying a model trained on historical reanalysis (e.g., ERA5) to the outputs of a GCM, which has its own systematic biases and statistical properties. It also occurs when projecting into a future climate where the statistical distributions of atmospheric predictors (e.g., mean temperature, storm frequency) have shifted.
- −
- Concept Drift: This is a more fundamental challenge where the relationship between predictors and the target variable, $P(Y \mid X)$, itself changes [109]. Under climate change, the physical processes linking large-scale drivers to local outcomes might be altered (e.g., changes in atmospheric stability could alter lapse rates). A mapping learned from historical data may therefore become invalid.
- −
- Shortcut Learning: This phenomenon provides a mechanism to explain why models are so vulnerable to these shifts [110]. Models often learn “shortcuts”—simple, non-robust decision rules that exploit spurious correlations in the training data—instead of the true underlying physical mechanisms [110]. For example, a model might learn to associate a specific GCM’s known regional cold bias with a certain type of downscaled precipitation pattern. This shortcut works perfectly for that GCM but fails completely when applied to a different, unbiased GCM or to the real world, leading to poor OOD performance. The finding by González-Abad et al. [65] that models may rely on spurious teleconnections is a prime example of shortcut learning in this domain.
- Lack of Physical Constraints: Purely data-driven ML models, optimized solely for statistical accuracy, can produce outputs that are physically implausible or inconsistent (e.g., violating conservation laws). This lack of physical grounding can severely limit the trustworthiness and utility of downscaled projections.
- Data Limitations: Insufficient training data, particularly for rare or extreme events, remains a significant bottleneck. Data scarcity in certain geographical regions also poses a challenge for developing globally applicable models. Furthermore, the lack of training data that adequately represents the full range of potential future climate states can hinder a model’s ability to project future changes accurately.
- Inappropriate Model Complexity: Choosing an inappropriate level of model complexity can be detrimental. Models that are too simple may underfit the data, failing to capture complex relationships. Conversely, overly complex models are prone to overfitting, may be more difficult to train, and can be computationally prohibitive.
- Training Difficulties (e.g., Vanishing/Exploding Gradients): In very deep neural networks, especially plain CNNs without architectural aids like residual connections, the gradients used for updating model weights can become infinitesimally small (vanishing) or excessively large (exploding), hindering the learning process.
- Input Data Biases and Inconsistencies: Downscaling performance can degrade substantially because of systematic biases in GCM outputs or inconsistencies between the statistical characteristics of training data (e.g., reanalysis) and application data (e.g., GCM outputs from a different model or future period). Such inconsistencies are a form of the covariate shift discussed above. Preprocessing steps, such as bias correction of predictors or working with anomalies by removing climatology, are often crucial for mitigating these issues [7].
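The covariate shift described in this section can be quantified before deployment. The sketch below computes a (biased) squared Maximum Mean Discrepancy with a Gaussian kernel between two predictor samples. The bandwidth, sample sizes, and synthetic "historical", "control", and "future" samples are illustrative assumptions; in practice the rows would be standardized predictor vectors and the bandwidth would need tuning.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between x and y."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
historical = rng.normal(0.0, 1.0, size=(200, 3))  # training predictors
control = rng.normal(0.0, 1.0, size=(200, 3))     # same distribution
future = rng.normal(1.5, 1.0, size=(200, 3))      # shifted mean state

shift_control = mmd2(historical, control)
shift_future = mmd2(historical, future)
```

A value for the target domain that is clearly larger than the value for a held-out sample from the training domain signals that the model will be operating out of distribution.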
8.6. Comparative Analysis of ML Approaches
9. Overarching Challenges in ML-Based Climate Downscaling
9.1. Transferability and Domain Adaptation: The Achilles’ Heel
- Extrapolation to Future Climates: Models trained exclusively on historical climate data often struggle to perform reliably when applied to future climate scenarios characterized by significantly different mean states, altered atmospheric dynamics, or novel patterns of variability. Studies by Hernanz et al. [120] demonstrated catastrophic drops in CNN performance when applied to future projections or GCMs not included in the training set. The models may learn statistical relationships that are valid for the historical period but do not hold under substantial climate change.
- Cross-GCM/RCM Transfer: Because of inherent differences in model physics, parameterizations, resolutions, and systematic biases, ML models trained to downscale the output of one GCM or RCM often exhibit degraded performance when applied to outputs from other climate models. This limits the ability to readily apply a single trained downscaling model across a multi-model ensemble.
- Spatial Transferability: A model developed and trained for a specific geographical region may not transfer effectively to other regions with different climatological characteristics, topographic complexities, or land cover types. Local adaptations are often necessary, which can be data-intensive.
- Domain Adaptation Techniques: These methods aim to explicitly adapt a model trained on a “source” domain (e.g., historical data from one GCM) to perform well on a “target” domain (e.g., future data from a different GCM) where labeled high-resolution data may be scarce or unavailable [8]. Transfer learning, a common route to such adaptation, exploits the hierarchical feature extraction of CNNs. The initial layers of a deep network typically learn fundamental, domain-agnostic primitives such as edges, gradients, and textural patterns. By “freezing” the weights of these lower layers and retraining only the final, task-specific layers on the target dataset, the dimensionality of the optimization problem is drastically reduced. This approach allows the model to adapt to new high-dimensional datasets without the massive sample sizes typically required to learn fundamental feature representations from scratch.
- Training on Diverse Data: A common strategy is to pre-train ML models on a wide array of data encompassing multiple GCMs, varied historical periods, and diverse geographical regions. The hypothesis is that exposure to greater variability will help the model learn more robust and invariant features that generalize better. For instance, Prasad et al. [17] found that training on diverse datasets (ERA5, MERRA2, NOAA CFSR) led to good zero-shot transferability for some tasks, though fine-tuning was still necessary for others, such as the two-simulation transfer involving NorESM data.
- Pseudo-Global Warming (PGW) Experiments: This approach involves training or evaluating models using historical data that has been perturbed to mimic certain aspects of future climate conditions (e.g., by adding a GCM-projected warming signal). This allows for a more systematic assessment of a model’s extrapolation capabilities under changed climatic states.
- Causal Machine Learning: There is growing interest in developing ML approaches that aim to learn underlying causal physical processes rather than just statistical correlations. Such models are hypothesized to be inherently more robust to distributional shifts.
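The PGW idea described above can be sketched in a few lines. The example adds a month-dependent warming signal to a historical temperature series. The array shapes, the helper name `apply_pgw_delta`, and the delta values are illustrative assumptions; real PGW experiments typically perturb several variables consistently rather than temperature alone.

```python
import numpy as np

def apply_pgw_delta(historical, months, monthly_delta):
    """Add a month-dependent climate-change signal to a historical series.

    historical:    array of shape (time, lat, lon)
    months:        array of shape (time,), calendar month 1..12 per step
    monthly_delta: array of shape (12, lat, lon), e.g. a GCM-projected
                   future-minus-historical monthly climatology difference
    """
    historical = np.asarray(historical, dtype=float)
    months = np.asarray(months)
    return historical + np.asarray(monthly_delta)[months - 1]

rng = np.random.default_rng(0)
hist = 280.0 + rng.normal(0.0, 2.0, size=(24, 4, 5))  # two years, monthly
months = np.tile(np.arange(1, 13), 2)

delta = np.full((12, 4, 5), 2.0)  # uniform 2 K warming (illustrative)
delta[5:8] += 1.0                 # stronger warming in June to August

warmed = apply_pgw_delta(hist, months, delta)
```

Evaluating a trained downscaler on `warmed` inputs, against a reference produced under the same perturbation, gives a controlled view of how skill changes as the mean state moves away from the training distribution.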
Quantitative Case Studies
- Cross-model transfer (temperature UNet emulator). In a pseudo-reality experiment, daily RMSE for a UNet emulator rose from ∼0.9 °C when evaluated on the same driving model used for training (UPRCM) to ∼2–2.5 °C when applied to unseen ESMs; for warm extremes (99th percentile) under future climate, biases were mostly within °C but reached up to 5 °C in some locations, and were larger than a linear baseline [120].
- GAN downscaling artifacts (near-surface winds). Deterministic GAN super-resolution exhibited systematic low-variance (low-power) bias at fine scales and, under some partial frequency-separation settings, isolated high-power spikes at intermediate wavenumbers; allowing the adversarial loss to act across all frequencies restored fine-scale variance, but it also raised pixelwise errors via the double-penalty effect [88].
- Classical SD variability and bias pitfalls (VALUE intercomparison). In a 50+ method cross-validation over Europe, several linear-regression SD variants showed very large precipitation biases—sometimes worse than raw model outputs—while some MOS techniques systematically under- or over-estimated variability (e.g., ISI-MIP under, DBS over), underscoring that method class alone does not guarantee robustness [4].
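The fine-scale power deficits noted in the GAN case study are usually diagnosed with a radially averaged power spectrum. The following is a minimal sketch on a synthetic field; the 3×3 moving average is only a crude stand-in for an over-smooth prediction, and the function name and grid size are assumptions for illustration.

```python
import numpy as np

def radial_psd(field):
    """Radially averaged power spectrum of a 2-D field.

    Returns mean spectral power for integer wavenumbers 1..kmax,
    where kmax is half the smaller grid dimension.
    """
    field = np.asarray(field, dtype=float)
    ny, nx = field.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(field - field.mean())))**2

    # After fftshift the zero frequency sits at index n // 2 on each axis.
    ky = np.arange(ny) - ny // 2
    kx = np.arange(nx) - nx // 2
    kr = np.hypot(*np.meshgrid(ky, kx, indexing="ij")).round().astype(int)

    kmax = min(ny, nx) // 2
    sums = np.bincount(kr.ravel(), weights=power.ravel())
    counts = np.bincount(kr.ravel())
    return sums[1:kmax + 1] / counts[1:kmax + 1]

rng = np.random.default_rng(0)
truth = rng.normal(size=(64, 64))

# Periodic 3x3 moving average as a proxy for a blurred prediction.
smooth = sum(np.roll(np.roll(truth, i, axis=0), j, axis=1)
             for i in (-1, 0, 1) for j in (-1, 0, 1)) / 9.0

ratio = radial_psd(smooth) / radial_psd(truth)  # close to 1 at low k, small at high k
```

Plotting `ratio` against wavenumber shows at which scales a model loses variance, which is more informative than a single RMSE value.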
9.2. Physical Consistency and Interpretability
9.2.1. Ensuring Physically Plausible Outputs
- Cross-Model Generalization: Train the downscaling model on data from one climate data source (e.g., ERA5 reanalysis) and test its performance on an entirely different, unseen source (e.g., historical simulations from a CMIP6 model). This tests robustness to systematic biases and different statistical properties (covariate shift).
- Future Climate Extrapolation (PGW): Evaluate the trained model on a Pseudo-Global Warming (PGW) dataset. PGW experiments modify historical data to represent future warmed conditions, providing a controlled test of the model’s ability to extrapolate to novel climate states.
- Cross-Region Transfer: For models intended for broad applicability, train on one or more geographic regions and test on a held-out region with distinct climatological or topographical characteristics. This assesses the model’s ability to learn generalizable physical relationships rather than region-specific correlations.
- Covariate Shift Detection and Adaptation: Before applying the model, quantify the distributional shift between the training predictors and the target application predictors using a metric like Maximum Mean Discrepancy (MMD) or energy distance [122,123]. In climate settings, Wasserstein distance is also used for model/field comparison [99]. Try lightweight adaptations—target-domain re-normalization, spatially aware cross-validation, and (where feasible) drift-aware fine-tuning—and reassess performance; see general drift/adaptation guidance [109] and transferability caveats in downscaling [120].
- Physics-Informed Neural Networks (PINNs) and Constrained Learning:
- −
- Soft Constraints: This approach involves incorporating penalty terms into the model’s loss function that discourage violations of known physical laws. The total loss becomes a weighted sum of a data-fidelity term and a physics-based regularization term (e.g., $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \mathcal{L}_{\text{physics}}$, where $\lambda$ sets the weight of the physics penalty). Physics-informed loss functions have been explored to guide models towards more physically realistic solutions. While soft constraints can reduce the frequency and magnitude of physical violations, they may not eliminate them entirely and can introduce a trade-off between statistical accuracy and physical consistency [25].
- −
- Hard Constraints: These methods aim to strictly enforce physical laws by design, either by modifying the neural network architecture itself or by adding specialized output layers that ensure the predictions satisfy the constraints. Harder et al. [27] introduced additive, multiplicative, and softmax-based constraint layers that can guarantee, for example, mass conservation between low-resolution inputs and high-resolution outputs. Such hard-constrained approaches have been shown to not only ensure physical consistency but also, in some cases, improve predictive performance and generalization [27]. The rationale for PINNs includes reducing the dependency on large datasets and enhancing model robustness by ensuring physical consistency, especially in data-sparse regions or for out-of-sample predictions [26]. Recent work explores Attention-Enhanced Quantum PINNs (AQ-PINNs) for climate modeling applications like fluid dynamics, aiming for improved accuracy and computational efficiency [124].
- Hybrid Dynamical–Statistical Models: Another avenue is to combine the strengths of ML with traditional physics-based dynamical models (RCMs). This can involve using ML to emulate computationally expensive components of RCMs, to statistically post-process RCM outputs (e.g., for bias correction or further downscaling), or to develop hybrid frameworks where ML and dynamical components interact [8,18]. For example, “dynamical–generative downscaling” approaches combine an initial stage of dynamical downscaling with an RCM to an intermediate resolution, followed by a generative AI model (like a diffusion model) to further refine the resolution to the target scale. This leverages the physical consistency of RCMs and the efficiency and generative capabilities of AI [9]. Such hybrid models aim to achieve a balance between computational feasibility, physical realism, and statistical skill.
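To make the hard-constraint idea concrete, the sketch below rescales each block of a raw high-resolution output so that its mean equals the corresponding low-resolution cell. It follows the spirit of the multiplicative constraint layer described above for non-negative quantities but is not the implementation from [27]; the function name, shapes, and random inputs are assumptions for illustration.

```python
import numpy as np

def multiplicative_constraint(raw_hr, lr, factor):
    """Rescale a positive high-res field so block means match low-res cells.

    raw_hr: array of shape (ny * factor, nx * factor), strictly positive
    lr:     array of shape (ny, nx)
    factor: integer upsampling factor
    """
    ny, nx = lr.shape
    blocks = raw_hr.reshape(ny, factor, nx, factor)
    block_mean = blocks.mean(axis=(1, 3), keepdims=True)
    constrained = blocks * (lr[:, None, :, None] / block_mean)
    return constrained.reshape(ny * factor, nx * factor)

rng = np.random.default_rng(0)
low_res = rng.uniform(1.0, 5.0, size=(3, 4))        # e.g. coarse precipitation
raw_output = rng.uniform(0.1, 10.0, size=(6, 8))    # unconstrained prediction

high_res = multiplicative_constraint(raw_output, low_res, factor=2)
recovered = high_res.reshape(3, 2, 4, 2).mean(axis=(1, 3))
```

In a network, the same operation would sit as the final differentiable layer, so the constraint holds for every prediction rather than only on average.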
9.2.2. Explainable AI (XAI): Unmasking the “Black Box”
The Need for Interpretability
- Model validation and debugging: Identifying which input features the model relies on helps determine whether it learned scientifically meaningful relationships or is exploiting spurious correlations/artifacts in the training data—a shortcut-learning failure mode where models can appear “right for the wrong reasons”.
- Scientific discovery: Highlighting unexpected learned relationships may reveal new insights into climate processes.
- Building trust: Models whose decision logic aligns with physical understanding are more likely to be trusted by domain scientists and policymakers.
- Identifying biases: XAI can expose hidden biases in the model or in the training data.
Common XAI Techniques Applied to Downscaling
- Saliency maps and feature attribution: Integrated Gradients, DeepLIFT, and Layer-Wise Relevance Propagation (LRP) attribute an output (e.g., a high-resolution pixel) back to input features (e.g., coarse predictor fields), revealing influential regions or variables [8]. González-Abad et al. [65] proposed aggregated saliency maps for CNN-based downscaling and found models may rely on spurious teleconnections or ignore important physical predictors. LRP has also been adapted for climate semantic-segmentation tasks (e.g., tropical cyclone and atmospheric river detection) to test whether CNNs use physically plausible input patterns [125].
Challenges in XAI for Climate Downscaling
- Faithfulness and Plausibility: Ensuring that explanations truly reflect the model’s internal decision-making process (faithfulness) and are consistent with physical understanding (plausibility) is challenging [129]. Different XAI methods can yield different, sometimes conflicting, explanations for the same prediction [130].
- Relating Attributions to Physical Processes: While methods like integrated gradients are mathematically sound, the resulting attribution maps can be difficult to directly relate to specific, understandable physical processes or mechanisms.
- Standardization: Methodologies and reporting standards for XAI in climate downscaling remain inconsistent, making comparisons across studies difficult. There is also no consensus on benchmark metrics, which hinders systematic evaluation [129].
- Beyond Post Hoc Explanations: Current XAI often provides post hoc explanations. There is a growing call to move towards building inherently interpretable models or to integrate interpretability considerations into the model design process itself, drawing lessons from how dynamical climate models are understood at a component level. This involves striving for “component-level understanding” where model behaviors can be attributed to specific architectural components or learned representations.
9.3. Representation of Extreme Events
9.3.1. The Challenge
- Data Imbalance: Extreme events are rare by definition, leading to their under-representation in training datasets—an issue long recognized in extreme value analysis [98]. Models optimized to minimize average error across all data points may thus prioritize fitting common, non-extreme values, effectively “smoothing over” or underestimating extremes. In precipitation downscaling, tail-aware training (e.g., quantile losses) has been used precisely to counter this tendency [36]; empirical studies also note that standard DL architectures can underestimate heavy precipitation and smooth spatial variability in extremes [37,120].
- Loss Function Bias: MSE loss, for example, penalizes large errors quadratically, which might seem beneficial for extremes. However, because extremes are infrequent, their contribution to the total loss can be small, and the model may learn to predict values closer to the mean to minimize overall MSE, thereby underpredicting the magnitude of extremes. This regression-to-the-mean behavior under quadratic criteria is well documented in hydrologic error decompositions [131]; tail-focused alternatives such as quantile (pinball) losses offer a direct mitigation [36].
- Failure to Capture Compound Extremes: Models may also struggle to capture the co-occurrence of multiple extreme conditions (e.g., concurrent heat and drought), which requires learning cross-variable dependence structures. Reviews of compound events highlight the prevalence and impacts of such co-occurrences and the difficulty for standard single-target pipelines to reproduce them [132,133]; see also evidence on changing risks of concurrent heat–drought in the U.S. [134].
9.3.2. Specialized Approaches for Extremes
- Tailored Loss Functions: Using loss functions that give more weight to extreme values or are specifically designed for tail distributions. Examples include the following:
- −
- Quantile Regression: Quantile Regression (QR) offers a powerful approach by directly modeling specific quantiles of a variable’s conditional distribution, which inherently allows for a detailed focus on the distribution’s tails and thus on extreme values. For instance, Quantile Regression Neural Networks (QRNNs), as implemented by Cannon [36], provide a flexible, nonparametric, and non-linear method. This approach avoids restrictive assumptions about the data’s underlying distribution shape, a significant advantage for complex climate variables like precipitation where parametric forms are often inadequate. A key feature of the QRNN presented is its ability to handle mixed discrete-continuous variables, such as precipitation amounts (which include zero values alongside a skewed distribution of positive amounts). This is achieved through censored quantile regression, making the model adept at representing both the occurrence and varying intensities of precipitation, including extremes. Cannon [36] notes this was the first implementation of a censored quantile regression model that is non-linear in its parameters. Furthermore, the methodology allows for the full predictive probability density function (pdf) to be derived from the set of modeled quantiles. This enables more comprehensive probabilistic assessments, such as estimating arbitrary prediction intervals, calculating exceedance probabilities for critical thresholds (i.e., performing extreme value analysis), and evaluating risks associated with different outcomes. To enhance model robustness and mitigate overfitting, especially when data for extremes might be sparse, Cannon [36] incorporates techniques like weight penalty regularization and bootstrap aggregation (bagging). The practical relevance to downscaling is demonstrated through an application to a precipitation downscaling task, where the QRNN model showed improved skill over linear quantile regression and climatological forecasts.
Importantly, the paper also suggests that QRNNs could be a “viable alternative to parametric ANN models for non-stationary extremes”, a crucial consideration for climate change impact studies where the characteristics of extreme events are expected to evolve. The Quantile-Regression-Ensemble (QRE) algorithm trains members on distinct subsets of precipitation observations corresponding to specific intensity levels, showing improved accuracy for extreme precipitation [121].
- −
- Bernoulli-Gamma or Tweedie Distributions: For precipitation, which has a mixed discrete-continuous distribution (zero vs. non-zero amounts, and varying intensity), loss functions based on these distributions (e.g., minimizing Negative Log-Likelihood—NLL) can better model both occurrence and intensity, including extremes [121].
- −
- Dice Loss and Focal Loss: These are explored for handling sample imbalance in heavy precipitation forecasts, with Dice Loss showing similarity to threat scores and effectively suppressing false alarms while improving hits for heavy precipitation [84].
- Generative Models (GANs and Diffusion Models): These models, by learning the underlying data distribution, can be better at generating realistic extreme events compared to deterministic regression models [1]. Diffusion models, in particular, have shown promise in capturing the fine spatial features of extreme precipitation and reproducing intensity distributions more accurately than GANs or CNNs [135].
- Data Augmentation: Techniques to artificially increase the representation of extreme events in the training data, as used in the SRDRN model [21].
- Architectural Modifications: Model architectures or components can be designed specifically to handle extremes, such as the gradient-guided attention model for discontinuous precipitation by Xiang et al. [70] or multi-scale gradient processing in GANs. Beyond tailored loss functions and data augmentation, architectural choices within generative frameworks and other advanced models are pivotal for addressing the severe class imbalance inherent in extreme events and for capturing their unique characteristics. For instance, some GAN variants, such as evtGAN, integrate Extreme Value Theory to better model the tails of distributions associated with rare events. Other improvements, like the multi-scale gradients in MSG-GAN-SD, aim for more stable training dynamics, a general challenge in GANs [48,114]. Diffusion models, noted for their stable training and ability to capture fine spatial details of extremes such as precipitation [18], may also be inherently better at representing multimodal distributions and tail behavior because of their iterative refinement process, which could make them less prone to the averaging effects that cause simpler architectures to underestimate extremes. Similarly, attention mechanisms in Transformers, if appropriately designed, could learn to focus on subtle precursors or localized features indicative of rare, high-impact events, thereby complementing specialized loss functions. Effectively tackling extreme events thus requires a holistic approach in which the model architecture itself can learn and represent the complex, often subtle, features that characterize these rare phenomena, rather than relying solely on adjustments to the loss function or data handling.
- Extreme Value Theory (EVT) Integration: Combining ML with EVT provides a statistical framework for modeling the tails of distributions. For instance, evtGAN combines GANs with EVT to model spatial dependencies in temperature and precipitation extremes [114]. Models using the Generalized Pareto Distribution (GPD) for tails can incorporate covariates from climate models to improve estimates [98].
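The quantile (pinball) loss that underlies the quantile-regression approaches above is short enough to state in full. The sketch below shows its asymmetry and checks numerically that the constant minimizing it is the corresponding empirical quantile. The synthetic Gamma-distributed "precipitation" sample and the candidate grid are illustrative assumptions.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Mean quantile (pinball) loss for quantile level tau in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

# For tau = 0.9, under-prediction costs nine times more than over-prediction.
under = pinball_loss([1.0], [0.0], tau=0.9)  # 0.9
over = pinball_loss([0.0], [1.0], tau=0.9)   # 0.1

# Skewed synthetic sample standing in for wet-day precipitation amounts.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=0.5, scale=4.0, size=5000)

# The constant prediction minimizing the loss approximates the 95th percentile.
candidates = np.linspace(0.0, 30.0, 301)
losses = [pinball_loss(sample, c, tau=0.95) for c in candidates]
best = candidates[int(np.argmin(losses))]
```

Training one output per quantile level with this loss yields the set of conditional quantiles from which prediction intervals and exceedance probabilities can be derived.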
9.4. Uncertainty Quantification (UQ)
Sources of Uncertainty
- Aleatoric Uncertainty: Represents inherent randomness or noise in the data and the process being modeled (e.g., unpredictable small-scale atmospheric fluctuations).
- Epistemic Uncertainty: Arises from limitations in model knowledge, including model structure, parameter choices, and limited training data. This uncertainty is, in principle, reducible with more data or better models.
- Scenario Uncertainty: Uncertainty in future greenhouse gas emissions and other anthropogenic forcings.
- GCM Uncertainty: Structural differences among GCMs lead to a spread in projections even for the same scenario.
- Downscaling Model Uncertainty: The statistical downscaling model itself introduces uncertainty.
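The aleatoric/epistemic distinction can be illustrated with a small sketch. Assume an ensemble in which each member predicts a Gaussian mean and variance for one grid cell. Under the law of total variance, the average predicted variance is commonly read as the aleatoric part and the spread of the member means as the epistemic part. The function name and numbers are hypothetical, and scenario and GCM uncertainty are not covered by this decomposition.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(member_means, member_variances):
    """Split predictive variance of a Gaussian-mixture ensemble.

    aleatoric: mean of the per-member predicted variances.
    epistemic: (population) variance of the per-member means.
    """
    aleatoric = fmean(member_variances)
    epistemic = pvariance(member_means)
    return {
        "mean": fmean(member_means),
        "aleatoric": aleatoric,
        "epistemic": epistemic,
        "total": aleatoric + epistemic,
    }


# Hypothetical three-member ensemble for a single grid cell.
parts = decompose_uncertainty([2.0, 4.0, 6.0], [1.0, 1.0, 1.0])
```

A large epistemic share flags cells where members disagree, which is where more data or better models could, in principle, help.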
UQ Approaches in ML Downscaling
- Ensemble Methods:
- Deep Ensembles: Training multiple instances of the same DL model with different random initializations (and potentially variations in training data via bootstrap sampling) and then combining their predictions to estimate both the mean and the spread (uncertainty) [69,136]. DeepESD [10] is an example of a CNN ensemble framework that quantifies inter-model spread from multiple GCM inputs and internal model variability. Deep ensembles can improve UQ, especially for future periods, by providing confidence intervals [69]. The number of ensemble members that best improves both the mean prediction and the uncertainty estimate is often reported to be around 3–6 [69].
- Multi-Model Ensembles (MMEs): Applying a downscaling model to outputs from multiple GCMs to capture inter-GCM uncertainty.
- Bayesian Neural Networks (BNNs): These models learn a probability distribution over their weights, rather than point estimates. By sampling from this posterior distribution, BNNs can provide probabilistic predictions that inherently quantify both aleatoric and epistemic uncertainty [137]. Techniques like Monte Carlo dropout are often used as a practical approximation to Bayesian inference in deep networks [137]. Bayesian AIG-Transformer and Precipitation CNN (PCNN) are examples of models incorporating these techniques for downscaling wind and precipitation [136,138]. Strengths: Provide a principled way to decompose uncertainty into aleatoric and epistemic components. Weaknesses: Can be computationally more expensive to train and sample from compared to deterministic models or simple ensembles.
- Generative Models for Probabilistic Output: GANs and Diffusion Models can, in principle, learn the conditional probability distribution and generate multiple plausible high-resolution realizations for a given low-resolution input, thus providing a form of ensemble for UQ. Diffusion models, in particular, are noted for their ability to model complex distributions effectively [1].
- Quantile Regression: As mentioned for extremes, models that predict quantiles of the distribution (e.g., Quantile Regression Neural Networks [36]) directly provide information about the range of possible outcomes.
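For the quantile-regression option above, the training objective is typically the pinball (quantile) loss. The sketch below shows that loss in plain Python; the function name is an assumption, and a real Quantile Regression Neural Network would apply the same formula inside its deep learning framework.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile level tau in (0, 1).

    Under-prediction is weighted by tau and over-prediction by (1 - tau),
    so the minimizer is the tau-quantile of the target distribution.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


# For tau = 0.9, missing low costs far more than missing high.
under = pinball_loss([10.0], [8.0], tau=0.9)
over = pinball_loss([10.0], [12.0], tau=0.9)
```

The asymmetry is what makes high quantiles useful for extremes: a model trained at a high quantile level is penalized much more for under-predicting heavy events than for over-predicting them.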
Challenges in UQ
- Computational Cost: Probabilistic methods like BNNs and large ensembles can be computationally intensive.
- Validation of Uncertainty: Validating the reliability of uncertainty estimates, especially for future projections where ground truth is unavailable, is a significant challenge. Pseudo-reality experiments are often used for this [69].
- Communication of Uncertainty: Effectively communicating complex, multi-faceted uncertainty information to end-users and policymakers is crucial but non-trivial.
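Where a reference does exist (a historical hold-out period or a pseudo-reality experiment), the reliability of uncertainty estimates can be checked with simple scores. The sketch below shows two common ones: the ensemble CRPS estimator and the empirical coverage of prediction intervals. Function names and numbers are illustrative assumptions rather than part of any cited protocol.

```python
def crps_ensemble(members, obs):
    """CRPS of an ensemble forecast for one scalar observation.

    Uses the estimator mean|x_i - y| - 0.5 * mean|x_i - x_j|.
    Lower is better; for a single member it reduces to absolute error.
    """
    m = len(members)
    accuracy = sum(abs(x - obs) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


def interval_coverage(lower, upper, obs):
    """Fraction of observations inside their predicted [lower, upper]."""
    inside = sum(1 for lo, hi, y in zip(lower, upper, obs) if lo <= y <= hi)
    return inside / len(obs)


score = crps_ensemble([0.0, 2.0], 1.0)
# A nominal 90% interval should cover roughly 90% of held-out values.
coverage = interval_coverage([0, 0, 0, 0], [1, 1, 1, 1], [0.5, 2.0, 0.1, 1.0])
```

Coverage well below the nominal level points to overconfident intervals, while coverage well above it points to intervals that are too wide to be informative.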
9.5. Reproducibility, Data Handling, and Methodological Rigor
- Reproducibility: Ensuring that research findings can be independently verified is a cornerstone of scientific progress. In ML-based downscaling, this involves the following:
- Public Code and Data: Sharing model code, training data (or clear pointers to standard datasets), and pre-trained model weights [8].
- Containerization and Deterministic Environments: Using tools like Docker to create reproducible software environments and ensuring deterministic operations in model training and inference where possible [139].
- Well-Defined Train/Test Splits and Evaluation Protocols: Clearly documenting how data are split for training, validation, and testing, and using standardized evaluation protocols (like VALUE [107]) to facilitate fair comparisons across studies.
- Baselines. The seven-method study by Vandal et al. [1] justifies using strong linear/bias-correction baselines (BCSD, Elastic-Net, hybrid BC + ML) alongside modern DL.
- Spectral/structure metrics. Following Harris et al. [90] and Annau et al. [88], include power spectra/structure functions, fraction skill scores, and spatial-coherence diagnostics to detect texture hallucinations and scale mismatch.
- Uncertainty metrics. For probabilistic models (GANs/VAEs/diffusion), report CRPS, reliability diagrams/PIT, and quantile/interval coverage (as in [36,90]).
- Tail-aware metrics. Report quantile-oriented scores (e.g., QVSS), return-level/return-period consistency, and extreme-event FSS where relevant (cf. [115]).
- Explicitly include warming/OOD tests (e.g., pseudo-global-warming or future-slice validation). Rampal et al. [19] show intensity-aware losses and residual two-stage designs can improve robustness for extremes under warming.
- Active Frontiers: As noted in recent papers (e.g., Quesada-Chacón et al. [8]), while reproducibility advances are being made through such efforts, consistent adoption of best practices across the community is still needed to ensure the robustness and verifiability of research findings.
- Data Handling Issues:
- Collinearity: High correlation among predictors (e.g., physically related fields such as temperature, humidity, pressure, and winds) can inflate coefficient variance, destabilize feature attribution, and make sensitivity analyses misleading. Practical diagnostics include (i) pairwise correlation heatmaps, (ii) variance inflation factors (VIF), and (iii) condition indices; however, VIF thresholds should be interpreted in context rather than treated as universal cutoffs [140,141]. Mitigations include predictor grouping (ablate correlated predictors as a set), regularization, dimensionality reduction (e.g., PCA/PLS), and domain-informed variable selection [128,140].
- Feature Ablation and Importance Checks: To ensure predictors are genuinely contributing (and not acting as proxies for correlated fields), use structured ablations (drop-one/drop-group), permutation importance, and stability checks across regimes (e.g., wet vs. dry seasons, coastal vs. inland subregions). Report performance deltas (e.g., RMSE, FSS, tailMAE) rather than only “top features”, and interpret importance jointly with collinearity diagnostics [128].
- Sensitivity to Random Initialization (Seed Variance): Neural downscalers can exhibit non-trivial run-to-run variability because of random initialization, data shuffling, and nondeterministic GPU kernels. To avoid over-interpreting single-run results, report the distribution of scores across multiple seeds (e.g., mean ± std over S runs) and, when feasible, select models based on expected validation performance rather than “best seed luck” [142,143]. When probabilistic inference is needed, approaches such as MC-dropout can reflect parameter uncertainty at test time [137].
- Suppressor Variables: Suppressors are predictors that may appear marginally weak (or even negatively correlated) yet improve performance when combined with other predictors by removing irrelevant variance or resolving confounding. A practical diagnostic is a sign-flip check: compare marginal association (univariate) versus partial association (multivariate) and flag predictors whose effect changes sign or whose coefficient magnitude increases markedly after controlling for correlated variables. Use hierarchical modeling/ablation to verify that improvements persist out-of-sample [144].
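The sign-flip check and the VIF diagnostic can be sketched for the simplest case of two predictors, where both have closed forms in terms of Pearson correlations. This is a minimal illustration under that two-predictor assumption; the function names and the toy data are hypothetical, and with more predictors a full regression fit would be needed.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


def sign_flip_check(x1, x2, y):
    """Compare marginal and partial association of x1 with y.

    partial is the standardized regression coefficient of x1 when x2 is
    also in the model; vif is the variance inflation factor shared by
    the two predictors.
    """
    r1, r2, r12 = pearson(x1, y), pearson(x2, y), pearson(x1, x2)
    partial = (r1 - r2 * r12) / (1.0 - r12 ** 2)
    return {
        "marginal": r1,
        "partial": partial,
        "sign_flip": r1 * partial < 0,
        "vif": 1.0 / (1.0 - r12 ** 2),
    }


# Toy data: y = 2 * x2 - x1 with x1 and x2 strongly correlated.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5]
y = [2 * b - a for a, b in zip(x1, x2)]
report = sign_flip_check(x1, x2, y)
```

In this toy case x1 correlates positively with y on its own, but its partial effect is negative once x2 is controlled for, and the VIF is well above 10. That is the pattern the diagnostic is meant to flag before feature attributions are interpreted.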
- Methodological Rigor in Evaluation:
- Beyond Standard Metrics: While RMSE is a common metric, it may not capture all relevant aspects of downscaling performance, especially for spatial patterns, temporal coherence, or extreme events. A broader suite of metrics is needed, including the following:
- Spatial correlation, structural similarity index (SSIM) [46].
- Metrics for extremes (e.g., precision, recall, critical success index for precipitation thresholds; metrics from Extreme Value Theory like GPD parameters or return levels) [8].
- Metrics for distributional similarity (e.g., Earth Mover’s Distance, Kullback–Leibler divergence) [145].
- Metrics for temporal coherence and spatial consistency (e.g., spectral analysis, variogram analysis, or specific metrics like Restoration Rate and Consistency Degree from TemDeep [6]). The Modified Kling–Gupta Efficiency (KGE) decomposes performance into correlation, bias, and variability [131,146].
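As a concrete instance of the decomposition mentioned for the modified KGE, the sketch below computes its three components: correlation r, bias ratio beta (ratio of means), and variability ratio gamma (ratio of coefficients of variation). The function name is an assumption. Because beta and gamma divide by the means, this form suits strictly positive variables such as precipitation totals better than quantities whose mean can be near zero.

```python
import math
from statistics import fmean, pstdev


def modified_kge(sim, obs):
    """Modified Kling-Gupta Efficiency and its three components."""
    mu_s, mu_o = fmean(sim), fmean(obs)
    sd_s, sd_o = pstdev(sim), pstdev(obs)
    cov = fmean([(s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)])
    r = cov / (sd_s * sd_o)                   # correlation
    beta = mu_s / mu_o                        # bias ratio
    gamma = (sd_s / mu_s) / (sd_o / mu_o)     # variability (CV) ratio
    kge = 1.0 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
    return {"kge": kge, "r": r, "beta": beta, "gamma": gamma}


obs = [1.0, 2.0, 3.0, 4.0]
perfect = modified_kge(obs, obs)
# Doubling every value keeps r and gamma at 1 but gives beta = 2.
doubled = modified_kge([2 * v for v in obs], obs)
```

Reporting the components alongside the aggregate score shows whether a low KGE comes from timing errors, mean bias, or damped variability.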
- Out-of-Sample Validation: A key pitfall in ML-based downscaling is that standard random k-fold cross-validation is generally invalid for spatio-temporal climate data, because spatial and temporal autocorrelation causes information leakage between train and test folds. This leakage can substantially inflate apparent skill and produce overly optimistic uncertainty estimates. Therefore, we recommend the following minimum standard: (i) never use purely random folds for gridded fields or station networks with spatial dependence; (ii) use blocked (spatial and/or temporal) splits that reflect the intended deployment setting; and (iii) explicitly report the split design (block definition, block size, any buffers, number of folds, and whether the test fold is out-of-time and/or out-of-region). These practices are well supported in the broader spatio-temporal modeling literature [147,148]. Practical validation options include the following:
- Spatial blocked k-fold CV: Partition the domain into contiguous spatial blocks and hold out entire blocks for testing. Blocks should be larger than the effective spatial dependence scale of the target variable/covariates to reduce leakage. This is the default choice for gridded downscaling evaluation when the deployment involves new locations within the same climate regime [147,148].
- Buffered spatial CV: Add a buffer zone around the held-out test region and exclude training samples within that buffer to further minimize leakage because of spatial proximity. As a practical guideline, the buffer distance should be on the order of the estimated decorrelation length (or larger) for the relevant fields [147,149].
- Temporal blocked CV/forward-chaining: Hold out contiguous time periods (e.g., the most recent years) rather than random days/months. This is essential when models will be applied across time or under evolving climate conditions; it tests robustness to temporal regime shifts and reduces leakage from serial dependence [147].
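The split designs above can be sketched in a few lines. The example assumes a regular grid indexed by (row, column), square blocks assigned to folds in round-robin order, a buffer measured in grid cells, and yearly time steps. All function names and sizes are hypothetical, and in practice block and buffer sizes should come from estimated decorrelation lengths.

```python
def blocked_folds(n_rows, n_cols, block, k):
    """Assign every grid cell to a fold via contiguous block x block tiles."""
    blocks_per_row = -(-n_cols // block)  # ceiling division
    return {
        (i, j): ((i // block) * blocks_per_row + (j // block)) % k
        for i in range(n_rows)
        for j in range(n_cols)
    }


def buffered_split(fold_of, test_fold, buffer):
    """Return (train, test) cells, dropping training cells near the test set.

    A training cell is excluded when any test cell lies within `buffer`
    grid cells of it (Chebyshev distance). buffer=0 gives plain blocked CV.
    """
    test = {cell for cell, fold in fold_of.items() if fold == test_fold}
    offsets = range(-buffer, buffer + 1)

    def near_test(cell):
        i, j = cell
        return any((i + di, j + dj) in test for di in offsets for dj in offsets)

    train = [c for c in fold_of if c not in test and not near_test(c)]
    return sorted(train), sorted(test)


def forward_chaining(years, min_train, horizon):
    """Yield (train_years, test_years) with the test block after training."""
    years = sorted(years)
    for end in range(min_train, len(years) - horizon + 1, horizon):
        yield years[:end], years[end:end + horizon]


folds = blocked_folds(4, 4, block=2, k=4)
train, test = buffered_split(folds, test_fold=0, buffer=1)
time_splits = list(forward_chaining(range(2000, 2010), min_train=4, horizon=3))
```

The buffer reduces the training set, so reporting how skill changes with buffer size helps separate genuine generalization from proximity leakage.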
Challenges in Benchmarking and Inter-Comparison
| Downscaling Configuration/Claim | Main Risk | Recommended Validation Design |
|---|---|---|
| Station-based or sparse-site PP downscaling (generalize to unseen sites) | Spatial leakage across nearby stations | Leave-Location-Out (LLO): hold out entire stations/regions; report site-wise skill distributions and worst-case sites [147]. |
| Gridded-field downscaling within same climate regime (new areas inside domain) | Nearby pixels leak into test folds | Spatial blocked k-fold: hold out contiguous blocks; prefer fewer, larger blocks. If dependence is strong, add buffered CV [148]. |
| Need strict spatial independence (high autocorrelation; sharp gradients like coast/orography) | Inflated skill from proximity | Buffered spatial CV: exclude training data within a buffer distance of test blocks; choose the buffer distance from estimated decorrelation length(s), and report sensitivity to buffer choice [147]. |
| Historical → future projection (non-stationarity claim) | Train/test mismatch over time; concept drift | Temporal blocking/forward-chaining: train on earlier period(s), test on later period(s). For projection claims, include a warm test as a minimum requirement, and when feasible include scenario-stress tests such as PGW [147,150]. |
| Cross-GCM deployment (apply to a different driving GCM) | Cross-model distribution shift | Leave-One-GCM-Out (LOGO): train on multiple GCMs, hold out one GCM for testing; report cross-GCM variance and failure cases (recommendation aligns with domain-shift evaluation principles) [147]. |
| Future + cross-GCM (strongest generalization claim) | Compound OOD shift | Combine LOGO with out-of-time splits (and PGW/scenario stress tests where feasible): “unseen GCM” and “unseen future period” simultaneously; treat this as the most conservative validation design. |
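The most conservative design in the table, an unseen GCM combined with an unseen future period, can be expressed as a simple split generator. The sketch assumes each sample is tagged with its driving GCM and year; the record layout, the GCM labels, and the cutoff year are hypothetical placeholders.

```python
def logo_out_of_time(samples, cutoff_year):
    """Yield (held_out_gcm, train, test) for combined LOGO + out-of-time CV.

    Training uses only other GCMs up to cutoff_year; testing uses only the
    held-out GCM after cutoff_year, so the test set is unseen in both
    model and time.
    """
    for held_out in sorted({s["gcm"] for s in samples}):
        train = [s for s in samples
                 if s["gcm"] != held_out and s["year"] <= cutoff_year]
        test = [s for s in samples
                if s["gcm"] == held_out and s["year"] > cutoff_year]
        yield held_out, train, test


# Placeholder sample index; real samples would also carry the data fields.
samples = [{"gcm": g, "year": y}
           for g in ("GCM-A", "GCM-B", "GCM-C")
           for y in (1990, 2000, 2050, 2090)]
splits = list(logo_out_of_time(samples, cutoff_year=2014))
```

Reporting skill per held-out GCM, rather than only the average, exposes the cross-GCM variance and failure cases that the table asks for.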
10. Future Trajectories: Grand Challenges and Open Questions
10.1. Grand Challenge 1: Overcoming Non-Stationarity
- Foundation Models: Large, pretrained backbones learned from massive, diverse earth-system data (e.g., multiple GCMs or reanalyses) that provide broad, reusable priors; usable zero-/few-shot or with fine-tuning [16].
- Domain Adaptation and Transfer Learning: Methods to adapt models from a source to a target distribution (e.g., historical → future, reanalysis → GCM, region A → B), including fine-tuning FMs or smaller models and explicit shift-handling techniques [17].
- Rigorous OOD Testing: Systematically using Pseudo-Global Warming (PGW) experiments and holding out entire GCMs or future time periods for validation to stress-test and quantify extrapolation capabilities [19].
- How can we formally verify that a model has learned a causal physical process rather than a spurious shortcut?
- What are the theoretical limits of generalization for a given model architecture and training data diversity?
- Can online learning systems be developed to allow models to adapt continuously as the climate evolves, mitigating concept drift in near-real-time applications?
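The predictor-perturbation step of a PGW-style stress test can be sketched as follows: a monthly climate-change delta (future minus historical climatology from a GCM) is added to historical predictor values. This is a simplified illustration with hypothetical function names and values. It uses additive deltas, as is common for temperature, and it covers only the input side; the matching high-resolution reference would have to come from a dynamical model run under the same perturbation.

```python
from collections import defaultdict
from statistics import fmean


def monthly_climatology(records):
    """Average (month, value) records into a {month: mean} mapping."""
    by_month = defaultdict(list)
    for month, value in records:
        by_month[month].append(value)
    return {month: fmean(values) for month, values in by_month.items()}


def pgw_perturb(historical, gcm_hist, gcm_future):
    """Shift historical records by the GCM's monthly warming signal."""
    clim_hist = monthly_climatology(gcm_hist)
    clim_future = monthly_climatology(gcm_future)
    delta = {m: clim_future[m] - clim_hist[m] for m in clim_hist}
    return [(month, value + delta[month]) for month, value in historical]


# Hypothetical near-surface temperatures (K) for January and July.
warmed = pgw_perturb(
    historical=[(1, 270.0), (7, 290.0)],
    gcm_hist=[(1, 271.0), (1, 273.0), (7, 291.0)],
    gcm_future=[(1, 275.0), (7, 296.0)],
)
```

Evaluating a trained downscaler on such perturbed inputs shows how its skill degrades as predictors move outside the range seen during training.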
10.2. Grand Challenge 2: Achieving Verifiable Physical Consistency
- Hybrid Dynamical-Statistical Models: Frameworks like dynamical–generative downscaling leverage a physical model to provide a consistent foundation, which an ML model then refines. This approach strategically outsources the enforcement of complex physics to a trusted dynamical core [9].
- How can we design computationally tractable physical loss terms for complex, non-differentiable processes like cloud microphysics or radiative transfer?
- What is the optimal trade-off between the flexibility of soft constraints and the guarantees of hard constraints for multi-variable downscaling?
- Can we develop methods to automatically discover relevant physical constraints from data, rather than relying solely on pre-defined equations?
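The soft-versus-hard trade-off can be made concrete with a mass-conservation example: the mean of the high-resolution cells inside one coarse cell should equal the coarse value. The sketch below contrasts a soft penalty, which only discourages violations, with a hard multiplicative rescaling, which enforces the constraint exactly. Names and numbers are illustrative assumptions, and the rescaling presumes a non-negative variable such as precipitation.

```python
from statistics import fmean


def soft_penalty(fine, coarse_value, weight=1.0):
    """Soft constraint: squared mismatch added to the training loss."""
    return weight * (fmean(fine) - coarse_value) ** 2


def hard_rescale(fine, coarse_value):
    """Hard constraint: rescale fine cells so their mean matches exactly."""
    current = fmean(fine)
    if current == 0.0:
        # No pattern to preserve; spread the coarse value uniformly.
        return [coarse_value] * len(fine)
    return [value * coarse_value / current for value in fine]


raw = [1.0, 2.0, 3.0]          # unconstrained model output for one coarse cell
violation = soft_penalty(raw, coarse_value=4.0)
fixed = hard_rescale(raw, coarse_value=4.0)
```

The soft term leaves the model free to trade the constraint against data fit, whereas the hard layer guarantees conservation but fixes how the correction is distributed across cells.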
10.3. Grand Challenge 3: Reliable and Interpretable Uncertainty Quantification (UQ)
- How can we effectively validate UQ for far-future projections where no ground truth exists?
- How can we decompose total uncertainty into its constituent sources in a computationally efficient manner for deep learning models?
- How can we best communicate complex, multi-dimensional uncertainty information to non-expert stakeholders to support robust decision-making?
10.4. Grand Challenge 4: Skillful Prediction of Climate Extremes
- Integration with Extreme Value Theory (EVT): Hybrid models that combine ML with statistical EVT offer a principled way to model the extreme tails of climate distributions [114].
- How do we ensure that generative models produce extremes that are not only statistically realistic but also physically plausible in their spatio-temporal evolution?
- How will the statistics of compound extremes (e.g., concurrent heat and drought) change, and can ML models capture their evolving joint probabilities?
- Can we develop models that explicitly predict changes in the parameters of EVT distributions (e.g., GPD parameters) as a function of large-scale climate drivers?
10.5. Benchmarkable Objectives for Measuring Progress
10.6. Current State Assessment
- Performance Paradox: ML models, particularly deep learning architectures like CNNs, U-Nets, and GANs, often demonstrate excellent performance on in-sample test data or when downscaling historical reanalysis products. They excel at learning complex spatial patterns and non-linear relationships, leading to visually compelling high-resolution outputs. However, this strong in-sample performance frequently does not translate to robust extrapolation on out-of-distribution data (e.g., future climate scenarios from different GCMs or entirely new regions)—a critical limitation given that downscaling is intended to inform future projections.
- Trust Deficit: The limited transparency of many deep learning models, together with historically sparse uncertainty quantification, constrains end-user confidence and practical uptake. Without clear reasoning traces and robust uncertainty estimates, the utility of ML-downscaled products for decision-making remains limited.
- Physical Inconsistency: Many current ML downscaling methods do not inherently enforce fundamental physical laws (e.g., conservation of mass/energy, thermodynamic constraints). Resulting fields can be statistically plausible yet physically unrealistic, undermining scientific interpretability and downstream use.
- Challenges with Extreme Events: Accurately capturing the frequency, intensity, and spatial characteristics of extremes remains difficult. Class imbalance and commonly used loss functions tend to underestimate magnitudes and misplace patterns of high-impact events; specialized targets, data curation, and evaluation for extremes are required.
- Data Limitations and Methodological Gaps: The scarcity of high-quality, high-resolution reference data in many regions, together with inconsistent metrics, validation protocols, and reporting standards, impedes apples-to-apples comparison and cumulative progress. Recent work emphasizes that computational repeatability is essential for building trust and enabling rigorous comparison across methods [8].
Answers to the Research Questions
10.7. Priority Research Directions
- Robust Extrapolation and Generalization Frameworks (Addressing RQ2):
- Systematic Evaluation Protocols: Develop and adopt standardized protocols and benchmark datasets specifically designed to test model transferability across different climate states (historical, near-future, far-future), different GCMs/RCMs, and diverse geographical regions. This includes rigorous out-of-sample testing beyond simple hold-out validation.
- Metrics for Generalization: Establish and use metrics that explicitly quantify generalization and extrapolation capability, rather than relying solely on traditional skill scores computed on in-distribution test data.
- Understanding Failure Modes: Conduct systematic analyses of why and when ML models fail to extrapolate, linking failures to model architecture, training data characteristics, or violations of physical assumptions.
- Physics–ML Integration and Hybrid Modeling Standards (Addressing RQ2):
- Standardized PIML Approaches: Develop and disseminate standardized methods and libraries for incorporating physical constraints (both hard and soft) into common ML architectures used for downscaling. This includes guidance on formulating physics-based loss terms and designing constraint-aware layers.
- Validation Suites for Physical Consistency: Create benchmark validation suites that explicitly test for adherence to key physical laws (e.g., conservation principles, thermodynamic consistency, realistic spatial gradients and inter-variable relationships).
- Advancing Hybrid Models: Foster research into hybrid models that effectively combine the strengths of process-based dynamical models with the efficiency and pattern-recognition capabilities of ML, including RCM emulators and generative AI approaches for refining RCM outputs (Tomasi et al. [31]).
- Operational Uncertainty Quantification (Addressing RQ3):
- Beyond Point Estimates: Shift the focus from deterministic (single-value) predictions to probabilistic projections that provide a comprehensive assessment of uncertainty.
- Efficient UQ Methods: Develop and promote computationally efficient UQ methods suitable for high-dimensional DL models, such as scalable deep ensembles, practical Bayesian deep learning techniques (e.g., with improved variational inference or MC dropout strategies), and generative models capable of producing reliable ensembles [32,69,137].
- Decomposition and Attribution of Uncertainty: Advance methods to decompose total uncertainty into its constituent sources (e.g., GCM uncertainty, downscaling model uncertainty, internal variability) and attribute uncertainty to specific model components or assumptions.
- User-Oriented Uncertainty Communication: Develop effective tools and protocols for communicating complex uncertainty information to diverse end-users in an understandable and actionable manner.
- Explainable and Interpretable Climate AI (Addressing RQ3). Why this interacts with other challenges: Explainability is not merely a usability feature; it is a diagnostic tool for transferability, physical consistency, UQ, and extremes. In particular, XAI can help detect when a model’s apparent skill is driven by spurious correlates that may fail under non-stationarity or cross-domain deployment, and it can reveal whether predicted extreme events are triggered by physically meaningful drivers. Explanations are most decision-relevant when paired with calibrated uncertainty, enabling stakeholders to interpret not only what the model relies on but also how confident it is in regimes where data support is limited.
- Domain-Specific XAI Metrics: Establish XAI metrics and methodologies that are specifically relevant to climate science, moving beyond generic XAI techniques to those that can provide physically meaningful insights.
- Linking ML Decisions to Physical Processes: Develop XAI techniques that can causally link ML model decisions and internal representations to known climate processes and drivers, rather than just highlighting input feature importance [129].
- Standards for Model Documentation and Interpretation: Promote standards for documenting ML model architectures, training procedures, and the results of interpretability analyses to enhance transparency and facilitate critical assessment by the broader scientific community [129].
- Community Infrastructure and Benchmarking (Addressing all RQs). Why benchmarking must be coupled: Benchmarking is the mechanism that makes progress on the other challenges measurable and comparable. Critically, benchmark suites should jointly evaluate (i) accuracy under structured OOD tests (time/domain/region), (ii) physical-consistency diagnostics, (iii) probabilistic verification for uncertainty-aware models, and (iv) tail-/event-focused extreme metrics. Evaluating any one dimension in isolation can hide trade-offs (e.g., improved mean RMSE with degraded extremes, or sharper predictions with miscalibrated uncertainty). Therefore, shared benchmarks should report a compact “scorecard” across these linked dimensions to reflect the true operational readiness of a downscaling method.
- Shared Evaluation Frameworks: Expand and support the community-driven evaluation frameworks (e.g., extending the VALUE initiative [107]) to facilitate systematic intercomparison of ML downscaling methods using standardized datasets and metrics.
- Reproducible Benchmark Datasets: Curate and maintain open, high-quality benchmark datasets specifically designed for training and evaluating ML downscaling models across various regions, variables, and climate conditions. These should include data for testing transferability and extreme event representation.
- Open-Source Implementations: Encourage and support the development and dissemination of open-source software implementations of key ML downscaling methods and PIML components to lower the barrier to entry and promote reproducibility.
- Collaborative Platforms: Foster collaborative platforms and initiatives (e.g., CORDEX Task Forces on ML [62]) for sharing knowledge, best practices, model components, and downscaled datasets.
11. Ethical Consideration, Responsible Development, and Governance in ML-Based Climate Downscaling
11.1. Ethical Foundations for ML Downscaling: From Technical Failure Modes to Normative Obligations
- Reliability under shift (do no harm through invalid extrapolation): Anticipate non-stationarity and explicitly stress-test transferability (Section 9.1).
- Transparency and contestability (make limitations legible): Document data provenance, modeling assumptions, and known failure regimes; pair explanations with uncertainty to avoid false confidence (Section 9.4 and Section 9.5).
- Equity and inclusion (avoid systematically disadvantaging the vulnerable): Treat uneven observations and capacity as a first-order risk, not a nuisance variable (Section 6.2).

11.2. Algorithmic Bias, Fairness, and Equity
11.2.1. Auditable Fairness Metrics for Climate-Relevant Strata
11.2.2. Why This Matters for Downscaling (Tied to Prior Sections)
11.2.3. Practitioner Checklist
- Report data coverage maps and per-region sample counts used in training and validation (ties to Section 6.5); stratify metrics by data-rich vs. data-scarce subregions.
- Use shift-robust training/evaluation (Section 9.1): e.g., held-out regions, time-split OOD tests, and stress tests on atypical synoptic regimes.
- Track extreme-aware metrics (Section 9.3): CSI/POD at multi-thresholds, tail-MAE, quantile errors, and bias of return levels.
- Quantify epistemic uncertainty (Section 9.4) and suppress overconfident deployment in regions with low training support; communicate abstentions or wide intervals as a feature, not a bug.
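The extreme-aware item in this checklist can be illustrated with a short sketch that computes POD and CSI at several thresholds from paired predictions and observations. The function name, thresholds, and values are hypothetical; for a fairness audit the same function would be applied separately to data-rich and data-scarce subregions.

```python
def categorical_scores(pred, obs, threshold):
    """POD and CSI for exceedance of one threshold."""
    hits = misses = false_alarms = 0
    for p, o in zip(pred, obs):
        if o >= threshold and p >= threshold:
            hits += 1
        elif o >= threshold:
            misses += 1
        elif p >= threshold:
            false_alarms += 1
    observed = hits + misses
    events = hits + misses + false_alarms
    return {
        # Undefined (NaN) when the event never occurs in this sample.
        "pod": hits / observed if observed else float("nan"),
        "csi": hits / events if events else float("nan"),
    }


# Hypothetical daily precipitation (mm) at five grid cells.
pred = [0.0, 5.0, 12.0, 30.0, 25.0]
obs = [0.0, 12.0, 11.0, 28.0, 5.0]
scores = {t: categorical_scores(pred, obs, t) for t in (10.0, 20.0)}
```

Reporting these scores per threshold and per stratum makes it visible when skill for heavy events is concentrated in well-observed regions.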
11.2.4. Mini-Case (Sparse Data → Biased Extremes → Policy Risk)
11.3. Transparency, Accountability, and Liability
11.3.1. Minimum Transparency Artifacts (Auditable)
11.3.2. Make Transparency Operational
11.4. Accessibility, Inclusivity, and the Digital Divide
11.4.1. Tie to Data-Scarce Regions and Model Bias
11.4.2. Actionable Steps
- Publish downscaling baselines and trained weights under permissive licenses; provide lightweight inference paths for low-resource agencies.
- Release region-stratified diagnostics and data coverage artifacts so local stakeholders can judge fitness-for-use.
- Prioritize augmentation/sampling schemes that upweight underrepresented regimes and seasons, with ablation evidence (links to Section 7). Augmentations must be limited to transformations under which the climate fields remain physically valid.
11.5. Misinterpretation, Misuse, and Communication of Uncertainty
11.5.1. Quantitative Uncertainty Requirements (Calibration and Skill)
11.5.2. Communicating Uncertainty for Decisions
11.6. Data Governance, Privacy, and Ownership
Downscaling-Specific Governance
11.7. The Need for Governance Frameworks and Best Practices
11.7.1. An Auditable Governance Workflow for ML Downscaling
11.7.2. Suggested Roles (Minimum)
- Data Steward: Documents observational provenance/coverage and approves preprocessing and updates.
- Model Developer: Trains the model and produces reproducible artifacts (code, configs, seeds, environment).
- Independent Evaluator/Auditor: Runs prespecified benchmarks (including OOD tests) and signs off on release.
- Service Operator: Deploys, monitors drift, and manages incident response and user communication.
- Decision Stakeholders: Define acceptable risk thresholds and fitness-for-purpose criteria.
11.7.3. A Minimal, Testable Governance Bundle (Linked to Prior Sections)
12. Future Outlook: The Next Decade of ML in Climate Downscaling
12.1. Emerging Paradigms
12.1.1. Foundation Models for Climate Downscaling
- Potential Benefits:
- Enhanced Transfer Learning: These models could provide powerful, pre-trained representations of atmospheric and Earth system dynamics, enabling effective transfer learning to specific downscaling tasks across various regions, variables, and GCMs with significantly reduced data requirements for fine-tuning [61].
- Multi-Task Capabilities: Foundation models can be designed for multiple downstream tasks, including forecasting, downscaling, and parameterization learning, offering a versatile tool for climate modeling.
- Implicit Physical Knowledge: Through pre-training on vast datasets governed by physical laws, these models might implicitly learn and encode some degree of physical consistency, although explicit PIML techniques will likely still be necessary to guarantee it.
- Challenges: Developing and training these massive models requires substantial computational resources and curated large-scale datasets. Ensuring their generalizability and avoiding the propagation of biases learned during pre-training are also critical research areas.
12.1.2. Hybrid Hierarchical and Multi-Scale Approaches
- Global ML models or foundation models providing coarse, bias-corrected boundary conditions.
- Regional physics-informed ML models or RCM emulators operating at intermediate scales, incorporating more detailed regional physics.
- Local stochastic generators or specialized ML models (e.g., for extreme events or specific microclimates) providing the final layer of high-resolution detail and variability.
12.1.3. Online Learning and Continuous Model Adaptation
- Benefits: This could help mitigate the stationarity assumption by allowing models to learn changing relationships over time and improve their performance for ongoing or near-real-time downscaling applications.
- Challenges: Ensuring model stability, avoiding catastrophic forgetting (where learning new data degrades performance on old data), and managing the computational demands of continuous retraining are significant hurdles.
12.1.4. Deep Integration of Causal Inference and Process Understanding
12.2. Critical Success Factors
- Interdisciplinary Collaboration: Sustained and deep collaboration between climate scientists, ML researchers, statisticians, computational scientists, and domain experts from impact sectors is essential. Climate scientists bring crucial domain knowledge about physical processes and data characteristics, while ML experts provide algorithmic innovation.
- Open Science Practices: The continued adoption of open science principles—including the sharing of code, datasets, model weights, and standardized evaluation frameworks—is vital for reproducibility, transparency, and accelerating collective progress [8]. Initiatives like CORDEX and CMIP6, which foster data sharing and model intercomparison, provide valuable models for the ML downscaling community [182,183].
- Deep Stakeholder Engagement and Co-production Throughout the Lifecycle: Stakeholder engagement and co-design deserve elevated emphasis. They are not merely desirable but essential, and should be integrated throughout the entire research, development, and deployment lifecycle of ML-based downscaling tools and climate services. Moving beyond consultation, true co-production involves iterative, sustained processes of relationship building, shared understanding, and joint output development with end-users and affected communities. Actively involving end-users from diverse sectors (e.g., agriculture, water resource management, urban planning, public health, indigenous communities) from the very outset of ML downscaling projects offers profound benefits [165]:
- Ensuring Relevance and Actionability: Co-production helps ensure that ML downscaling efforts are targeted towards producing genuinely useful, context-specific, and actionable information that meets the actual needs of decision-makers rather than being solely technology-driven.
- Defining User-Relevant Evaluation Metrics: Collaboration with users can help define evaluation metrics and performance targets that reflect their specific decision contexts and thresholds of concern, moving beyond purely statistical measures to those that indicate practical utility.
- Building Trust and Facilitating Uptake: A transparent, demand-driven, and participatory development process fosters trust in the ML models and their outputs. When users are part of the creation process, they gain a better understanding of the model’s capabilities and limitations, which facilitates the responsible uptake and integration of ML-derived products into their decision-making frameworks.
- Addressing the “Trust Deficit”: Co-production directly addresses the “trust deficit” by enabling a two-way dialogue in which the complexities, uncertainties, and assumptions inherent in ML downscaling are openly discussed and understood by both developers and users, leading to more realistic expectations and appropriate applications.
- Incorporating Local and Indigenous Knowledge: Participatory approaches can facilitate the integration of valuable local and indigenous knowledge systems with scientific data, leading to more holistic and effective adaptation strategies [184].
13. Conclusions
- Methodological Maturation: The field has progressed from simple super-resolution CNNs to sophisticated generative architectures (GANs, diffusion models) that effectively address the over-smoothing of fine-scale texture seen in earlier deterministic models.
- The Physics Gap: Physical consistency remains the primary challenge. Purely data-driven models frequently violate conservation laws, necessitating the adoption of Physics-Informed Machine Learning (PIML) frameworks.
- Generalization Risk: Non-stationarity under climate change poses a fundamental risk to statistical validity. Robust stress-testing using “perfect model” frameworks is essential to quantify extrapolation errors.
- Operational Requirements: Widespread adoption requires moving beyond RMSE. Operational confidence demands robust Uncertainty Quantification (UQ) and Explainable AI (XAI) to transparently communicate limitations to decision-makers.
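Two of the points above, hard physical constraints and evaluation beyond RMSE, can be illustrated with a minimal sketch. It assumes a non-negative 2-D field (e.g., precipitation) on a regular grid whose dimensions divide evenly by the downscaling factor. The function names `block_mean`, `enforce_mean_conservation`, and `ensemble_crps` are hypothetical and are not taken from any of the cited works. The multiplicative rescaling is only in the spirit of hard-constraint layers, and the CRPS is the standard empirical ensemble estimator for a single scalar observation.

```python
import numpy as np


def block_mean(hr, factor):
    """Average a 2-D field over non-overlapping factor x factor blocks."""
    h, w = hr.shape
    return hr.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def enforce_mean_conservation(hr, lr, factor, eps=1e-12):
    """Rescale each block of a non-negative high-resolution field so that
    its block mean equals the corresponding coarse-grid value (assumption:
    the coarse value represents the area mean of the fine cells)."""
    scale = lr / np.maximum(block_mean(hr, factor), eps)
    # Repeat each coarse-cell scale factor over its factor x factor block.
    return hr * np.kron(scale, np.ones((factor, factor)))


def ensemble_crps(members, obs):
    """Empirical CRPS for one scalar observation:
    mean|X - y| - 0.5 * mean|X - X'| over ensemble members X, X'."""
    members = np.asarray(members, dtype=float)
    spread = np.abs(members[:, None] - members[None, :]).mean()
    return np.abs(members - obs).mean() - 0.5 * spread


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coarse = rng.random((3, 4))            # synthetic low-resolution input
    raw = rng.random((12, 16)) + 0.1       # synthetic unconstrained model output
    fixed = enforce_mean_conservation(raw, coarse, factor=4)
    # Largest remaining violation of the block-mean constraint.
    print(np.abs(block_mean(fixed, 4) - coarse).max())
    print(ensemble_crps([0.0, 2.0], 1.0))
```

For a one-member ensemble the CRPS reduces to the absolute error, which makes it a natural probabilistic generalization of deterministic error measures. The rescaling step guarantees the block-mean constraint by construction but says nothing about other physical relationships, such as energy balance or inter-variable consistency.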
13.1. Quantitative Synthesis and Implications for Practice
13.2. Roadmap: Minimum Validation Standards for Climate Services
14. Future Scope and Outstanding Challenges
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Vandal, T.; Kodra, E.; Ganguly, A.R. Intercomparison of machine learning methods for statistical downscaling: The case of daily and extreme precipitation. Theor. Appl. Climatol. 2019, 137, 557–576.
- Rampal, N.; Hobeichi, S.; Gibson, P.B.; Baño-Medina, J.; Abramowitz, G.; Beucler, T.; González-Abad, J.; Chapman, W.; Harder, P.; Gutiérrez, J.M. Enhancing regional climate downscaling through advances in machine learning. Artif. Intell. Earth Syst. 2024, 3, 230066.
- Maraun, D.; Wetterhall, F.; Ireson, A.M.; Chandler, R.E.; Kendon, E.J.; Widmann, M.; Brienen, S.; Rust, H.W.; Sauter, T.; Themeßl, M.; et al. Precipitation downscaling under climate change: Recent developments to bridge the gap between dynamical models and the end user. Rev. Geophys. 2010, 48, RG3003.
- Gutiérrez, J.M.; Maraun, D.; Widmann, M.; Huth, R.; Hertig, E.; Benestad, R.; Roessler, O.; Wibig, J.; Wilcke, R.; Kotlarski, S.; et al. An intercomparison of a large ensemble of statistical downscaling methods over Europe: Results from the VALUE perfect predictor cross-validation experiment. Int. J. Climatol. 2019, 39, 3750–3785.
- Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
- Wang, L.; Li, Q.; Lv, Q.; Peng, X.; You, W. TemDeep: A self-supervised framework for temporal downscaling of atmospheric fields at arbitrary time resolutions. Geosci. Model Dev. 2025, 18, 2427–2442.
- Lanzante, J.R.; Dixon, K.W.; Nath, M.J.; Whitlock, C.E.; Adams-Smith, D. Some Pitfalls in Statistical Downscaling of Future Climate. Bull. Am. Meteorol. Soc. 2018, 99, 791–803.
- Quesada-Chacón, D.; Stöger, J.; Güntner, A.; Bernhofer, C. Repeatable high-resolution statistical downscaling through deep learning. Geosci. Model Dev. 2022, 15, 7353–7370.
- Lopez-Gomez, I.; Wan, Z.Y.; Zepeda-Núñez, L.; Schneider, T.; Anderson, J.; Sha, F. Dynamical-generative downscaling of climate model ensembles. Proc. Natl. Acad. Sci. USA 2025, 122, e2420288122.
- Baño Medina, J.; Manzanas, R.; Cimadevilla, E.; Fernández, J.; González-Abad, J.; Cofiño, A.S.; Gutiérrez, J.M. Downscaling multi-model climate projection ensembles with deep learning (DeepESD): Contribution to CORDEX EUR-44. Geosci. Model Dev. 2022, 15, 6747–6758.
- Baño-Medina, J.; Manzanas, R.; Gutiérrez, J.M. Configuration and intercomparison of deep learning neural models for statistical downscaling. Geosci. Model Dev. 2020, 13, 2109–2124.
- Pall, P.; Allen, M.; Stone, D.A. Testing the Clausius–Clapeyron constraint on changes in extreme precipitation under CO2 warming. Clim. Dyn. 2007, 28, 351–363.
- Vandal, T.; Kodra, E.; Ganguly, S.; Michaelis, A.; Nemani, R.; Ganguly, A.R. DeepSD: Generating High-Resolution Climate Change Projections Through Single Image Super-Resolution. arXiv 2017, arXiv:1703.03126.
- Maraun, D.; Shepherd, T.G.; Widmann, M.; Zappa, G.; Walton, D.; Gutiérrez, J.M.; Hagemann, S.; Richter, I.; Soares, P.M.; Hall, A.; et al. Towards process-informed bias correction of climate change simulations. Nat. Clim. Change 2017, 7, 764–773.
- Curran, D.; Saleem, H.; Hobeichi, S.; Salim, F.D. Resolution-Agnostic Transformer-based Climate Downscaling. arXiv 2024, arXiv:2411.14774.
- Schmude, J.; Roy, S.; Trojak, W.; Jakubik, J.; Civitarese, D.S.; Singh, S.; Kuehnert, J.; Ankur, K.; Gupta, A.; Phillips, C.E.; et al. Prithvi WxC: Foundation model for weather and climate. arXiv 2024, arXiv:2409.13598.
- Prasad, A.; Harder, P.; Yang, Q.; Sattigeri, P.; Szwarcman, D.; Watson, C.; Rolnick, D. Evaluating the transferability potential of deep learning models for climate downscaling. arXiv 2024, arXiv:2407.12517.
- Legasa, M.N.; Manzanas, R.; Gutiérrez, J.M. Assessing Three Perfect Prognosis Methods for Statistical Downscaling of Climate Change Precipitation Scenarios. Geophys. Res. Lett. 2023, 50, e2022GL102525.
- Rampal, N.; Gibson, P.B.; Sherwood, S.; Abramowitz, G. On the extrapolation of generative adversarial networks for downscaling precipitation extremes in warmer climates. Geophys. Res. Lett. 2024, 51, e2024GL112492.
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
- Wang, F.; Tian, D.; Lowe, L.; Kalin, L.; Lehrter, J. Deep Learning for Daily Precipitation and Temperature Downscaling. Water Resour. Res. 2021, 57, e2020WR029308.
- Quesada-Chacón, D.; Stöger, J.; Güntner, A.; Bernhofer, C. Downscaling CORDEX Through Deep Learning to Daily 1 km Multivariate Ensemble in Complex Terrain. Earth’s Future 2023, 11, e2023EF003531.
- Pathak, J.; Subramanian, S.; Harrington, P.; Raja, S.; Chattopadhyay, A.; Mardani, M.; Kurth, T.; Hall, D.; Li, Z.; Azizzadenesheli, K.; et al. FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators. arXiv 2022, arXiv:2202.11214.
- Kumar, R.; Sharma, T.; Vaghela, V.; Jha, S.K.; Agarwal, A. PrecipFormer: Efficient Transformer for Precipitation Downscaling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Tucson, AZ, USA, 26 February–6 March 2025; pp. 489–497.
- Beucler, T.; Rasp, S.; Pritchard, M.; Gentine, P. Achieving conservation of energy in neural network emulators for climate modeling. arXiv 2019, arXiv:1906.06622.
- Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
- Harder, P.; Jha, S.; Rolnick, D. Hard-Constrained Deep Learning for Climate Downscaling. J. Mach. Learn. Res. 2022, 23, 1–38.
- Leinonen, J.; Nerini, D.; Berne, A. Stochastic Super-Resolution for Downscaling Time-Evolving Atmospheric Fields with a GAN. In Proceedings of the ECML/PKDD Workshop on ClimAI, Virtual, 14–18 September 2020.
- Price, I.; Rasp, S. Increasing the Accuracy and Resolution of Precipitation Forecasts Using Deep Generative Models. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 28–30 March 2022.
- Rampal, N.; Gibson, P.B.; Sood, A.; Stuart, S.; Fauchereau, N.C.; Brandolino, C.; Noll, B.; Meyers, T. High-resolution downscaling with interpretable deep learning: Rainfall extremes over New Zealand. Weather Clim. Extrem. 2022, 38, 100525.
- Tomasi, E.; Franch, G.; Cristoforetti, M. Can AI be enabled to perform dynamical downscaling? A latent diffusion model to mimic kilometer-scale COSMO5.0_CLM9 simulations. Geosci. Model Dev. 2025, 18, 2051–2078.
- Srivastava, P.; El Helou, A.; Vilalta, R.; Li, H.W.; Kumar, V.; Mandt, S. Precipitation Downscaling with Spatiotemporal Video Diffusion. In Advances in Neural Information Processing Systems 37, Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024; Curran Associates, Inc.: San Jose, CA, USA, 2024; pp. 19327–19340.
- Liu, Y.; Doss-Gollin, J.; Balakrishnan, G.; Veeraraghavan, A. Generative Precipitation Downscaling using Score-based Diffusion with Wasserstein Regularization. arXiv 2024, arXiv:2410.00381.
- Tripathi, S.; Srinivas, V.V.; Nanjundiah, R.S. Downscaling of precipitation for climate change scenarios: A support vector machine approach. J. Hydrol. 2006, 330, 621–640.
- He, X.; Chaney, N.W.; Schleiss, M.; Sheffield, J. Spatial downscaling of precipitation using adaptable random forests. Water Resour. Res. 2016, 52, 8217–8237.
- Cannon, A.J. Quantile regression neural networks: Implementation in R and application to precipitation downscaling. Comput. Geosci. 2011, 37, 1277–1284.
- Vaughan, A.; Adamson, H.; Tak-Chu, L.; Turner, R.E. Convolutional conditional neural processes for local climate downscaling. arXiv 2021, arXiv:2101.07857.
- Baño-Medina, J.; Manzanas, R.; Gutiérrez, J.M. On the suitability of deep convolutional neural networks for continental-wide downscaling of climate change projections. Clim. Dyn. 2021, 57, 2941–2951.
- Soares, P.M.M.; Johannsen, F.; Lima, D.C.A.; Lemos, G.; Bento, V.A.; Bushenkova, A. High-resolution downscaling of CMIP6 Earth system and global climate models using deep learning for Iberia. Geosci. Model Dev. 2024, 17, 229–257.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
- Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Proceedings of the International Workshop, Granada, Spain, 20 September 2018; Springer: Cham, Switzerland, 2018; Volume 11045, pp. 3–11.
- Liu, J.; Shi, C.; Ge, L.; Tie, R.; Chen, X.; Zhou, T.; Gu, X.; Shen, Z. Enhanced Wind Field Spatial Downscaling Method Using UNET Architecture and Dual Cross-Attention Mechanism. Remote Sens. 2024, 16, 1867.
- Pasula, A.; Subramani, D.N. Global Climate Model Bias Correction Using Deep Learning. arXiv 2025, arXiv:2504.19145.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
- Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140.
- Iotti, M.; Davini, P.; von Hardenberg, J.; Zappa, G. RainScaleGAN: A Conditional Generative Adversarial Network for Rainfall Downscaling. Artif. Intell. Earth Syst. 2025, 4, e240074.
- Accarino, G.; De Rubeis, T.D.; Falcucci, G.; Ubaldi, E.; Aloisio, G. MSG-GAN-SD: A Multi-Scale Gradients GAN for Statistical Downscaling of 2-Meter Temperature over the EURO-CORDEX Domain. AI 2021, 2, 600–620.
- Glawion, L.; Polz, J.; Kunstmann, H.; Fersch, B.; Chwala, C. Global spatio-temporal ERA5 precipitation downscaling to km and sub-hourly scale using generative AI. npj Clim. Atmos. Sci. 2025, 8, 219.
- Kang, M.; Shin, J.; Park, J. StudioGAN: A taxonomy and benchmark of GANs for image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15725–15742.
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119.
- Stengel, K.A.; Glaws, A.; Hettinger, D.; King, R.N. Adversarial super-resolution of climatological wind and solar data. Proc. Natl. Acad. Sci. USA 2020, 117, 16805–16815.
- National Renewable Energy Laboratory. Sup3rCC: Super-Resolution for Renewable Energy Resource Data with Climate Change Impacts. Available online: https://www.nrel.gov/analysis/sup3rcc (accessed on 27 May 2025).
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695.
- Berry, L.; Brando, A.; Meger, D. Shedding light on large generative networks: Estimating epistemic uncertainty in diffusion models. In Proceedings of the 40th Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, 15–18 July 2024.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: San Jose, CA, USA, 2017; pp. 5998–6008.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- Zhong, X.; Du, F.; Chen, L.; Wang, Z.; Li, H. Investigating transformer-based models for spatial downscaling and correcting biases of near-surface temperature and wind-speed forecasts. Q. J. R. Meteorol. Soc. 2024, 150, 275–289.
- Sinha, S.; Benton, B.; Emami, P. On the effectiveness of neural operators at zero-shot weather downscaling. Environ. Data Sci. 2025, 4, e21.
- Wang, X.; Choi, J.Y.; Kurihana, T.; Lyngaas, I.; Yoon, H.J.; Fan, M.; Nafi, N.M.; Tsaris, A.; Aji, A.M.; Hossain, M.; et al. ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling. arXiv 2025, arXiv:2505.04802.
- Shi, J.; Shirali, A.; Jin, B.; Zhou, S.; Hu, W.; Rangaraj, R.; Wang, S.; Han, J.; Wang, Z.; Lall, U.; et al. Deep Learning and Foundation Models for Weather Prediction: A Survey. arXiv 2025, arXiv:2501.06907.
- Coordinated Regional Climate Downscaling Experiment (CORDEX). Task Force on Machine Learning. 2024. Describes Ongoing Task Force Activities. Last Website Update Noted as 2025. Available online: https://cordex.org/strategic-activities/taskforces/task-force-on-machine-learning/ (accessed on 26 May 2025).
- Hobeichi, S.; Nishant, N.; Shao, Y.; Abramowitz, G.; Pitman, A.; Sherwood, S.; Bishop, C.; Green, S. Using machine learning to cut the cost of dynamical downscaling. Earth’s Future 2023, 11, e2022EF003291.
- Ghosh, S. SVM-PGSL coupled approach for statistical downscaling to predict rainfall from GCM output. J. Geophys. Res. Atmos. 2010, 115.
- González-Abad, J.; Baño-Medina, J.; Gutiérrez, J.M. Using Explainability to Inform Statistical Downscaling Based on Deep Learning Beyond Standard Validation Approaches. J. Adv. Model. Earth Syst. 2023, 15, e2023MS003641.
- Daw, A.; Karpatne, A.; Watkins, W.D.; Read, J.S.; Kumar, V. Physics-guided neural networks (PGNN): An application in lake temperature modeling. In Knowledge Guided Machine Learning; Chapman and Hall/CRC: Boca Raton, FL, USA, 2022; pp. 353–372.
- Beucler, T.; Pritchard, M.; Rasp, S.; Ott, J.; Baldi, P.; Gentine, P. Enforcing analytic constraints in neural networks emulating physical systems. Phys. Rev. Lett. 2021, 126, 098302.
- Schuster, G.T.; Chen, Y.; Feng, S. Review of physics-informed machine-learning inversion of geophysical data. Geophysics 2024, 89, T337–T356.
- González-Abad, J.; Baño-Medina, J. Deep Ensembles to Improve Uncertainty Quantification of Statistical Downscaling Models under Climate Change Conditions. arXiv 2023, arXiv:2305.00975.
- Xiang, L.; Hu, P.; Wang, F.; Yu, J.; Zhang, L. A Novel Reference-Based and Gradient-Guided Deep Learning Model for Daily Precipitation Downscaling. Atmosphere 2022, 13, 511.
- Boateng, D.; Mutz, S.G. pyESDv1.0.1: An open-source Python framework for empirical-statistical downscaling of climate information. Geosci. Model Dev. 2023, 16, 6479–6514.
- Wang, Z.; Bugliaro, L.; Gierens, K.; Hegglin, M.I.; Rohs, S.; Petzold, A.; Kaufmann, S.; Voigt, C. Machine learning for improvement of upper tropospheric relative humidity in ERA5 weather model data. Atmos. Chem. Phys. 2025, 25, 2845–2861.
- Daly, C.; Halbleib, M.; Smith, J.I.; Gibson, W.P.; Doggett, M.K.; Taylor, G.H.; Curtis, J.; Pasteris, P.P. Physiographically sensitive mapping of climatological temperature and precipitation across the conterminous United States. Int. J. Climatol. 2008, 28, 2031–2064.
- Herrera, S.; Cardoso, R.M.; Soares, P.M.; Espírito-Santo, F.; Viterbo, P.; Gutiérrez, J.M. Iberia01: A new gridded dataset of daily precipitation and temperatures over Iberia. Earth Syst. Sci. Data 2019, 11, 1947–1971.
- Cornes, R.C.; van der Schrier, G.; van den Besselaar, E.J.M.; Jones, P.D. An Ensemble Version of the E-OBS Temperature and Precipitation Data Sets. J. Geophys. Res. Atmos. 2018, 123, 9391–9409.
- Technische Universität Dresden. Regionales Klimainformationssystem Sachsen (ReKIS). General Project Portal. A Summary Document “Climate_datasets_Zusammenfassung.pdf” Is Available from the Portal. 2023. Available online: https://rekis.hydro.tu-dresden.de/ (accessed on 26 May 2025).
- Huffman, G.J.; Bolvin, D.T.; Braithwaite, D.; Hsu, K.; Joyce, R.J.; Kidd, C.; Nelkin, E.J.; Sorooshian, S.; Tan, J.; Xie, P. Algorithm Theoretical Basis Document (ATBD), Version 06.3; Integrated Multi-satellitE Retrievals for GPM (IMERG) Algorithm Theoretical Basis Document (ATBD); Technical Report; NASA Goddard Space Flight Center: Washington, DC, USA, 2020. Available online: https://gpm.nasa.gov/resources/documents/algorithm-information/IMERG-V06-ATBD (accessed on 26 May 2025).
- Entekhabi, D.; Njoku, E.G.; O’Neill, P.E.; Kellogg, K.H.; Crow, W.T.; Edelstein, W.N.; Entin, J.K.; Goodman, S.D.; Jackson, T.J.; Johnson, J.T.; et al. The Soil Moisture Active Passive (SMAP) Mission. Proc. IEEE 2010, 98, 704–716.
- Pastorello, G.; Trotta, C.; Canfora, E.; Chu, H.; Christianson, D.; Cheah, Y.W.; Poindexter, C.; Chen, J.; Elbashandy, A.; Humphrey, M.; et al. The FLUXNET2015 dataset and the ONEFlux processing pipeline for eddy covariance data. Sci. Data 2020, 7, 225.
- Sishah, S.; Abrahem, T.; Azene, G.; Dessalew, A.; Hundera, H. Downscaling and validating SMAP soil moisture using a machine learning algorithm over the Awash River basin, Ethiopia. PLoS ONE 2023, 18, e0279895.
- Sha, Y.; Stull, R.; Ghafarian, P.; Ou, T.; Gultepe, I. Deep-Learning-Based Gridded Downscaling of Surface Meteorological Variables in Complex Terrain. Part I: Daily Maximum and Minimum 2-m Temperature. J. Appl. Meteorol. Climatol. 2020, 59, 2057–2073.
- Sarafanov, M.; Kazakov, E.; Nikitin, N.O.; Kalyuzhnaya, A.V. A Machine Learning Approach for Remote Sensing Data Gap-Filling with Open-Source Implementation: An Example Regarding Land Surface Temperature, Surface Albedo and NDVI. Remote Sens. 2020, 12, 3865.
- Huang, X. Evaluating Loss Functions and Learning Data Pre-Processing for Climate Downscaling Deep Learning Models. arXiv 2023, arXiv:2306.11144.
- Choi, H.; Kim, Y.; Kim, D. Enhancing Extreme Rainfall Nowcasting with Weighted Loss Functions in Deep Learning Models. EGU General Assembly 2025, EGU25-19416. Available online: https://meetingorganizer.copernicus.org/EGU25/EGU25-19416.html (accessed on 26 May 2025).
- Fallah, B.; Rakhshandehroo, G.R.; Berg, P.; Wulfmeyer, V.; Hattermann, F.F. Climate model downscaling in central Asia: A dynamical and a neural network approach. Geosci. Model Dev. 2025, 18, 161–180.
- Roberts, N.M.; Lean, H.W. Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Weather Rev. 2008, 136, 78–97.
- Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 2007, 102, 359–378.
- Annau, N.J.; Cannon, A.J.; Monahan, A.H. Algorithmic Hallucinations of Near-Surface Winds: Statistical Downscaling with GANs to Convection-Permitting Scales. Artif. Intell. Earth Syst. 2023, 2, e230015.
- Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: San Jose, CA, USA, 2017.
- Harris, L.; McRae, A.T.T.; Chantry, M.; Dueben, P.D.; Palmer, T.N. A Generative Deep Learning Approach to Stochastic Downscaling of Precipitation Forecasts. J. Adv. Model. Earth Syst. 2022, 14, e2022MS003120.
- Marzban, C.; Sandgathe, S. Verification with variograms. Weather Forecast. 2009, 24, 1102–1120.
- Davis, C.; Brown, B.; Bullock, R. Object-based verification of precipitation forecasts. Part I: Methodology and application to mesoscale rain areas. Mon. Weather Rev. 2006, 134, 1772–1784.
- Davis, C.; Brown, B.; Bullock, R. Object-based verification of precipitation forecasts. Part II: Application to convective rain systems. Mon. Weather Rev. 2006, 134, 1785–1795.
- Huth, R.; Kyselý, J.; Pokorná, L. A GCM simulation of heat waves, dry spells, and their relationships to circulation. Clim. Change 2000, 46, 29–60.
- Mendes, D.; Marengo, J.A. Temporal downscaling: A comparison between artificial neural network and autocorrelation techniques over the Amazon Basin in present and future climate change scenarios. Theor. Appl. Climatol. 2010, 100, 413–421.
- Zolina, O.; Simmer, C.; Belyaev, K.; Gulev, S.K.; Koltermann, P. Changes in the duration of European wet and dry spells during the last 60 years. J. Clim. 2013, 26, 2022–2047.
- Fall, C.M.N.; Lavaysse, C.; Drame, M.S.; Panthou, G.; Gaye, A.T. Wet and dry spells in Senegal: Comparison of detection based on satellite products, reanalysis, and in situ estimates. Nat. Hazards Earth Syst. Sci. 2021, 21, 1051–1069.
- Coles, S.G. An Introduction to Statistical Modeling of Extreme Values; Springer Series in Statistics; Springer: London, UK, 2001.
- Vissio, G.; Lembo, V.; Lucarini, V.; Ghil, M. Evaluating the performance of climate models based on Wasserstein distance. Geophys. Res. Lett. 2020, 47, e2020GL089385.
- Perkins, S.; Pitman, A.; Holbrook, N.J.; McAneney, J. Evaluation of the AR4 climate models’ simulated daily maximum temperature, minimum temperature, and precipitation over Australia using probability density functions. J. Clim. 2007, 20, 4356–4376.
- Sha, Y.; Stull, R.; Ghafarian, P.; Ou, T.; Gultepe, I. Deep-Learning-Based Gridded Downscaling of Surface Meteorological Variables in Complex Terrain. Part II: Daily Precipitation. J. Appl. Meteorol. Climatol. 2020, 59, 2075–2092.
- Wood, A.W.; Leung, L.R.; Sridhar, V.; Lettenmaier, D.P. Hydrologic implications of dynamical and statistical approaches to downscaling climate model outputs. Clim. Change 2004, 62, 189–216.
- Pierce, D.W.; Cayan, D.R.; Thrasher, B.L. Statistical downscaling using localized constructed analogs (LOCA). J. Hydrometeorol. 2014, 15, 2558–2585.
- Doblas-Reyes, F.J.; Sörensson, A.A.; Almazroui, M.; Dosio, A.; Gutowski, W.J.; Haarsma, R.; Hamdi, R.; Hewitson, B.; Kwon, W.-T.; Lamptey, B.L.; et al. Linking Global to Regional Climate Change. In Climate Change 2021: The Physical Science Basis; Contribution of Working Group I to the Sixth Assessment Report of the IPCC; Cambridge University Press: Cambridge, UK, 2021; pp. 1363–1512.
- Basile, S.; Crimmins, A.R.; Avery, C.W.; Hamlington, B.D.; Kunkel, K.E. Appendix 3. Scenarios and Datasets. In Fifth National Climate Assessment; USGCRP: Washington, DC, USA, 2023.
- Harilal, N.; Singh, M.; Bhatia, U. Augmented Convolutional LSTMs for Generation of High-Resolution Climate Change Projections. IEEE Access 2021, 9, 25208–25218.
- Maraun, D.; Widmann, M.; Gutiérrez, J.M.; Kotlarski, S.; Chandler, R.E.; Hertig, E.; Huth, R.; Wibig, J.; Wilcke, R.A.I.; Themeßl, M.J.; et al. VALUE: A framework to validate downscaling approaches for climate change studies. Earth’s Future 2015, 3, 1–14.
- Pérez, A.; Santa Cruz, M.; San Martín, D.; Gutiérrez, J.M. Transformer-based super-resolution downscaling for regional reanalysis: Full domain vs tiling approaches. arXiv 2024, arXiv:2410.12728.
- Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. CSUR 2014, 46, 44.
- Geirhos, R.; Jacobsen, J.H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2020, 2, 665–673.
- Cavaiola, M.; Tuju, P.E.; Mazzino, A. Accurate and efficient AI-assisted paradigm for adding granularity to ERA5 precipitation reanalysis. Sci. Rep. 2024, 14, 26158.
- Legasa, M.; Manzanas, R.; Calviño, A.; Gutiérrez, J.M. A posteriori random forests for stochastic downscaling of precipitation by predicting probability distributions. Water Resour. Res. 2022, 58, e2021WR030272.
- Baño-Medina, J. Understanding deep learning decisions in statistical downscaling models. In Proceedings of the 10th International Conference on Climate Informatics, Virtual, 22–25 September 2020; pp. 79–85.
- Boulaguiem, Y.; Zscheischler, J.; Vignotto, E.; van der Wiel, K.; Engelke, S. Modeling and simulating spatial extremes by combining extreme value theory with generative adversarial networks. Environ. Data Sci. 2022, 1, e5.
- Lee, J.; Park, S.Y. WGAN-GP-Based Conditional GAN with Extreme Critic for Precipitation Downscaling in a Key Agricultural Region of the Northeastern U.S. IEEE Access 2025, 13, 46030–46041.
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Advances in Neural Information Processing Systems 28, Proceedings of the 29th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Curran Associates, Inc.: San Jose, CA, USA, 2015; pp. 802–810.
- Miao, Q.; Liu, Y.; Liu, T.; Sorooshian, S. Improving Monsoon Precipitation Prediction Using Combined Convolutional and Long Short Term Memory Neural Network. Water 2019, 11, 977.
- Anh, D.T.; Bae, D.J.; Jung, K. Downscaling rainfall using deep learning LSTM and feedforward neural networks. Int. J. Climatol. 2019, 39, 2502–2518.
- Yang, F.; Ye, Q.; Wang, K.; Sun, L. Successful Precipitation Downscaling Through an Innovative Transformer-Based Model. Remote Sens. 2024, 16, 4292.
- Hernanz, A.; Rodriguez-Camino, E.; Navascués, B.; Gutiérrez, J.M. On the limitations of deep learning for statistical downscaling of climate change projections: The transferability and the extrapolation issues. Atmos. Sci. Lett. 2024, 25, e1195.
- Vandal, T.; Kodra, E.; Gosh, S.; Gunter, L.; Gonzalez, J.; Ganguly, A.R. Statistical downscaling of global climate models with image super-resolution and uncertainty quantification. arXiv 2018, arXiv:1811.03605.
- Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773.
- Székely, G.J.; Rizzo, M.L. Energy statistics: A class of statistics based on distances. J. Stat. Plan. Inference 2013, 143, 1249–1272.
- Dutta, S.; Innan, N.; Yahia, S.B.; Shafique, M. AQ-PINNs: Attention-Enhanced Quantum Physics-Informed Neural Networks for Carbon-Efficient Climate Modeling. arXiv 2024, arXiv:2409.01522.
- Radke, T.; Fuchs, S.; Wilms, C.; Polkova, I.; Rautenhaus, M. Explaining neural networks for detection of tropical cyclones and atmospheric rivers in gridded atmospheric simulation data. Geosci. Model Dev. 2025, 18, 1017–1039.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
- Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: San Jose, CA, USA, 2017; pp. 4765–4774.
- van Zyl, C.; Ye, X.; Naidoo, R. Harnessing eXplainable artificial intelligence for feature selection in time series energy forecasting: A comparative analysis of Grad-CAM and SHAP. Appl. Energy 2024, 353, 122079.
- O’Loughlin, R.J.; Li, D.; Neale, R.; O’Brien, T.A. Moving beyond post hoc explainable artificial intelligence: A perspective paper on lessons learned from dynamical climate modeling. Geosci. Model Dev. 2025, 18, 787–802.
- Mamalakis, A.; Barnes, E.A.; Ebert-Uphoff, I. Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience. Artif. Intell. Earth Syst. 2022, 1, e220012.
- Gupta, H.V.; Kling, H.; Yilmaz, K.K.; Martinez, G.F. Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. J. Hydrol. 2009, 377, 80–91.
- Zscheischler, J.; Westra, S.; Van Den Hurk, B.J.; Seneviratne, S.I.; Ward, P.J.; Pitman, A.; AghaKouchak, A.; Bresch, D.N.; Leonard, M.; Wahl, T.; et al. Future climate risk from compound events. Nat. Clim. Change 2018, 8, 469–477.
- Zscheischler, J.; Martius, O.; Westra, S.; Bevacqua, E.; Raymond, C.; Horton, R.M.; van den Hurk, B.; AghaKouchak, A.; Jézéquel, A.; Mahecha, M.D.; et al. A typology of compound weather and climate events. Nat. Rev. Earth Environ. 2020, 1, 333–347.
- Mazdiyasni, O.; AghaKouchak, A. Substantial increase in concurrent droughts and heatwaves in the United States. Proc. Natl. Acad. Sci. USA 2015, 112, 11484–11489.
- Addison, H.; Kendon, E.; Ravuri, S.; Aitchison, L.; Watson, P.A. Machine learning emulation of a local-scale UK climate model. arXiv 2022, arXiv:2211.16116.
- Gerges, F.; Boufadel, M.C.; Bou-Zeid, E.; Nassif, H.; Wang, J.T.L. A Novel Bayesian Deep Learning Approach to the Downscaling of Wind Speed with Uncertainty Quantification. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 26th Pacific-Asia Conference, PAKDD, Chengdu, China, 16–19 May 2022; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13281, pp. 55–66.
- Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; PMLR: Cambridge, MA, USA, 2016; Volume 48, pp. 1050–1059.
- Gerges, F.; Boufadel, M.C.; Bou-Zeid, E.; Nassif, H.; Wang, J.T.L. Bayesian Multi-Head Convolutional Neural Networks with Bahdanau Attention for Forecasting Daily Precipitation in Climate Change Monitoring. In Machine Learning and Knowledge Discovery in Databases, Proceedings of the European Conference, ECML PKDD 2022, Grenoble, France, 19–23 September 2022; Cerquitelli, T., Monreale, A., Mikut, R., Moccia, S., Raedt, L.D., Eds.; Proceedings, Part V; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13717, pp. 416–431.
- Merkel, D. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J. 2014, 2014, 2.
- Dormann, C.F.; Elith, J.; Bacher, S.; Buchmann, C.; Carl, G.; Carré, G.; Marquéz, J.R.G.; Gruber, B.; Lafourcade, B.; Leitão, P.J.; et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography 2013, 36, 27–46.
- O’Brien, R.M. A caution regarding rules of thumb for variance inflation factors. Qual. Quant. 2007, 41, 673–690.
- Reimers, N.; Gurevych, I. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. arXiv 2017, arXiv:1707.09861.
- Dodge, J.; Gururangan, S.; Card, D.; Schwartz, R.; Smith, N.A. Show your work: Improved reporting of experimental results. arXiv 2019, arXiv:1909.03004.
- Cohen, J.; Cohen, P.; West, S.G.; Aiken, L.S. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences; Routledge: Boca Raton, FL, USA, 2013.
- Düsterhus, A.; Hense, A. Advanced information criterion for environmental data quality assurance. Adv. Sci. Res. 2012, 8, 99–104.
- Kling, H.; Fuchs, M.; Paulin, M. Runoff conditions in the upper Danube basin under an ensemble of climate change scenarios. J. Hydrol. 2012, 424–425, 264–277.
- Roberts, D.R.; Bahn, V.; Ciuti, S.; Boyce, M.S.; Elith, J.; Guillera-Arroita, G.; Hauenstein, S.; Lahoz-Monfort, J.J.; Schröder, B.; Thuiller, W.; et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 2017, 40, 913–929.
- Valavi, R.; Elith, J.; Lahoz-Monfort, J.J.; Guillera-Arroita, G. blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. bioRxiv 2018, 357798.
- Mahoney, M.J.; Johnson, L.K.; Silge, J.; Frick, H.; Kuhn, M.; Beier, C.M. Assessing the performance of spatial cross-validation approaches for models of spatially structured data. arXiv 2023, arXiv:2303.07334.
- Brogli, R.; Heim, C.; Mensch, J.; Sørland, S.L.; Schär, C. The pseudo-global-warming (PGW) approach: Methodology, software package PGW4ERA5 v1.1, validation, and sensitivity analyses. Geosci. Model Dev. 2023, 16, 907–926.
- Climate Change AI. Data Gaps (Beta). Available online: https://www.climatechange.ai/dev/datagaps (accessed on 27 May 2025).
- World Climate Research Programme. WCRP Grand Challenges (Ended in 2022). Official Community Theme Summary Page. 2022. Available online: https://www.wcrp-climate.org/component/content/category/26-grand-challenges (accessed on 13 August 2025).
- Hersbach, H. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather Forecast. 2000, 15, 559–570.
- Brier, G.W. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 1950, 78, 1–3.
- European High-Level Expert Group on Artificial Intelligence. Ethics Guidelines for Trustworthy AI; European Commission, Digital Strategy: Brussels, Belgium, 2019.
- National Institute of Standards and Technology (NIST). Artificial Intelligence Risk Management Framework (AI RMF 1.0). 2023. Available online: https://nvlpubs.nist.gov/nistpubs/ai/nist.ai (accessed on 27 May 2025).
- Adams, P.; Hewitson, B.; Vaughan, C.; Wilby, R.; Zebiak, S.; Eitland, E.; Shumake-Guillemot, J. Call for an ethical framework for climate services. WMO Bull. 2015, 64, 51–54.
- Mastrandrea, M.D.; Field, C.B.; Stocker, T.F.; Edenhofer, O.; Ebi, K.L.; Frame, D.J.; Held, H.; Kriegler, E.; Mach, K.J.; Matschoss, P.R.; et al. Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties; IPCC: Geneva, Switzerland, 2010.
- CORDEX. CORDEX Experiment Design for Dynamical Downscaling of CMIP6. Available online: https://cordex.org/wp-content/uploads/2021/02/CORDEX-CMIP6_exp_design_draft_SOD_ln.pdf (accessed on 1 June 2025).
- Sørland, S.L.; Schär, C.; Lüthi, D.; Kjellström, E. Bias patterns and climate change signals in GCM-RCM model chains. Environ. Res. Lett. 2018, 13, 074017.
- Diez-Sierra, J.; Iturbide, M.; Gutiérrez, J.M.; Fernández, J.; Milovac, J.; Cofiño, A.S.; Cimadevilla, E.; Nikulin, G.; Levavasseur, G.; Kjellström, E.; et al. The worldwide C3S CORDEX grand ensemble: A major contribution to assess regional climate change in the IPCC AR6 Atlas. Bull. Am. Meteorol. Soc. 2022, 103, E2804–E2826.
- Hawkins, E.; Sutton, R. The potential to narrow uncertainty in regional climate predictions. Bull. Am. Meteorol. Soc. 2009, 90, 1095–1108.
- Hawkins, E.; Sutton, R. The potential to narrow uncertainty in projections of regional precipitation change. Clim. Dyn. 2011, 37, 407–418.
- Bhardwaj, T. Climate Justice Hangs in the Balance Will AI Divide or Unite the Planet. Available online: https://www.downtoearth.org.in/climate-change/climate-justice-hangs-in-the-balance-will-ai-divide-or-unite-the-planet (accessed on 11 January 2026).
- Jacob, D.; St. Clair, A.L.; Mahon, R.; Marsland, S.; Murisa, M.N.; Buontempo, C.; Pulwarty, R.S.; Siddiqui, M.R.; Grossi, A.; Steynor, A.; et al. Co-production of climate services: Challenges and enablers. Front. Clim. 2025, 7, 1507759.
- Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. CSUR 2021, 54, 115.
- Savannah Software Solutions. The Role of AI in Climate Modeling: Exploring How Artificial Intelligence Is Improving Predictions and Responses to Climate Change. Available online: https://savannahsoftwaresolutions.co.ke/the-role-of-ai-in-climate-modeling-exploring-how-artificial-intelligence-is-improving-predictions-and-responses-to-climate-change/ (accessed on 27 May 2025).
- Sustainability-Directory. AI Bias in Equitable Climate Solutions. Available online: https://sustainability-directory.com/question/ai-bias-equitable-climate-solutions/ (accessed on 27 May 2025).
- Amnuaylojaroen, T. Advancements and challenges of artificial intelligence in climate modeling for sustainable urban planning. Front. Artif. Intell. 2025, 8, 1517986.
- Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019; pp. 220–229.
- Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé, H., III; Crawford, K. Datasheets for datasets. Commun. ACM 2021, 64, 86–92.
- American Bar Association. Climate Change and Responsible AI Affect Cybersecurity and Digital Privacy Conflicts. SciTech Lawyer 2025, 21.
- World Meteorological Organization. 2024 State of Climate Services. 2024. Assesses Global Climate Services Capacity and Gaps. Available online: https://wmo.int/publication-series/2024-state-of-climate-services (accessed on 27 May 2025).
- Mastrandrea, M.D.; Mach, K.J.; Plattner, G.K.; Edenhofer, O.; Stocker, T.F.; Field, C.B.; Ebi, K.L.; Matschoss, P.R. The IPCC AR5 guidance note on consistent treatment of uncertainties: A common approach across the working groups. Clim. Change 2011, 108, 675.
- UNDP Climate Promise. What Are Climate Misinformation and Disinformation and How Can We Tackle Them? Available online: https://climatepromise.undp.org/news-and-stories/what-are-climate-misinformation-and-disinformation-and-how-can-we-tackle-them (accessed on 27 May 2025).
- ISO/IEC 42001:2023; Artificial Intelligence—Management System. ISO: Geneva, Switzerland; IEC: Geneva, Switzerland, 2023.
- ISO/IEC 23894:2023; Information Technology—Artificial Intelligence—Guidance on Risk Management. ISO: Geneva, Switzerland; IEC: Geneva, Switzerland, 2023.
- Golding, N.; Lambkin, K.; Wilson, L.; De Troch, R.; Fischer, A.M.; Hygen, H.O.; Hama, A.M.; Dyrrdal, A.V.; Jamsin, E.; Termonia, P.; et al. Developing national frameworks for climate services: Experiences, challenges and learnings from across Europe. Clim. Serv. 2025, 37, 100530.
- World Meteorological Organization (WMO). National Framework for Climate Services (NFCS) Factsheet; World Meteorological Organization (WMO): Geneva, Switzerland, 2018.
- World Meteorological Organization (WMO). WMO–HMEI Code of Ethics Guiding Public–Private Engagement; World Meteorological Organization (WMO): Geneva, Switzerland, 2024.
- EY. AI and Sustainability: Opportunities, Challenges and Impact. Available online: https://www.ey.com/en_nl/insights/climate-change-sustainability-services/ai-and-sustainability-opportunities-challenges-and-impact (accessed on 27 May 2025).
- Giorgi, F.; Jones, C.; Asrar, G.R. Addressing climate information needs at the regional level: The CORDEX framework. WMO Bull. 2009, 58, 175–183.
- Eyring, V.; Bony, S.; Meehl, G.A.; Senior, C.A.; Stevens, B.; Stouffer, R.J.; Taylor, K.E. Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev. 2016, 9, 1937–1958.
- WeAdapt. Justice and Equity in Climate Change Adaptation: Overview of an Emerging Agenda. Available online: https://weadapt.org/knowledge-base/gender-and-social-equality/justice-and-equity-in-climate-change-adaptation-overview-of-an-emerging-agenda/ (accessed on 27 May 2025).






| Architecture | Key Mechanisms/Characteristics | Strengths in Downscaling | Limitations/Weaknesses | UQ Capabilities/Robustness to Non-Stationarity and Extremes | Typical Climate Variables | Typical Input Resolution | Typical Output Resolution | Key References |
|---|---|---|---|---|---|---|---|---|
| SVM (Support Vector Machines) | Kernel-based supervised learning; finds optimal hyperplane in transformed feature space; can use nonlinear kernels for complex relationships. | Performs well with limited data; robust to high-dimensional predictor spaces; strong baseline for PP downscaling. | Choice of kernel and hyperparameters critical; may underperform on highly non-stationary or extreme events; less scalable to massive training datasets. | UQ typically via bootstrapping or ensembles; deterministic by default; robustness to non-stationarity depends on training sample diversity. | Precip, Temp | GCM scale (e.g., 50–250 km) | Station/grid scale | [34,64] |
| Random Forests (RF, AP-RF, Prec-DWARF) | Ensemble of decision trees trained on bootstrap samples; output is mean/majority vote; AP-RF extends with predictive distribution outputs. | Handles nonlinear predictor–predictand relationships; naturally ranks predictor importance; AP-RF produces stochastic samples. | May smooth fine-scale details; bias in extremes without specialized treatment; interpretability less direct than single trees. | Yes for AP-RF (predictive distribution via gamma parameters); deterministic for standard RF; moderate robustness to non-stationarity if trained on diverse climates. | Precip | 0.25–1° | 0.125°/site-level | [35,112] |
| CNN (SRCNN, U-Net, ResNet) | Convolutional layers, pooling, shared weights. U-Net: encoder–decoder with skip connections. ResNet: residual blocks. | Spatial feature extraction, pattern recognition; U-Nets preserve fine details; ResNets enable deeper learning. | Overfitting; extrapolation issues; can be overly smooth under MSE loss; plain CNNs struggle with depth. | UQ via ensembles; robustness to non-stationarity often limited without targeted strategies (e.g., PGW training). Standard CNNs may smooth extremes unless using specialized losses or architectures. | Temp, Precip, Wind, Solar Rad. | 25–250 km | 1–25 km | [10,13,37,38,40,113] |
| GAN (CGAN, MSG-GAN, evtGAN, Sup3rCC) | Generator and discriminator trained adversarially. Conditional GANs (CGANs) use input conditions. Sup3rCC uses GANs to learn and inject spatio-temporal features from historical high-res data into coarse GCM outputs for renewable energy resource variables. | Perceptually realistic outputs, sharp details, better extreme event statistics, spatial variability. Sup3rCC provides high-resolution (4 km hourly) realistic data for wind, solar, temp, humidity, pressure, tailored for energy system analysis and computationally efficient compared to dynamical downscaling. | Training instability (mode collapse), difficult evaluation, potential artifacts, may not capture the full statistical distribution. Sup3rCC represents historical/future climate conditions rather than specific historical weather events, and does not reduce GCM uncertainty. | UQ via ensembles, though these can be challenging to calibrate. Potential for better extreme event generation. Robustness to non-stationarity is an active research area; can learn spurious correlations if not carefully designed/trained. Sup3rCC aims for physically realistic outputs by learning from historical data. | Temp, Precip, Wind, Solar Rad. Sup3rCC specialized for renewable energy variables (wind, solar, temp, humidity, pressure). | GCM scale (e.g., 25–100 km) | 1–12 km. Sup3rCC: 4 km hourly. | [28,47,48,53,88,114,115] |
| LSTM/ConvLSTM | Recurrent memory cells (LSTM); ConvLSTM embeds convolutions into gates. | Captures long-range temporal dependencies; suitable for sequence modeling; combines well with CNNs in CNN–LSTM hybrids. | High complexity; pure LSTM underperforms ConvLSTM on spatial data; capture of very long-range spatial dependencies can be limited. | UQ via ensembles or Bayesian RNNs; can model temporal non-stationarity if reflected in training data but may struggle with unseen future shifts and rare extremes without augmentation. | Precip, Runoff, other time-evolving vars. | Gridded time series | Gridded time series | [106,116,117,118] |
| Transformer (ViT, PrecipFormer, etc.) | Self-attention for global context; captures long-range spatio-temporal interactions. | Excellent at modeling long-range dependencies; strong transfer potential, especially in hybrid architectures. | Quadratic attention cost (being mitigated by sparse/linearized variants); relatively new in downscaling; large data requirements. | UQ via attention-weighted ensembles; promising for non-stationarity when pre-trained on diverse climates; attention can focus on localized antecedent signatures of extremes, aiding detection though not guaranteeing tail magnitude accuracy. | Temp, Precip, Wind, multiple vars. | Various (e.g., 50 km, 250 km) | Various (e.g., 0.9 km, 7 km, 25 km) | [15,16,24,57,108,119] |
| Diffusion Model (LDM, STVD) | Iterative denoising process; LDMs operate in latent space. | High-quality, diverse samples; stable training; explicit probabilistic outputs; good spatial detail. | Computationally intensive (though LDMs mitigate cost); relatively nascent for downscaling; slow sampling. | Excellent UQ via learned distributions and ensemble generation; promising for capturing tail behavior and fine-grained spatial detail of extremes; robustness to non-stationarity is an active research area, but shows potential when trained on diverse climate data. | Temp, Precip, Wind | 100–250 km | 2–10 km | [20,31,32,33,54] |
| Multi-task Foundation Models (e.g., Prithvi-WxC, FourCastNet, ORBIT-2) | Large pre-trained (often Transformer-based) models fine-tuned for downscaling. | Zero/few-shot potential; multi-variable support; leverage extensive pre-training. | Very high pre-training cost; uncertain generalization to new locales/tasks without adaptation; bias propagation risks. | UQ via large-ensemble sampling; pre-training on diverse climates can enhance robustness to non-stationarity and extremes, but careful domain adaptation is essential. | Multiple vars | Coarse GCM/Reanalysis | Fine (task-dependent) | [23,60] |

| Issue | How to Diagnose | What to Report | Common Mitigations |
|---|---|---|---|
| Collinearity | Pairwise correlation matrix; VIF; condition indices [140,141] | List flagged predictor groups; VIF summary; grouped-ablation deltas | Group-wise ablation; regularization; PCA/PLS; domain-driven pruning |
| Suppressors/confounding | Marginal vs. partial association (“sign-flip”); hierarchical ablations [144] | Predictors with sign flips; whether gains persist on held-out/OOD splits | Group-wise modeling; constrain feature sets; regime-specific evaluation |
| Seed variance/instability | Repeat training for S seeds; examine score distributions [142] | Mean ± std (or CI) across seeds; rank stability | Deterministic settings; longer training; ensembles; robust selection criteria |
| Ablation interpretability | Drop-one and drop-group; permutation importance stability | Change in skill per feature/group across regimes | Feature grouping; consistent reporting across seasons/regions |
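Two of the diagnostics in the table above can be illustrated numerically: variance inflation factors (VIF) for collinearity, and a mean ± std summary across training seeds. The following is a minimal sketch only. The predictor names (`t850`, `z500`, `q700`), the synthetic data, the VIF cut-off of 10 (a common rule of thumb that should not be applied mechanically), and the per-seed RMSE values are all assumptions made for illustration, not values from the reviewed studies.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per predictor column.

    Uses the identity VIF_j = [R^{-1}]_jj, where R is the predictor
    correlation matrix (equivalent to 1 / (1 - R_j^2) from regressing
    predictor j on all the other predictors).
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def summarize_seeds(scores):
    """Mean and sample standard deviation of a metric across training seeds."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)


# Synthetic predictors (hypothetical names): z500 is built to be nearly
# collinear with t850, while q700 is independent of both.
rng = np.random.default_rng(0)
n = 500
t850 = rng.normal(size=n)
z500 = 0.95 * t850 + 0.1 * rng.normal(size=n)
q700 = rng.normal(size=n)
names = ["t850", "z500", "q700"]
X = np.column_stack([t850, z500, q700])

vif = variance_inflation_factors(X)
flagged = [name for name, v in zip(names, vif) if v > 10.0]  # assumed cut-off

# Hypothetical test RMSE from five runs that differ only in random seed.
seed_rmse = [1.02, 0.98, 1.05, 1.00, 0.95]
mean_rmse, std_rmse = summarize_seeds(seed_rmse)

print(f"flagged predictors: {flagged}")
print(f"RMSE across seeds: {mean_rmse:.3f} ± {std_rmse:.3f}")
```

Flagged predictors would then be ablated as a group rather than one at a time, since dropping a single member of a collinear pair typically changes skill very little.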

| Challenge | What It Uniquely Targets | Primary Evaluation Axis/Stress Test |
|---|---|---|
| Non-stationarity | Temporal drift and regime changes that break historical mappings | Out-of-time validation (train on earlier decades, test on later decades or “warm” periods); explicit drift diagnostics |
| Transferability | Cross-domain generalization beyond the training domain (GCM, scenario, region, data source) | Leave-one-domain-out tests (e.g., leave-one-GCM-out, leave-one-region-out), and combined spatial+temporal OOD tests |
| Causal/mechanism-aware ML | Learning stable, physically meaningful relations (invariants) rather than spurious correlations | Mechanism-based sanity checks, robustness under interventions, and improved performance under non-stationarity and cross-domain transfer |
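The out-of-time and leave-one-domain-out stress tests in the table above amount to specific ways of partitioning samples. A minimal sketch of both is shown here, assuming each sample carries a year and a source-GCM label; the function names, the 2010 cut-off, and the `GCM_A`/`GCM_B`/`GCM_C` labels are hypothetical placeholders.

```python
import numpy as np


def out_of_time_split(years, cutoff_year):
    """Train on samples before cutoff_year, test on cutoff_year and later."""
    years = np.asarray(years)
    train_idx = np.flatnonzero(years < cutoff_year)
    test_idx = np.flatnonzero(years >= cutoff_year)
    return train_idx, test_idx


def leave_one_domain_out(domains):
    """Yield (held_out, train_idx, test_idx), holding out one domain at a time."""
    domains = np.asarray(domains)
    for held_out in np.unique(domains):
        test_mask = domains == held_out
        yield str(held_out), np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical sample metadata: one sample per year per GCM.
years = np.tile(np.arange(1980, 2020), 3)          # 40 years x 3 GCMs
gcms = np.repeat(["GCM_A", "GCM_B", "GCM_C"], 40)  # placeholder GCM labels

# Out-of-time test: earlier decades for training, the last decade for testing.
train_idx, test_idx = out_of_time_split(years, cutoff_year=2010)

# Leave-one-GCM-out: each fold tests on a GCM never seen in training.
folds = list(leave_one_domain_out(gcms))
for held_out, tr, te in folds:
    print(f"held out {held_out}: {len(tr)} train / {len(te)} test samples")
```

A combined spatial+temporal OOD test could intersect the two index sets, for example by training only on pre-cutoff samples from the retained GCMs.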

| Grand Challenge | Measurable Objective (Examples) | Recommended Reporting |
|---|---|---|
| Non-stationarity/OOD robustness | Limit performance degradation under out-of-time tests (“warm” periods) and explicit domain shift | Report IID vs. OOD skill side-by-side (e.g., the ratio of OOD to IID skill), plus drift diagnostics and failure cases |
| Transferability (cross-domain) | Demonstrate stability across domains (e.g., leave-one-GCM-out, leave-one-region-out) | Report cross-domain mean ± std; identify worst-case domain; include at least one strict domain-holdout stress test |
| Physical consistency | Quantify physical violations and enforce verifiable constraints | Report conservation/constraint diagnostics (e.g., water-budget error, non-negativity, physically plausible ranges); include “physics scorecards” alongside accuracy |
| Uncertainty quantification (UQ) | Provide calibrated predictive distributions, not only point estimates | Report empirical coverage of prediction intervals (e.g., 90%/95%); reliability diagrams; proper scores such as CRPS and Brier score [87,153,154] |
| Extremes | Reduce bias in tail behavior and rare-event risk metrics | Report tail metrics (e.g., P95/P99 quantile error), exceedance skill (e.g., CSI/FSS for threshold events), and return-level bias from EVT fits |
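Several of the reporting items in the table above have compact definitions. The sketch below shows one possible NumPy implementation of a sample-based CRPS for an ensemble, empirical coverage of a central prediction interval, the critical success index (CSI) for threshold exceedance, and a P99 quantile error. The function names, the synthetic Gaussian data, and the 50-member ensemble are assumptions for illustration; operational verification would normally rely on an established verification package.

```python
import numpy as np


def crps_ensemble(members, obs):
    """Sample-based CRPS per case: E|X - y| - 0.5 * E|X - X'|.

    members: array (n_members, n_cases); obs: array (n_cases,).
    """
    members = np.asarray(members, dtype=float)
    obs = np.asarray(obs, dtype=float)
    abs_err = np.abs(members - obs).mean(axis=0)
    spread = np.abs(members[:, None, :] - members[None, :, :]).mean(axis=(0, 1))
    return abs_err - 0.5 * spread


def interval_coverage(members, obs, level=0.90):
    """Fraction of observations inside the central `level` ensemble interval."""
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(members, [alpha, 1.0 - alpha], axis=0)
    obs = np.asarray(obs, dtype=float)
    return float(np.mean((obs >= lo) & (obs <= hi)))


def critical_success_index(pred, obs, threshold):
    """CSI = hits / (hits + misses + false alarms) for exceedance events."""
    pred_event = np.asarray(pred) >= threshold
    obs_event = np.asarray(obs) >= threshold
    hits = np.sum(pred_event & obs_event)
    misses = np.sum(~pred_event & obs_event)
    false_alarms = np.sum(pred_event & ~obs_event)
    denom = hits + misses + false_alarms
    return float(hits / denom) if denom > 0 else float("nan")


def quantile_error(pred, obs, q=0.99):
    """Difference between predicted and observed q-quantile (tail bias)."""
    return float(np.quantile(pred, q) - np.quantile(obs, q))


# Synthetic check: ensemble and observations drawn from the same distribution,
# so the 90% interval should cover close to (slightly below) 90% of cases.
rng = np.random.default_rng(1)
members = rng.normal(size=(50, 1000))
obs = rng.normal(size=1000)

mean_crps = float(crps_ensemble(members, obs).mean())
coverage_90 = interval_coverage(members, obs, level=0.90)
p99_error = quantile_error(members.mean(axis=0), obs, q=0.99)
print(f"mean CRPS={mean_crps:.3f}, 90% coverage={coverage_90:.3f}, "
      f"P99 error of ensemble mean={p99_error:.3f}")
```

In this synthetic setup the ensemble mean has a strongly negative P99 error even though the ensemble itself is well calibrated, which illustrates why tail metrics are best reported on individual members or samples rather than on an averaged field.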

| Technical Failure Mode | Societal Risk Pathway | Normative Obligation | Actionable Controls (Examples) |
|---|---|---|---|
| Non-stationarity/domain shift (Section 9.1) | Overconfident use of invalid projections; maladaptation when future regimes differ from training | Reliability under shift; precaution in deployment | OOD stress tests (cross-GCM/PGW/time splits); shift-aware reporting; monitoring for drift |
| Extreme-event bias (Section 9.3) | Under-/over-estimation of high-impact hazards; inequitable disaster preparedness | Safety for high-consequence tails | Tail-focused evaluation (event metrics + tail magnitude); physics checks during extremes |
| Uncalibrated or missing UQ (Section 9.4) | False confidence; inability to manage risk or compare options | Honest uncertainty communication; calibrated decision support | Probabilistic verification; calibration diagnostics; communicate limits using calibrated language |
| Sparse/biased observations and capacity gaps (Section 6.2) | Systematic degradation in data-poor regions; inequitable climate services | Equity and inclusion | Dataset documentation; stratified evaluation by region/vulnerability; co-production with users |
| Weak reproducibility/benchmarking (Section 9.5) | Unverifiable claims; erosion of trust and accountability | Auditability and accountability | Model/data cards; versioned pipelines; shared benchmarks and minimal validation standards |

| Domain | Minimum Auditable Checks (Report as Numbers/Artifacts) |
|---|---|
| Fairness/equity | Stratified MAE/RMSE and bias; worst-group performance; disparity gaps; tail metrics stratified by regime (e.g., CSI/FSS at P95/P99). |
| Transparency | Model card + dataset datasheet; intended/non-intended use; training/eval split policy; versioning and changelog [170,171]. |
| Uncertainty | Calibration/coverage at multiple levels; proper scoring (e.g., CRPS); tail reliability; uncertainty communication protocol [87,174]. |
| Governance | Named accountable roles; independent evaluation gate; drift monitoring triggers; incident-response procedure; documentation aligned to AI RMF/management-system standards [156,176,177]. |
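The fairness/equity row above asks for stratified errors, worst-group performance, and disparity gaps. A minimal sketch of how these three numbers could be computed is given below; the group labels (`dense` and `sparse` observation networks), the toy values, and the function names are assumptions for illustration only.

```python
import numpy as np


def stratified_rmse(pred, obs, groups):
    """RMSE computed separately for each group label."""
    pred, obs, groups = map(np.asarray, (pred, obs, groups))
    scores = {}
    for g in np.unique(groups):
        mask = groups == g
        scores[str(g)] = float(np.sqrt(np.mean((pred[mask] - obs[mask]) ** 2)))
    return scores


def worst_group_and_gap(scores):
    """Return the worst-performing group and the gap to the best group."""
    worst = max(scores, key=scores.get)
    gap = scores[worst] - min(scores.values())
    return worst, gap


# Toy example: four samples from a data-rich region, two from a data-poor one.
obs = [10.0, 12.0, 8.0, 9.0, 20.0, 22.0]
pred = [11.0, 11.0, 8.0, 9.0, 16.0, 26.0]
groups = ["dense", "dense", "dense", "dense", "sparse", "sparse"]

scores = stratified_rmse(pred, obs, groups)
worst, gap = worst_group_and_gap(scores)
print(scores, worst, round(gap, 3))
```

Reporting the per-group dictionary alongside the pooled score makes it visible when a good overall RMSE hides poor performance in a small or data-poor stratum.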

| Component | Minimum Requirements (Must Be Reported) |
|---|---|
| Generalization | In-domain test and at least one explicit OOD test (cross-GCM, cross-region, or scenario/PGW warm-test when applicable); report performance deltas (in vs. OOD). |
| Initialization sensitivity | Multi-seed evaluation with mean ± std (or confidence intervals) for core metrics; report whether conclusions hold across seeds. |
| Extremes | Threshold-based scores (e.g., CSI/FSS at P95/P99) and tail magnitude errors (e.g., conditional MAE above P95) stratified by key regimes. |
| Physical consistency | At least one physics/budget diagnostic relevant to the variable (e.g., conservation or closure proxy), reported alongside skill metrics. |
| Uncertainty | If probabilistic, calibration/coverage and proper scoring (e.g., CRPS) and a short uncertainty-communication protocol [87,174]. |
| Transparency and governance | Release documentation (model card + datasheet), versioning/changelog, and named accountable roles aligned with risk/governance frameworks [156,170,171]. |
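Two of the minimum requirements above, the tail magnitude error and the physics/budget diagnostic, can be sketched in a few lines. The example assumes that block-averaging the downscaled field back to the coarse grid is an acceptable closure proxy for the variable in question, which is one common choice rather than a prescription of this review. All function names and the toy fields are hypothetical.

```python
import numpy as np


def conditional_mae_above(pred, obs, q=0.95):
    """MAE restricted to cases where the observation is at or above its q-quantile."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    mask = obs >= np.quantile(obs, q)
    return float(np.mean(np.abs(pred[mask] - obs[mask])))


def coarsen(field, factor):
    """Block-average a 2D field by an integer factor along both axes."""
    h, w = field.shape
    return field.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def closure_error(fine, coarse, factor):
    """Mean absolute mismatch between the coarsened output and the coarse input."""
    return float(np.mean(np.abs(coarsen(fine, factor) - coarse)))


def negative_fraction(field):
    """Share of grid cells violating non-negativity (e.g., for precipitation)."""
    return float(np.mean(np.asarray(field) < 0))


# Tail error: a prediction that is exact except for a 2-unit low bias in the top 5%.
obs = np.arange(100.0)
pred = obs.copy()
pred[95:] -= 2.0
tail_mae = conditional_mae_above(pred, obs, q=0.95)

# Closure proxy: a 2x2 coarse field, and a 4x4 "downscaled" field that
# conserves each coarse-cell mean exactly.
coarse = np.array([[1.0, 2.0], [3.0, 4.0]])
fine_consistent = np.kron(coarse, np.ones((2, 2)))
fine_biased = fine_consistent + 0.5

print("tail MAE above P95:", tail_mae)
print("closure error (consistent):", closure_error(fine_consistent, coarse, 2))
print("closure error (biased):", closure_error(fine_biased, coarse, 2))
print("negative fraction:", negative_fraction(fine_consistent - 1.5))
```

Reporting these diagnostics next to the headline skill metrics makes it harder for a model to score well overall while violating a conservation constraint or underestimating extremes.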
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Najafi, H.; Lagerwall, G.L.; Obeysekera, J.; Liu, J. Machine Learning in Climate Downscaling: A Critical Review of Methodologies, Physical Consistency, and Operational Applications. Water 2026, 18, 271. https://doi.org/10.3390/w18020271

