Machine Learning in Climate Downscaling: A Critical Review of Methodologies, Physical Consistency, and Operational Applications

Najafi, Hamed; Lagerwall, Gareth Lynton; Obeysekera, Jayantha; Liu, Jason

doi:10.3390/w18020271

¹ Knight Foundation School of Computing and Information Sciences, Florida International University, 11200 SW 8th St., Miami, FL 33199, USA

² The Everglades Foundation, 18001 Old Cutler Road, Palmetto Bay, FL 33157, USA

³ Institute of Environment, Florida International University, 11200 SW 8th St., Miami, FL 33199, USA

* Author to whom correspondence should be addressed.

Water 2026, 18(2), 271; https://doi.org/10.3390/w18020271

Submission received: 16 November 2025 / Revised: 30 December 2025 / Accepted: 12 January 2026 / Published: 21 January 2026

(This article belongs to the Section New Sensors, New Technologies and Machine Learning in Water Sciences)

Abstract

High-resolution climate projections are essential for regional risk assessment; however, Earth System Models (ESMs) operate at scales far too coarse for local impacts. This review examines how machine learning (ML) downscaling can bridge this divide and addresses a key knowledge gap: how to achieve reliable, physically consistent downscaling under future climate change. This article synthesizes ML downscaling developments from 2010 to 2025, spanning early statistical methods to modern deep learning (e.g., convolutional neural networks (CNNs), generative adversarial networks (GANs), diffusion models, and transformers). The analysis introduces a new taxonomy of model families and frames the discussion around the “performance paradox”—the tendency for models with excellent historical skill to falter under non-stationary climate shifts. Our analysis finds that convolutional approaches efficiently capture spatial structure but tend to smooth out extremes, whereas generative models better reproduce high-intensity events at the cost of greater complexity. The study also highlights emerging solutions like physics-informed models and improved uncertainty quantification to tackle persistent issues of physical consistency and trust. Finally, the synthesis outlines a practical roadmap for operational ML downscaling, emphasizing standardized evaluation, out-of-distribution stress tests, and hybrid physics–ML approaches to bolster confidence in future projections.

Keywords:

climate downscaling; machine learning; deep learning; transferability; physical consistency; explainable AI; uncertainty quantification

1. Introduction: The Imperative for High-Resolution Climate Projections and the Rise of Machine Learning

1.1. Positioning This Review in the Literature

While ML-based climate downscaling is a burgeoning field, this review provides a broad, critical synthesis of methods, persistent challenges, and future research trajectories from 2010 to 2025. Unlike focused empirical studies such as the influential intercomparison by Vandal et al. [1] (daily precipitation in a single region), we do not present a new empirical analysis; instead, we synthesize findings across diverse variables, geographies, and model architectures.

Relative to prior comprehensive reviews, we directly address gaps they left. For example, Rampal et al. [2] survey recent ML advances and emphasize interpretability and transparency, but do not offer a unifying taxonomy of downscaling methods or examine generalization failure under changing climate conditions. In contrast, we introduce the first taxonomy linking ML architectures to specific downscaling challenges (e.g., spatial detail vs. extremes) and center our analysis on the “performance paradox”—the breakdown of apparent historical skill under future non-stationarity. We further differentiate this work by proposing a concrete operational evaluation protocol (Section 7) and by charting research priorities toward physically consistent, trustworthy, and operationally viable models.

Specifically, this review contributes the following:

1. Creating a novel taxonomy mapping model families (CNNs, GANs, Transformers, Diffusion Models) to the downscaling challenges they address.
2. Conducting a critical analysis of the “performance paradox” and its implications for robustness under non-stationary climate change.
3. Proposing a practical evaluation protocol and research priorities to guide the community toward physically consistent, trustworthy, and operationally viable models.

To clarify positioning, reviews like that of Maraun et al. [3] reflect the pre-deep learning state-of-the-art; benchmark efforts such as the VALUE project [4] emphasize experimental intercomparison; and perspective papers like that of Reichstein et al. [5] advocate for DL across Earth system science broadly. Conversely, this review provides a deep, practical examination focused specifically on climate downscaling and model transferability under climate change.

1.2. Overview of the Review’s Scope and Objectives

This review provides a comprehensive, critical synthesis of machine-learning-based climate downscaling. We frame the discussion around three research questions:

RQ1 (Evolution of Methodologies): How have ML approaches for climate downscaling evolved from classical algorithms to the current deep learning architectures, and what are the primary capabilities and intended applications of each major model class?
RQ2 (Persistent Challenges): What are the critical, cross-cutting challenges that limit the operational reliability of contemporary ML downscaling models, particularly regarding their physical consistency, generalization under non-stationary climate conditions, and overall trustworthiness?
RQ3 (Emerging Solutions and Future Trajectories): Which methodological frontiers, including physics-informed learning (PIML), robust uncertainty quantification (UQ), and explainable AI (XAI), show the most promise for addressing these challenges and guiding future research?

These questions address gaps in prior surveys. We include RQ1 (Methodological Evolution) to document the progression from classical algorithms to modern deep architectures. We include RQ2 (Persistent Challenges) to synthesize cross-cutting barriers to operational reliability, including physical consistency, generalization under non-stationarity, the representation of extremes, and uncertainty. Finally, RQ3 (Emerging Solutions/Future Directions) highlights methodological frontiers (PIML, UQ, and XAI) that may mitigate these limitations and guide future work.

Scope: We focus on spatial downscaling of climate-model outputs, covering both super-resolution (gridded, image-like downscaling) and pointwise statistical downscaling. We do not review purely temporal disaggregation (e.g., monthly-to-daily) except where it is tightly coupled with spatial downscaling (we note such cases briefly, e.g., [6]). We prioritize spatial resolution because it is the dominant bottleneck for local impact assessment and because a comprehensive treatment of temporal downscaling is beyond the scope of this article. Within this scope, we include studies across major climate variables (precipitation, temperature, etc.), diverse geographic regions, and a range of ML methodologies, requiring ML to be the primary downscaling component (see Section 2 for inclusion criteria). We synthesize how model and data choices shape performance, compare strengths and limitations across approaches, and discuss efforts to improve extremes, transferability, uncertainty, and physical consistency. Ultimately, this review aims to be a practical resource for researchers and practitioners navigating the rapidly evolving landscape of ML-based climate downscaling.

A key finding highlighted in this review is the “performance paradox”: the tendency for ML models to show excellent skill on historical (in-sample) data yet fail to generalize under future, out-of-distribution climate scenarios, which leads to an associated “trust deficit” in their predictions. This notion aligns with long-recognized concerns about the stationarity assumption in downscaling: for example, Lanzante et al. [7] demonstrated that empirical relationships learned from historical data may break down in a warmer climate, which is exactly the issue we term the performance paradox. Throughout this paper, we use “performance paradox” as an umbrella framing (synonymous with a transferability or non-stationarity crisis) to emphasize this critical challenge.
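One simple way to probe the performance paradox is to evaluate on conditions warmer than anything seen in training. The sketch below is a minimal illustration under stated assumptions, not a protocol taken from the reviewed studies: the function name `warm_year_holdout`, the 20% hold-out fraction, and the temperature values are all hypothetical.

```python
def warm_year_holdout(annual_mean_temp, holdout_fraction=0.2):
    """Split years into (train, test), reserving the warmest years for testing."""
    ranked = sorted(annual_mean_temp, key=annual_mean_temp.get)  # coolest first
    n_test = max(1, round(len(ranked) * holdout_fraction))
    return sorted(ranked[:-n_test]), sorted(ranked[-n_test:])


# Hypothetical regional annual-mean temperatures (degC); illustrative values only.
annual_temp = {2001: 14.1, 2002: 14.3, 2003: 15.2, 2004: 14.0, 2005: 14.8,
               2006: 14.6, 2007: 15.0, 2008: 14.2, 2009: 14.7, 2010: 15.4}

train_years, test_years = warm_year_holdout(annual_temp)
print("train years:", train_years)
print("held-out warmest years:", test_years)
```

A random split would mix warm and cool years in both sets and could hide exactly the out-of-distribution failure that this framing is concerned with.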

Organization of the Review. The remainder of this article is structured to address the research questions in turn. Section 4 covers the historical and methodological evolution of ML downscaling (answering RQ1). Sections 7–9 then critically examine persistent challenges and evaluation issues in current models (addressing RQ2). Finally, Sections 5–10 explore emerging solutions and future research needs, including physics-informed methods and uncertainty quantification (addressing RQ3). This structure ensures that each objective is met in sequence, and it provides a logical progression from past developments to present hurdles and on to future opportunities.

2. Review Methodology

(This section summarizes how we surveyed and selected the literature, complementing the broad scope in the Introduction.) To ensure a comprehensive and reproducible synthesis, we implemented a structured literature search and screening process spanning January 2010 to early 2025.

2.1. Search Strategy and Data Sources

We searched Google Scholar, arXiv (cs.LG and physics.ao-ph), Web of Science, and IEEE Xplore using the following Boolean query: (‘machine learning’ OR ‘deep learning’ OR ‘convolutional neural network’ OR ‘GAN’ OR ‘transformer’) AND (‘downscaling’ OR ‘super-resolution’ OR ‘bias correction’) AND (‘precipitation’ OR ‘temperature’ OR ‘climate change’). AI-driven search tools (Undermind.ai) were employed to identify relevant literature using keywords such as ‘machine learning’ and ‘downscaling,’ while the Gemini V2.5 model was utilized to generate automated summaries of the retrieved papers.

2.2. Inclusion and Exclusion Criteria

Eligibility was based on two criteria: (i) Methodological: ML must be the primary downscaling engine; purely statistical methods (e.g., BCSD, quantile mapping) were excluded, except as benchmarks. (ii) Problem definition: The study must address spatial or spatio-temporal downscaling; purely temporal disaggregation was excluded unless integrated with spatial covariates. To ensure quality, we included only studies reporting quantitative validation against observational datasets (e.g., ERA5, TRMM, station data). This process yielded the key literature tracing the evolution from classical ML (SVMs, Random Forests) to contemporary deep learning architectures.

3. Background: The Downscaling Problem

3.1. The Scale Gap in Climate Modeling and the Need for Downscaling

Global Climate Models (GCMs) are fundamental for understanding and projecting climate change; however, their coarse spatial resolution (typically 50–300 km) limits regional and local impact assessment. This is insufficient for decision-making in sectors such as agriculture, hydrology, energy resource management, urban planning, and disaster risk preparedness, which require high-resolution climate information [3,8]. Downscaling bridges this long-recognized “scale gap” by transforming coarse GCM outputs into finer-scale projections for localized impact studies and adaptation planning. The growing use of machine learning (ML) in this domain is driven by this unmet need for high-resolution data, together with the limitations of pre-ML downscaling approaches.

3.2. Sectoral Implications of the Scale Gap

The mismatch between GCM resolution (∼100 km) and surface-process scales (∼1–10 km) creates a bottleneck for impact assessment.

Hydrology and water resources: Hydrological models (e.g., flash-flood forecasting in complex terrain) are sensitive to precipitation intensity, but GCMs spread rainfall over large grid cells, producing “drizzle bias” that dilutes convective extremes. In the Andes and Himalayas, this averaging can systematically underestimate peak runoff and landslide risk; without robust downscaling to at least 5 km, catchment response times cannot be accurately simulated (as illustrated in efforts such as GloFAS).

Agriculture and food security: Crop models (e.g., DSSAT, APSIM) respond non-linearly to temperature; for example, maize can experience reproductive failure above 35 °C. Within topographically diverse GCM grid cells, averaging cool mountain tops with hot valley floors can mask sterilization thresholds, an “averaging artifact” that risks overestimating food security in vulnerable micro-climates.

Urban planning and energy: The Urban Heat Island (UHI) operates at neighborhood scales far below GCM grids, so coarse projections miss canopy thermal properties and can underestimate cooling demand and heat-related mortality. For renewables, wind power scales as $P \propto v^{3}$, so smoothing wind gradients yields disproportionate errors in energy potential, motivating downscaling for site selection.
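The effect of the cubic wind-power relation can be made concrete with a small worked example. The sketch below uses hypothetical fine-scale wind speeds inside one coarse grid cell and compares the mean of the cubed speeds with the cube of the mean speed; the constant of proportionality is omitted, so the values are relative only.

```python
def mean_cubed(speeds):
    """Average of v**3, proportional to the mean power across fine-scale sites."""
    return sum(v ** 3 for v in speeds) / len(speeds)


# Hypothetical wind speeds (m/s) at four sites within a single coarse grid cell.
fine_speeds = [4.0, 6.0, 8.0, 10.0]
coarse_speed = sum(fine_speeds) / len(fine_speeds)  # what a smoothed field reports

power_fine = mean_cubed(fine_speeds)   # power implied by the resolved gradients
power_coarse = coarse_speed ** 3       # power implied by the cell-mean speed
underestimate = 1.0 - power_coarse / power_fine

print(f"relative power, fine-scale: {power_fine:.1f}")
print(f"relative power, cell mean:  {power_coarse:.1f}")
print(f"underestimate from smoothing: {underestimate:.1%}")
```

With these illustrative numbers the cell-mean speed of 7 m/s implies a relative power of 343, against 448 when the gradients are resolved: a shortfall of roughly 23% even though the mean wind speed is reproduced exactly.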

3.3. Limitations of Traditional Downscaling Methods

Two traditional approaches dominate: dynamical downscaling (DD) and statistical downscaling (SD).

Dynamical downscaling runs physics-based Regional Climate Models (RCMs) at higher resolution over limited areas using GCM boundary conditions. While physically consistent, DD is extremely computationally intensive; for example, the budget to simulate the globe at 100 km would be insufficient to dynamically downscale a Spain-sized region to 10 km [9]. This cost limits the large ensembles, multiple scenarios, and long periods needed for uncertainty assessment, and RCMs can also introduce systematic biases [10].

Statistical downscaling (e.g., regression, weather generators, analog methods) learns empirical links between large-scale predictors and local predictands [11]. Its central vulnerability, shared by ML, is the stationarity assumption that historical relationships persist under future forcing. This assumption is increasingly challenged by non-stationary climate change and underpins the covariate and concept drift issues discussed later (Section 7.3, Section 8.5 and Section 9.1). For instance, the Clausius–Clapeyron relation implies a non-linear rise in atmospheric water vapor capacity (∼7% per °C), intensifying convective precipitation [12]; models trained on lower-intensity historical rainfall may “extrapolate blind” and miss future extremes with no historical analog. Likewise, aridification-driven shifts in land–atmosphere coupling can alter local temperature responses to synoptic forcing, rendering historical correlations obsolete.
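To give a sense of scale for the ∼7% per °C figure, the sketch below compounds that rate over a few warming levels and applies it to a hypothetical historical maximum. Two assumptions are made purely for illustration: the rate is treated as compounding per degree, and rainfall intensity is scaled by the same factor as vapor capacity, which real extremes need not follow.

```python
CC_RATE = 0.07  # fractional increase per degC, the ~7% figure quoted above


def cc_scaling(delta_t_c, rate=CC_RATE):
    """Multiplicative change in vapor capacity, compounding `rate` per degC."""
    return (1.0 + rate) ** delta_t_c


historical_max_mm = 80.0  # hypothetical wettest day in a training record
for warming_c in (1.0, 2.0, 4.0):
    factor = cc_scaling(warming_c)
    scaled = historical_max_mm * factor
    print(f"+{warming_c:.0f} degC: x{factor:.3f} -> {scaled:.1f} mm/day")
```

Under these assumptions, 4 °C of warming gives a factor of about 1.31, so the scaled event lies well outside the range of the training record. This is the regime in which a purely empirical mapping has no historical analog to learn from.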

The Fidelity–Cost Trilemma

In practice, DD offers high physical fidelity at extreme computational cost, whereas SD is efficient but may be unreliable under climate change because of its empirical nature. ML promises a middle ground: once trained, it can downscale rapidly (enabling large ensembles like SD) while capturing complex non-linear relationships (approaching DD). However, training sophisticated ML introduces new challenges (e.g., enhancing high-resolution texture without sacrificing accuracy). These considerations motivate the emergence of ML—and deep learning in particular—as a promising alternative to purely dynamical or statistical approaches.

3.4. Emergence and Promise of ML in Transforming Statistical Downscaling

The application of machine learning (ML), especially modern deep learning, to climate downscaling marks a paradigm shift. ML models can learn complex, non-linear mappings from coarse inputs to fine outputs directly from data, potentially overcoming limits of manually specified statistical methods. This data-driven approach can capture the fine-grained spatial structure that earlier methods missed (e.g., recovering orographic precipitation details that simple interpolation would smooth out). At the same time, ML-based downscaling inherits the stationarity challenge of SD—an issue we scrutinize later—so its performance requires critical examination. The community’s growing interest reflects both clear successes (e.g., early deep networks resolving features conventional methods could not) and recognition of ML-specific failure modes, setting the stage for the detailed review that follows.

The recent literature illustrates a landscape of both breakthrough capabilities and persistent limitations. On the advancement front, Vandal et al. [13] demonstrated with ‘DeepSD’ that stacked super-resolution networks could effectively recover topographic nuances that standard interpolation misses. However, significant failure modes remain. Maraun et al. [14] argued that statistical adjustments can disrupt the spatiotemporal consistency of climate fields, creating physically implausible scenarios. Furthermore, in their intercomparison study, Vandal et al. [1] highlighted that while deep learning models excel at mean-state prediction, they frequently fail to reproduce the heavy-tailed distributions of precipitation extremes, often converging to the mean and underestimating risk.

The “deep learning revolution” in climate downscaling signifies more than just incremental improvements in performance metrics. It represents a fundamental shift in how the downscaling problem is approached: moving away from the explicit definition of statistical relationships (as in traditional SD) towards the learning of complex, often implicit, functions directly from observational or model-generated data. This paradigm shift offers immense potential for capturing intricate details and dependencies that were previously intractable. However, this increased power is accompanied by new and significant challenges, particularly concerning the interpretability of these “black-box” models and ensuring the physical consistency of their outputs. This review undertakes a critical examination of the advancements, capabilities, and persistent challenges associated with ML-based downscaling, focusing on the period between 2010 and 2025.

To resolve complex spatial dependencies, two dominant architectures have emerged. Convolutional Neural Networks (CNNs), particularly the U-Net architecture, use learnable kernels to capture local spatial hierarchies, effectively modeling how immediate topographic features like elevation gradients force local precipitation. For longer-range dependencies, however, Vision Transformers (ViTs) are gaining traction. Unlike CNNs, which are limited by their receptive fields, ViTs employ self-attention mechanisms to model global interactions, theoretically allowing the model to capture large-scale atmospheric teleconnections that modulate local climate conditions across vast distances.

4. The Evolution of Machine Learning Approaches in Climate Downscaling

This section addresses RQ1 by synthesizing how downscaling methods evolved from CNN/U-Net baselines to generative models (GANs, diffusion) and transformers/foundation models, including cross-resolution/region transfer and multi-task adaptation for downscaling [15,16,17,18,19,20].

The evolution of ML in downscaling can be conceptually bifurcated into two distinct trajectories: the pursuit of spatial fidelity and the quest for probabilistic reliability. The first trajectory, dominated by the transition from simple SRCNNs to complex GANs, focused on spatial super-resolution—maximizing the textural realism and sharpness of the output fields. The second trajectory has focused on generalization and uncertainty quantification (UQ). This lineage has evolved from simple ensemble methods (random forests) to sophisticated probabilistic frameworks like Bayesian neural networks and, most recently, denoising diffusion probabilistic models (DDPMs), which prioritize learning the full conditional probability distribution of the climate state rather than a single deterministic realization.

To visually frame this critical analysis, we introduce a conceptual roadmap in Figure 1. This framework organizes the landscape of ML-based downscaling around the central challenge of the “performance paradox and trust deficit”—the tendency for models to show high skill on historical data but fail to generalize robustly, thereby limiting their operational trust. The roadmap connects this core problem to three primary axes of downscaling challenges: achieving high Spatial Fidelity, accurately representing Extreme Events, and ensuring robust Generalization and Uncertainty Quantification (UQ).

Each axis is mapped to the model families best suited to address it: CNNs and U-Nets for spatial structure, generative models like GANs and Diffusion for realism and extremes, and Transformers for generalization and long-term predictions. The figure further details the core strengths, inherent trade-offs, and key methodological innovations for each family, with the overarching goal of achieving Physical Consistency serving as a cross-cutting objective for the entire field. This layered framework will guide the subsequent sections as we delve into the evolution, challenges, and future trajectories of these technologies.

4.1. Early Applications and Classical ML Benchmarks

The application of machine learning to climate downscaling began with classical algorithms such as support vector machines and random forests, which served as a bridge beyond purely linear methods and as precursors to the current deep learning era. These early studies established the feasibility of ML but also revealed its limitations relative to simple statistical baselines. For example, Vandal et al. [1] found that common machine-learning models for precipitation downscaling did not consistently outperform well-tuned bias-correction methods (e.g., BCSD), especially in reproducing extremes. In practice, many stakeholders continued to use traditional approaches (e.g., quantile mapping and LOCA), meaning ML models had to demonstrate clear gains to justify their added complexity.

Some improvements were noted: Tripathi et al. [34] showed that Support Vector Machines could capture non-linear precipitation relationships beyond linear regression, and He et al. [35] reported that a specialized random forest variant improved heavy-rainfall estimates. Techniques like Quantile Regression Neural Networks [36] were also introduced to directly predict distributional quantiles, enhancing probabilistic downscaling. However, overall lessons from this period were that classical ML required careful predictor selection and often struggled with extremes and spatial coherence, yielding only modest advantages over simpler methods. These limitations—notably the difficulty of capturing complex multi-scale dependencies—paved the way for deep learning. In fact, by 2019, an intercomparison showed that even early deep neural nets lagged behind simpler approaches for some metrics, underscoring the need for more advanced architectures and better loss functions [1].
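Predicting quantiles directly usually relies on the quantile (pinball) loss, which penalizes under- and over-prediction asymmetrically. The sketch below shows the standard form of that loss on hypothetical precipitation values; it is not taken from the cited implementation, which may use a smoothed variant.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean quantile (pinball) loss for a quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction (diff >= 0) costs tau per unit; over-prediction 1 - tau.
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


obs = [0.0, 2.0, 5.0, 40.0]   # hypothetical daily precipitation (mm)
low_guess = [5.0] * len(obs)   # constant prediction near the median
high_guess = [30.0] * len(obs)  # constant prediction near the upper tail

for tau in (0.5, 0.95):
    print(f"tau={tau}: low={pinball_loss(obs, low_guess, tau):.4f}, "
          f"high={pinball_loss(obs, high_guess, tau):.4f}")
```

At `tau = 0.5` the lower guess scores better, while at `tau = 0.95` the higher guess does. Training one output per quantile level in this way is what lets a network describe the tail of the distribution instead of only its center.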

The results of Vandal et al. [1] highlighted that classical ML models did not consistently outperform bias-correction baselines like BCSD; for example, machine learning offered little to no improvement in simulating extreme precipitation indices compared to quantile mapping. In general, classical ML tended to succeed in capturing moderate non-linear patterns when ample training data were available, but often failed to add value for distribution extremes or novel climate conditions. Bias-correction methods, by explicitly calibrating distributions, remained hard to beat in those aspects. This finding reinforced the idea that ML models must demonstrate clear, reliable gains (especially for extremes and other application-relevant metrics) to justify their greater complexity.

Ultimately, the inability of shallow ML models to fully capture non-linear climate relationships (especially for extremes and multi-scale spatial patterns) motivated a turn to deep learning. Researchers recognized that convolutional neural networks (CNNs) and related architectures could automatically learn spatial features and might overcome the shortcomings of classical ML. The following sections thus examine how the field evolved from these early ML efforts into the deep-learning era of CNNs, U-Nets, GANs, and beyond.

4.2. The Deep Learning Paradigm Shift

The trajectory of ML-based climate downscaling shifted with the rapid adoption of deep learning (DL), inspired by computer-vision super-resolution tasks that are conceptually analogous to spatial downscaling [37].

4.2.1. Pioneering Work with Convolutional Neural Networks (CNNs)

CNNs were among the first DL architectures to show strong potential in downscaling because they learn hierarchical spatial features from gridded data via convolutional filters, pooling, and weight sharing, reducing reliance on hand-engineered predictors [11]. A seminal line of work introduced “DeepSD” as stacked SRCNN-based super-resolution, incrementally refining coarse inputs (∼100 km) toward regional scales (∼12 km) to reduce error propagation and leverage multivariate predictors beyond coarse precipitation [13]. While this CNN family improved RMSE relative to bicubic interpolation and statistical baselines, it remained deterministic and typically optimized with MSE/RMSE, which can oversmooth texture and attenuate extremes; it also lacked explicit physical constraints and offered limited out-of-distribution analysis under warming scenarios [13].
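The stacked, incremental structure described above can be sketched as a cascade of upsample-then-refine stages. In the sketch below the 2x factor per stage and the three-stage count are assumptions chosen so that 100 km halves down to 12.5 km, and `refine` is a hypothetical identity placeholder standing in for a trained SRCNN stage; nothing here reproduces the actual DeepSD networks.

```python
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a 2-D grid stored as a list of rows."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def refine(grid):
    """Hypothetical placeholder for one trained super-resolution stage (identity)."""
    return grid


def stacked_downscale(coarse, start_res_km=100.0, n_stages=3):
    """Apply n_stages of (upsample, refine), halving the grid spacing each time."""
    field, res_km = coarse, start_res_km
    for _ in range(n_stages):
        field = refine(upsample2x(field))
        res_km /= 2.0
    return field, res_km


coarse_field = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical 2x2 coarse precipitation
fine_field, fine_res_km = stacked_downscale(coarse_field)
print(len(fine_field), "x", len(fine_field[0]), "cells at", fine_res_km, "km")
```

Each stage only has to learn a modest resolution jump, which is the motivation the text gives for stacking instead of mapping coarse to fine in a single step.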

Subsequent studies clarified when CNNs add value and where they remain fragile. Configuration/intercomparison work showed that naïve CNNs do not automatically dominate strong statistical baselines and that performance depends on architecture, loss, and domain shifts [11]. Continental-scale deployments demonstrated feasibility without local hand-engineering, though mean-temperature gains were sometimes modest, motivating hybrid or physics-aware objectives [38]. Residual CNNs stabilized deeper training and better preserved high frequencies, improving extremes relative to plain CNNs but still leaving uncertainty implicit [21]. U-Net style models enabled daily 1 km multivariate downscaling (e.g., precipitation, temperature, radiation, vapor pressure, wind) through 2100 with reproducible pipelines, while highlighting the need to quantify cross-variable physical constraints (e.g., energy/water balances) [8,22]. Rigorous regional applications (e.g., Iberia CMIP6) further operationalized DL-based statistical downscaling with careful splits and diagnostics in multi-ESM settings, but continued to surface concerns about robustness under future extremes and transfer to distinct climates [39]. Additional evaluations comparing CNNs to alternative methods reinforce the importance of robust validation and baseline choice [11,18].

4.2.2. Architectural Innovations

U-Nets

Originally designed for biomedical segmentation [40], the U-Net architecture has proven effective in climate downscaling, outperforming statistical benchmarks [8]. Its symmetric encoder–decoder structure uses skip connections to retain fine-grained spatial details, which is crucial for high-resolution reconstruction [8]. While variants like U-Net++ [41] and U-Net_DCA [42] offer refined performance for specific tasks, the standard architecture also excels in GCM bias correction [43]. However, standard U-Nets are inherently static, processing time steps independently ($X_{t} \to Y_{t}$). Because they lack internal memory to model dynamics (such as storm persistence or lagged responses), they suffer from temporal incoherence, often requiring integration with Recurrent Neural Networks or 3D convolutions to resolve this limitation.
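Why skip connections matter for fine detail can be seen in a one-dimensional toy example. The sketch below is a deliberately stripped-down illustration with no learned weights: average pooling stands in for the encoder, value repetition for the decoder, and the skip path simply reuses the input in place of a same-resolution encoder feature map. All function names and values are hypothetical.

```python
def avg_pool2(x):
    """Encoder step: halve the resolution by averaging neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def upsample2(x):
    """Decoder step: double the resolution by repeating each value."""
    return [value for value in x for _ in range(2)]


def with_skip(decoded, encoder_feature):
    """Skip connection: stack decoder output and encoder feature as channels."""
    return list(zip(decoded, encoder_feature))


fine = [1.0, 3.0, 2.0, 8.0]           # hypothetical high-resolution transect
decoded = upsample2(avg_pool2(fine))  # bottleneck path alone: peaks are averaged away
merged = with_skip(decoded, fine)     # each position now also carries the detail

print("decoder only:", decoded)
print("with skip:   ", merged)
```

The decoder-only path returns `[2.0, 2.0, 5.0, 5.0]`, so the peak of 8 is lost. With the skip channel, a subsequent convolution could recover it. Note that this operates on a single time step, which mirrors the static limitation noted above.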

Residual Networks (ResNets)

To overcome the training instabilities of plain deep CNNs (e.g., vanishing gradients), researchers adopted Residual Networks (ResNets), which introduced the concept of “residual learning” [44]. Instead of learning a direct mapping, ResNet layers learn a residual mapping with reference to the layer inputs, facilitated by shortcut or skip connections that perform identity mapping and are added to the output of stacked layers. This formulation makes optimization easier and allows for the construction of significantly deeper networks without suffering from degradation or vanishing gradients. In climate downscaling, architectures like the Super-Resolution Deep Residual Network (SRDRN) developed by Wang et al. [21] have demonstrated the benefits of this approach. Other ResNet-based models, such as Very Deep Super-Resolution (VDSR) [45] and Enhanced Deep Super-Resolution (EDSR) [46], have also shown superior performance in image super-resolution, outperforming simpler SRCNNs by leveraging deeper architectures and residual learning, which has inspired similar approaches for downscaling variables such as temperature. In the downscaling context, residual blocks helped CNNs preserve high-frequency details (e.g., daily temperature variability) that earlier CNNs blurred out. The introduction of ResNet-based architectures thus directly addressed one limitation of early CNNs: it became feasible to build deeper models that capture multi-scale features without degradation.
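The residual formulation amounts to computing y = x + F(x), where F is the stacked-layer transform. The sketch below is a minimal illustration with hypothetical stand-in transforms in place of trained convolutional layers.

```python
def residual_block(x, transform):
    """Return x + F(x): the identity shortcut added to the learned residual."""
    residual = transform(x)
    return [xi + ri for xi, ri in zip(x, residual)]


def zero_transform(x):
    """Stand-in for layers that have learned nothing: F(x) = 0."""
    return [0.0 for _ in x]


def small_correction(x):
    """Stand-in for layers that learn a small local adjustment."""
    return [0.5 if xi > 20.0 else 0.0 for xi in x]


temps = [18.0, 21.0, 25.0]  # hypothetical interpolated temperatures (degC)
print(residual_block(temps, zero_transform))    # input passes through unchanged
print(residual_block(temps, small_correction))  # only the correction is learned
```

Because a block with F(x) = 0 reduces to the identity, adding more blocks cannot in principle make the representable mapping worse, which is the intuition behind training much deeper networks without degradation.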

Generative Adversarial Networks (GANs)

A common limitation of deterministic super-resolution models trained only with pointwise losses (e.g., MAE/RMSE) is spatial averaging: the model minimizes mean error by producing fields that are too smooth, which suppresses sharp gradients, localized structure, and intensity peaks [1,47,48]. GAN-based downscalers address this by adding an adversarial objective: a discriminator learns to distinguish generated outputs from real high-resolution fields, and the generator is trained to fool it, which can recover higher-frequency spatial texture and sharper features that pixelwise objectives tend to wash out [1,49].
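The spatial-averaging effect follows directly from what a pointwise loss rewards. In the sketch below, two hypothetical and equally likely fine-scale rain fields share the same coarse input but place the storm peak in different pixels; the blurred pixelwise mean scores a lower expected MSE than either sharp field. The `generator_loss` function then shows one common way of adding an adversarial term (the non-saturating form, with a hypothetical weight `adv_weight`); this is an assumption for illustration, not the objective of any specific cited model.

```python
import math


def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


# Two equally likely fine-scale fields (mm) for one coarse cell; hypothetical values.
scenario_a = [0.0, 40.0, 0.0, 0.0]
scenario_b = [0.0, 0.0, 40.0, 0.0]


def expected_mse(pred):
    """Expected MSE when either scenario occurs with probability 0.5."""
    return 0.5 * mse(pred, scenario_a) + 0.5 * mse(pred, scenario_b)


blurred = [(a + b) / 2 for a, b in zip(scenario_a, scenario_b)]  # pixelwise mean


def generator_loss(content_loss, d_fake, adv_weight=1e-3):
    """Content loss plus a non-saturating adversarial term, -log D(G(x))."""
    return content_loss - adv_weight * math.log(d_fake)


print("expected MSE, blurred field:", expected_mse(blurred))
print("expected MSE, sharp field:  ", expected_mse(scenario_a))
print("peak intensity, blurred vs sharp:", max(blurred), "vs", max(scenario_a))
```

The blurred field halves the expected MSE (200 versus 400) while also halving the peak intensity (20 mm versus 40 mm). The adversarial term penalizes outputs that the discriminator scores as unrealistic (low `d_fake`), which counteracts this pull toward the mean.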

Computational trade-offs: These realism gains typically come with (i) higher training cost because two networks must be trained jointly (each iteration requires generator and discriminator forward/backward passes, increasing compute and GPU memory footprint), (ii) training instability and tuning burden (sensitivity to learning-rate balance, update ratios, and regularization, as well as risks of oscillatory dynamics or mode collapse that demand careful monitoring and hyperparameter sweeps), and (iii) a verification burden that is stricter than for purely regression-based models [50,51]. In scientific forecasting, adversarial realism does not automatically imply physical fidelity: GANs can introduce plausible-looking but spurious fine-scale variability or distort tail behavior, so operational use typically requires additional constraints (e.g., conditional training, physics-aware penalties) and expanded diagnostics focused on extremes, conservation checks, and spatial spectra [28,29,30,52]. Inference can be fast once trained (a single generator pass); however, overall operational feasibility depends on whether the up-front training/tuning and the added validation requirements are budgeted alongside latency needs [53].

Diffusion Models

While GANs improved texture and extremes, they can suffer from stability and mode-collapse issues (sometimes failing to represent the full variability of the data). Recently, diffusion models have emerged as an alternative generative approach that avoids some pitfalls of GANs. Diffusion-based downscaling models learn the data distribution by progressively adding and removing noise, which stabilizes training and can better capture uncertainty in predictions. By design, diffusion models address the limitation of GANs by providing a more straightforward way to generate a diverse ensemble of outcomes (the ability to represent extremes). Initial studies applying diffusion models to climate data have shown they can produce sharp results comparable to GANs but with more consistent coverage of the distribution. These models learn to reverse a gradual noising process, starting from a simple noise distribution and iteratively refining it to generate a data sample [20]. This iterative process allows them to capture complex, high-dimensional distributions with high fidelity, making them exceptionally well-suited for generating realistic and physically plausible climate fields. Key innovations in diffusion-based downscaling include Latent Diffusion Models (LDMs), which mitigate computational costs by operating in compressed latent space to achieve high fidelity [31,54], and Spatio-Temporal Video Diffusion (STVD), which adapts video generation techniques to capture temporal precipitation dynamics better than GANs [32]. Hybrid frameworks further optimize efficiency by combining coarse physical models with generative refinement, reducing costs by over 97% while maintaining physical consistency [9], while distributional corrections improve the representation of extremes [33]. 
Although diffusion models offer distinct advantages over GANs—such as stable training without mode collapse and inherent capabilities for uncertainty quantification [20,55]—they remain computationally expensive because of iterative sampling and require further research to ensure physical consistency in this emerging field [31].
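
The forward (noising) half of this process has a simple closed form, which the following minimal sketch illustrates using only the Python standard library. The linear noise schedule, the step count, and the four made-up pixel values are illustrative assumptions, not settings taken from any of the cited downscaling studies; the learned reverse (denoising) network is not shown.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule.

    The schedule endpoints are assumed, illustrative values; num_steps >= 2.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta_t = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta_t
        alpha_bars.append(running)
    return alpha_bars

def noise_field(x0, alpha_bar_t, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * v + noise * e for v, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
alpha_bars = alpha_bar_schedule(num_steps=1000)
x0 = [0.0, 1.5, 3.0, 0.5]  # made-up normalized high-resolution pixel values
x_early, _ = noise_field(x0, alpha_bars[0], rng)   # almost unchanged
x_late, _ = noise_field(x0, alpha_bars[-1], rng)   # almost pure noise
```

In a typical setup, a network is trained to predict `eps` from `x_t` (plus the coarse-resolution conditioning field), and sampling then applies that network over many steps. This is why inference cost grows with the number of sampling steps.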

Transformers

All the above methods (CNNs, GANs, diffusion) primarily focus on spatial detail at a given scale or region. However, they often use limited context (CNNs have finite receptive fields, and GANs/diffusion are typically local). To capture long-range dependencies and push toward a more globally consistent downscaling, researchers have begun employing Transformer architectures.

Originally developed for NLP [56], Transformers and Vision Transformers (ViTs) [57] have been adopted in downscaling for their ability to model long-range teleconnections via self-attention, offering a critical advantage over the local receptive fields of CNNs [23]. Transformers, with their self-attention mechanism, allow modeling relationships across an entire field or even multiple regions. This innovation directly tackles the limitation of CNN-based models’ locality: for instance, Vision Transformers have been used to learn continental-scale climate patterns and to transfer learning across regions. Transformers excel when abundant training data are available (as in global reanalyses), enabling cross-region generalization and even “zero-shot” downscaling in new domains [23]. At the same time, they are computationally heavy and data-hungry, so their benefits become most apparent in data-rich, global applications, whereas for smaller regions a well-trained CNN or GAN may suffice. To mitigate the quadratic complexity ($O(N^{2})$) of standard attention, specialized architectures like SwinIR [58] and PrecipFormer employ window-based mechanisms to efficiently capture localized dynamics [24]. A major innovation is zero-shot generalization, where models like EarthViT adapt to new resolutions without retraining [15], often outperforming neural operators in these settings [59]. Furthermore, Transformers form the backbone of foundation models such as Prithvi-WxC [16], ORBIT-2 [60], and FourCastNet [23], shifting the paradigm toward fine-tuning large, pre-trained representations to enhance transferability and reduce data requirements [17,61].
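
To make the cost argument concrete, the sketch below counts the token pairs scored by full self-attention and by attention restricted to non-overlapping windows. The 128 × 128 token grid and the 8 × 8 window are illustrative assumptions. The count ignores the shifted windows, attention heads, and feature dimensions that real window-based architectures use.

```python
def global_attention_pairs(height, width):
    """Token pairs scored by full self-attention over an H x W grid of tokens."""
    n_tokens = height * width
    return n_tokens * n_tokens

def windowed_attention_pairs(height, width, window):
    """Token pairs scored when attention is limited to non-overlapping
    window x window blocks (the grid must be divisible by the window size)."""
    if height % window or width % window:
        raise ValueError("grid must be divisible by the window size")
    n_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return n_windows * tokens_per_window ** 2

full = global_attention_pairs(128, 128)               # N^2 with N = 16,384
local = windowed_attention_pairs(128, 128, window=8)  # N * 64
ratio = full // local
```

Under these assumptions, the windowed variant scores 256 times fewer pairs, and its cost grows linearly rather than quadratically with the number of tokens. The price is that long-range interactions must be recovered through stacked or shifted windows.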

In summary, each successive architecture was introduced to resolve the specific weaknesses of its predecessor: deeper ResNets to stabilize and refine CNN outputs, GANs (and later diffusion models) to generate realistic extremes beyond the capacity of mean-based predictors, and Transformers to integrate long-range context and facilitate transferability. This methodological evolution underpins the narrative of our review—linking the emergence of each new model class to the challenges identified in earlier ones. The progression to specialized DL architectures (U-Nets, GANs, Transformers, Diffusion Models) provides a diverse toolkit for climate downscaling, though each brings distinct trade-offs between strengths like spatial feature extraction and limitations such as training instability. Consequently, no single model is universally superior; as illustrated in Figure 2, the choice is increasingly task-dependent, governed by specific needs for physical consistency or extreme event accuracy. This complexity underscores the need for systematic intercomparison, a goal currently pursued by the CORDEX ML Task Force [62].

This complex evolution, which involves a critical trade-off between model power and interpretability, is visualized in the timeline in Figure 3. The figure illustrates how the progression of ML models has been mirrored by an evolution in the scientific problem formulation itself, leading to a “trust deficit” that the community is now actively working to address.

Operational Constraints and Deployability

Beyond benchmark skill scores, whether deep learning can realistically replace or complement statistical baselines depends on operational factors: (i) data requirements (coverage, record length, and representativeness), (ii) deployment cost (hardware, storage, and maintaining inference pipelines), and (iii) inference latency (especially for time-sensitive applications). Deterministic CNN-style downscalers are typically inexpensive at inference, whereas iterative generative samplers (e.g., diffusion-based methods) can incur higher latency unless accelerated [20,54]. Training large models may also be costly even when downstream inference is efficient [16]. These constraints are central to operational feasibility and should be weighed alongside accuracy when assessing whether DL systems can replace established statistical methods in practice [63].

Figure 3. A conceptual timeline illustrating the parallel evolution of ML models, scientific problem formulations, and the challenge of trust in climate downscaling from 2010 to 2025. As model power and realism have increased (right-hand axis), the inherent explainability of the models has decreased (left-hand axis), creating a “trust deficit”. This has spurred the development of new scientific questions and a counter-movement focused on rebuilding trust through XAI and PIML. Key references by era: Classical models (2010–2016)—Ghosh [64], He et al. [35]; Deep learning shift (2016–2021)—Vandal et al. [13], Baño-Medina et al. [11]; Frontier models (2021–2025)—Leinonen et al. [28], Tomasi et al. [31], Curran et al. [15]; Trust deficit solutions—González-Abad et al. [65], Harder et al. [27].

5. The Physical Frontier: Hybrid and Physics-Informed Downscaling

Deep learning can capture complex statistical patterns; however, purely data-driven downscaling may yield physically implausible outputs that violate conservation laws (e.g., mass and energy) [25]. This lack of physical grounding contributes to the “trust deficit” and can trigger failures under out-of-distribution future climates where historical relationships may not hold. In response, two major frontiers have emerged: Physics-Informed Machine Learning (PIML) and hybrid physics–ML frameworks. We group these into the following: (1) injecting physical knowledge into ML via soft constraints (physics-based loss penalties) or hard constraints (architectures that enforce laws exactly), and (2) hybrid dynamical–ML approaches that split the task between a physical model (e.g., an RCM) and a learned ML component. These and other methodological interventions to address common failure modes are summarized in Figure 4. The subsections follow this structure, separating general principles from implementations to clarify how physics enters ML and to support comparison of each approach’s merits and limitations.

5.1. The Imperative for Physical Consistency

Physical consistency is essential for scientific credibility and reliable impact assessment, not a purely academic concern. For instance, a downscaler that does not conserve water mass can yield unrealistic runoff in hydrological models, and thermodynamically inconsistent temperature–humidity combinations can distort heat-stress assessments. Multiple studies emphasize that embedding physical laws into learning provides strong regularization, steering models toward solutions that remain accurate on training data while generalizing more robustly to unseen conditions [25].

5.2. Architectural Integration of Physical Laws: PIML

Physics-Informed Neural Networks (PINNs) and related PIML techniques integrate domain knowledge in the form of physical laws directly into the model’s architecture or training process [26,66]. This is typically achieved through two main strategies:

Soft Constraints: This is the most common approach, where the standard data-fidelity loss term ($L_{data}$) is augmented with a physics-based penalty term ($L_{physics}$) [66]. The total loss becomes $L_{total} = L_{data} + λ L_{physics}$, where $λ$ is a weighting hyperparameter. $L_{physics}$ is often formulated as the residual of a governing differential equation (e.g., the continuity equation for mass conservation). By minimizing this residual across the domain, the network is encouraged, but not guaranteed, to find a physically consistent solution. This method is flexible and has been used to penalize violations of conservation laws [25] and to solve complex PDEs [26]. A common example is enforcing mass conservation in precipitation downscaling. If $x$ is the value of a single coarse-resolution input pixel and $\{\hat{y}_{i}\}_{i = 1}^{n}$ are the $n$ corresponding high-resolution output pixels from the neural network, a soft constraint can be added to the loss function to penalize deviations from mass conservation. In other words, the mean of the high-resolution pixels should match the value of the corresponding coarse pixel. The total loss, $L_{total}$, becomes a weighted sum of the data-fidelity term (e.g., Mean Squared Error, $L_{MSE}$) and a physics penalty term:

$L_{total} = L_{MSE} + λ_{phys} {∥\frac{1}{n} \sum_{i = 1}^{n} {\hat{y}}_{i} - x∥}^{2}$

(1)

where $λ_{phys}$ is a hyperparameter that controls the strength of the physical penalty. Minimizing this loss encourages, but does not guarantee, that the mean of the high-resolution patch matches the coarse-resolution value.
Hard Constraints (Constrained Architectures): This approach modifies the neural network architecture itself to strictly enforce physical laws by design, often through a specialized, non-trainable output layer. For example, Harder et al. [27] introduced output layers that guarantee mass conservation by ensuring that the mean of the high-resolution output pixels equals the value of the coarse-resolution input pixel (equivalently, that their sum equals $n \cdot x$). Such methods provide an absolute guarantee of physical consistency for the constrained property, which can improve both performance and generalization. While more difficult to design and potentially less flexible than soft constraints, they represent a more robust method for embedding inviolable physical principles [27]. Continuing the mass conservation example, let $\{\tilde{y}_{i}\}_{i = 1}^{n}$ be the raw, unconstrained outputs from the final hidden layer of the network. A multiplicative constraint layer can be designed to produce the final, constrained outputs $\{y_{i}\}$ that are guaranteed to conserve mass:

$y_{j} = \tilde{y}_{j} \cdot \frac{x \cdot n}{\sum_{i = 1}^{n} \tilde{y}_{i}} \quad \text{for } j = 1, \dots, n$

(2)

This layer rescales the raw outputs $\tilde{y}$ such that their sum is precisely equal to $n \cdot x$ , thereby strictly enforcing the conservation law $\frac{1}{n} \sum y_{j} = x$ at every forward pass, without the need for a penalty term in the loss function.
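
Equations (1) and (2) can be sketched for a single coarse pixel as follows. This is a minimal illustration in plain Python rather than a training-ready implementation: the function names, the 2 × 2 patch values, and the guard against a non-positive sum are assumptions introduced here, and a real model would apply the same operations to tensors inside an automatic-differentiation framework.

```python
def soft_constraint_loss(y_hat, y_true, x_coarse, lam_phys):
    """Eq. (1): MSE plus a penalty on the patch-mean mismatch."""
    n = len(y_hat)
    mse = sum((p - t) ** 2 for p, t in zip(y_hat, y_true)) / n
    residual = sum(y_hat) / n - x_coarse
    return mse + lam_phys * residual ** 2

def hard_constraint_layer(y_raw, x_coarse):
    """Eq. (2): rescale raw outputs so that their mean equals x_coarse."""
    total = sum(y_raw)
    if total <= 0.0:
        # Assumption of this sketch: the multiplicative form needs a positive sum.
        raise ValueError("multiplicative rescaling needs a positive sum")
    scale = x_coarse * len(y_raw) / total
    return [v * scale for v in y_raw]

y_raw = [1.0, 2.0, 3.0, 6.0]   # made-up raw network outputs for a 2x2 patch
y_true = [0.5, 1.5, 2.0, 4.0]  # made-up high-resolution targets (mean = 2.0)
x = 2.0                        # coarse-pixel value

y_con = hard_constraint_layer(y_raw, x)
loss_raw = soft_constraint_loss(y_raw, y_true, x, lam_phys=1.0)
loss_con = soft_constraint_loss(y_con, y_true, x, lam_phys=1.0)
```

With these made-up numbers, the raw outputs have a patch mean of 3.0 and incur a penalty, whereas the rescaled outputs have a patch mean of exactly 2.0, so the penalty term vanishes and only the data-fidelity term remains.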

Practical Guidance on Soft vs. Hard Constraints: Choosing between these approaches depends on the problem requirements and constraints. Hard constraints are most suitable when an inviolable physical law must be strictly satisfied and can be encoded in the model structure (e.g., conservation-style constraints) [27,67]. Hard constraints provide guarantees but often require problem-specific engineering and can reduce flexibility if mis-specified. Soft constraints are preferable when multiple physical relationships or complex equations need to be considered simultaneously; they are typically easier to implement and more flexible, at the cost of only approximate enforcement. However, they require careful tuning of the penalty weight $λ_{phys}$: if it is set too low, the model may ignore the physical rule; if it is set too high, training can become unstable or accuracy may degrade because the constraint is over-prioritized. Their effectiveness can also vary across variables and regimes [68]. In summary, use hard constraints when absolute fidelity to a known law is critical and achievable (e.g., conserving precipitation volume), and use soft constraints to gently steer the model when dealing with complex or multiple physics (accepting minor violations in exchange for more flexibility).
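
The effect of the penalty weight can be seen in a toy setting where Equation (1) has a closed-form minimizer. The sketch assumes a single patch, an MSE averaged over the $n$ pixels, and made-up targets whose mean (3.0) disagrees with the coarse value (2.0), for instance because the two datasets are mutually inconsistent. It illustrates only the enforcement-versus-fit trade-off, not the optimization instabilities that large weights can cause in real training.

```python
def optimal_patch(targets, x_coarse, lam_phys):
    """Closed-form minimizer of Eq. (1) for one patch (MSE averaged over pixels)."""
    n = len(targets)
    target_mean = sum(targets) / n
    # The remaining constraint violation shrinks as 1 / (1 + lambda).
    residual = (target_mean - x_coarse) / (1.0 + lam_phys)
    # Every pixel is shifted by the same amount to reduce the violation.
    y_opt = [t - lam_phys * residual for t in targets]
    # Data-fidelity cost paid for that shift.
    mse = (lam_phys * residual) ** 2
    return y_opt, residual, mse

targets = [1.0, 2.0, 3.0, 6.0]  # made-up targets, mean = 3.0
x = 2.0                         # coarse value the patch mean should match
sweep = {lam: optimal_patch(targets, x, lam)[1:] for lam in (0.0, 1.0, 9.0, 99.0)}
```

In this toy case the violation falls from 1.0 at $λ_{phys} = 0$ to 0.1 at $λ_{phys} = 9$ while the data-fidelity error rises from 0 to 0.81. The violation never reaches zero for any finite weight, which is the sense in which soft constraints encourage but do not guarantee consistency.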

5.3. Hybrid Frameworks: Merging Dynamical and Statistical Strengths

Hybrid models seek to combine the strengths of traditional physics-based dynamical models (RCMs) with the efficiency and pattern-recognition capabilities of ML. Instead of replacing the physics entirely, ML is used to augment or accelerate parts of the physical modeling chain.

By contrast, a purely ML generative approach (e.g., a diffusion model applied end-to-end) does not require an RCM upfront, but must learn fine-scale structure and physical realism from data alone and may incur higher inference latency because of iterative sampling [20,54]. The hybrid strategy trades pipeline complexity (running an RCM plus an ML component) for greater physical anchoring: the ML stage refines an intermediate state already shaped by dynamical constraints, which can improve robustness relative to unconstrained purely statistical generators [9,68].

A state-of-the-art example is the dynamical–generative downscaling framework proposed by Lopez-Gomez et al. [9]. This multi-stage approach involves the following:

1. An initial, computationally inexpensive dynamical downscaling step using an RCM to bring coarse ESM output to an intermediate resolution (e.g., from 100 km to 45 km). This step grounds the output in a physically consistent dynamical state.
2. A subsequent generative ML step, using a conditional diffusion model, to perform the final super-resolution to the target scale (e.g., from 45 km to 9 km). The diffusion model learns to add realistic, high-frequency spatial details.

This hybrid strategy is powerful because it leverages the RCM for what it does best—ensuring physical consistency and generalization across different GCMs—while using the diffusion model for its strengths: computational efficiency and generating high-fidelity, stochastic textures. This approach was shown to reduce the computational cost of the most expensive downscaling stage by over 97.5% while producing outputs with lower errors and more realistic spatial spectra than traditional statistical methods [9]. Such hybrid frameworks represent a pragmatic and powerful path toward scalable, physically credible, and computationally tractable downscaling of large climate model ensembles.

However, hybrid approaches carry their own complexities. They still require an initial RCM simulation, so any biases or errors in that dynamical model can propagate to the final output. For example, if an RCM underestimates extreme rainfall, the ML stage can only refine that biased baseline. Running and coupling two systems (an RCM plus an ML model) also adds operational overhead, which may limit scalability. A purely ML generative approach avoids this overhead but must learn all scales from data, can struggle with physical realism, and may require vast training data. Anchoring the fine-scale generation in a physically consistent intermediate state relieves the ML component of that burden, which has been reported to improve stability and generalization over standalone generative models that might otherwise produce physically implausible outputs or degrade under climate shifts. Nonetheless, hybrids inherit the requirement of maintaining an RCM pipeline and careful bias correction between the two stages.

5.4. Enforcing Physical Realism in Practice

5.4.1. The Frontier of Physics-Informed Machine Learning (PIML)

Physics-Informed Machine Learning (PIML) integrates physical knowledge directly into ML to improve accuracy, generalizability, and physical consistency—a critical requirement in climate downscaling where credibility depends on adherence to physical laws.

The Promise of Physics–ML Integration

As highlighted by Harder et al. [27] and related studies [25], embedding physical constraints during training can (i) enforce conservation of mass and energy [8]; (ii) maintain thermodynamic consistency (e.g., among temperature, humidity, and precipitation); (iii) reduce data requirements by using physical laws as strong regularization [26]; and (iv) improve extrapolation to unseen conditions by relying on principles expected to hold even when statistical relationships shift.

A common framework augments a data-driven loss $L_{data}$ with physics-based penalties $L_{physics}$ (e.g., conservation equations or PDE residuals), yielding the following:

$L_{total} = L_{data} + λ_{physics} L_{physics} + λ_{reg} L_{regularization}$

(3)

as in Raissi et al. [26].

Implementation Approaches for PIML

Hard constraints: Modify the architecture or add constraint layers to strictly satisfy selected laws [27], e.g., enforcing that total downscaled precipitation matches the coarse-grid input (water mass conservation). Pros: guarantees consistency for enforced laws. Cons: harder to design and can reduce flexibility if constraints are overly restrictive or misspecified.
Soft constraints via loss functions: Add penalty terms for physical-law violations to the training objective [26]. Pros: flexible and can incorporate multiple principles, including complex non-linear PDEs. Cons: encourages but does not guarantee satisfaction; performance depends on tuning $λ_{physics}$ .
Hybrid statistical–dynamical models: Combine ML with components of dynamical models [8], using ML to emulate expensive parameterizations within an RCM or to learn corrective terms for RCM biases, thereby leveraging the physical basis of dynamical components.

Figure 4. A schematic of the primary methodological interventions to address the common failures of a standard ML downscaling pipeline. To overcome the “trust deficit”, researchers employ techniques to (1) enforce physical realism using PIML, (2) enhance extreme event representation with specialized loss functions, (3) ensure trust and scientific validity through XAI, and (4) robustly quantify uncertainty and test for out-of-distribution generalization. Key references by intervention: Intervention 1 (PIML)—Raissi et al. [26], Harder et al. [27]; Intervention 2 (Extremes)—Rampal et al. [30], Cannon [36]; Intervention 3 (XAI)—González-Abad et al. [65], Rampal et al. [30]; Intervention 4 (UQ & Generalization)—González-Abad and Baño-Medina [69], Lanzante et al. [7].

Case Studies and Results

The application of PIML techniques in climate downscaling is an active area of research. Emerging studies and conceptual comparisons suggest that physics-informed DL can improve RMSE, extreme event capture, energy conservation, and transferability relative to standard DL.

Several studies report measurable gains from incorporating physical information. For example, Harder et al. [27] show that enforcing conservation via a hard constraint can reduce bias and improve predictive skill relative to an unconstrained baseline across various climate datasets, while guaranteeing the targeted physical property. Likewise, physics-informed loss formulations can improve the representation of extremes and other tail behaviors compared with standard objectives [70]. Surveys further note that the magnitude of these gains depends on the variable, regime, and constraint fidelity, and that improved physical consistency does not necessarily yield uniformly better pointwise error metrics [68].

Integrating physical principles into ML is a promising route to mitigate key limitations of purely data-driven downscaling by steering learning with established knowledge, aiming for outputs that are both statistically skillful and scientifically credible for climate-change understanding and adaptation. However, developing robust, generalizable, and computationally efficient PIML for the climate system’s multi-scale dynamics remains challenging. In practice, PIML can be substantially more expensive to train than standard DL, particularly when $L_{physics}$ requires evaluating residuals of complex partial differential equations at many spatio-temporal points per optimization step [68]. Scaling to multi-physics, multi-scale settings is non-trivial, and the weighting factor $λ_{physics}$ in Equation (3) typically demands careful, problem-specific tuning (often via extensive hyperparameter searches). Moreover, designing accurate yet tractable constraints for all relevant processes (e.g., cloud microphysics, radiative transfer, boundary layer dynamics) is difficult; even hard constraints that guarantee laws such as mass conservation (e.g., Harder et al. [27]) can become architecturally complex beyond relatively simple conservation principles. These issues motivate future work on more efficient PIML training, automated constraint learning/discovery, and more robust approaches to tuning $λ_{physics}$.

The preceding sections traced architectural evolution and how hybrid physics–ML designs seek to restore physical credibility without sacrificing statistical skill. However, across this model diversity, inconsistent evaluation remains a major barrier to cumulative progress: datasets, metrics, event definitions, and splits are rarely aligned, limiting fair comparisons and obscuring failure modes (e.g., non-stationarity, extremes, process-level errors). The next section therefore proposes a prescriptive, model-agnostic evaluation protocol—covering data splits, baseline anchors, event-focused metrics, physical diagnostics, and uncertainty tests—to ensure results are comparable and decision-relevant.

Limitations and Open Challenges: Despite their promise, PIML and hybrid approaches have notable drawbacks. PIML training often costs more than standard DL, and balancing accuracy against physical enforcement is delicate because the weighting of $L_{physics}$ often requires extensive tuning. Physics-based loss calculations can increase training time substantially, from tens of percent up to several-fold for otherwise identical models, because PDE residuals must be evaluated at many points in each optimization step. Memory use can also rise because training back-propagates through physics computations, sometimes requiring multi-GPU setups. Even so, the upfront training cost is typically lower than running a high-resolution physics-based model for every scenario, and once trained, PIML inference remains extremely fast (seconds per field), orders of magnitude faster than integrating an RCM over the same period; thus, PIML shifts cost to training while preserving cheap deployment.

Not all relevant physics is easily expressible as constraints (e.g., complex cloud microphysics or multivariate relationships), and hard-constraint designs, while guaranteeing specific conservations, can be inflexible and hard to generalize beyond the enforced law. Hybrid models also depend on RCM quality: they achieve “the best of both worlds” only when RCM biases are small and ML can focus on fine-scale correction; in practice, hybrids still need bias correction and introduce additional pipeline complexity. Generalization to new regions or higher resolutions can be fragile, with instabilities or errors if either component (RCM or ML) is pushed beyond its design or training regime. Recognizing these failure modes (e.g., instability, variable-specific weakness, high compute demand) is essential, and ongoing work targets more efficient PIML training, automated constraint discovery, and robust hybrid calibration.

6. Data, Variables, and Preprocessing Strategies in ML-Based Downscaling

The efficacy of any ML-based downscaling approach is profoundly influenced by the quality, characteristics, and processing of the input and target datasets, as well as the choice of variables. The selection of appropriate data sources and preprocessing strategies is crucial for training robust models that can generalize effectively.

Broadly, decisions at this stage can be grouped by their focus on (i) data reliability (using sources and preprocessing steps that maximize statistical quality and consistency), (ii) physical relevance (including predictors and variables that carry physical meaning for the downscaling task), and (iii) application-specific sensitivity (choices tailored to how the downscaled output will be used, such as preserving extremes for risk assessment). We structure this section accordingly: first discussing the common data sources for predictors and targets (emphasizing reliability and coverage), then the selection of variables (physical relevance), followed by feature engineering and preprocessing techniques (addressing issues like bias and non-stationarity that affect both reliability and end-use). By considering these aspects, one can ensure the training dataset is not only statistically sound but also physically appropriate and aligned with the intended downstream application.

6.1. Common Predictor Datasets (Low-Resolution Inputs)

The primary sources of low-resolution predictor data for ML downscaling models include global reanalysis products and outputs from GCMs and RCMs.

ERA5 Reanalysis: The fifth-generation ECMWF atmospheric reanalysis, ERA5, is extensively used as a source of predictor variables, particularly for training models in a “perfect-prognosis” framework [71,72]. ERA5 provides a globally complete and consistent, high-resolution (relative to GCMs, typically 31 km or 0.25°) gridded dataset of many atmospheric, land-surface, and oceanic variables from 1940 onwards, assimilating a vast number of historical observations. Its physical consistency and observational constraint make it an ideal training ground for ML models to learn relationships between large-scale atmospheric states and local climate variables. Often, models trained on ERA5 are subsequently applied to downscale GCM projections.
CMIP5/CMIP6 GCM Outputs: Outputs from the Coupled Model Intercomparison Project Phase 5 (CMIP5) and Phase 6 (CMIP6) GCMs are indispensable when the objective is to downscale future climate projections under various emission scenarios (e.g., Representative Concentration Pathways—RCPs, or Shared Socioeconomic Pathways—SSPs). These GCMs provide the large-scale atmospheric forcing necessary for projecting future climate change. However, their coarse resolution and inherent biases necessitate downscaling and often bias correction before their outputs can be used for regional impact studies [10,72].
CORDEX RCM Outputs: Data from the Coordinated Regional Climate Downscaling Experiment (CORDEX) are also used, particularly when ML techniques are employed for further statistical refinement of RCM outputs, as RCM emulators, or in hybrid downscaling approaches. CORDEX provides dynamically downscaled climate projections over various global domains, offering higher resolution than GCMs and incorporating regional climate dynamics. However, these outputs may still require further downscaling for very local applications or may possess biases that ML can help correct.

6.2. High-Resolution Reference Datasets (Target Data)

The selection of high-resolution reference data, or “ground truth”, is critical for training and validating supervised ML downscaling models.

Gridded Observational Datasets: Products like PRISM (Parameter-elevation Regressions on Independent Slopes Model) for North America [8,73], Iberia01 for the Iberian Peninsula [74], E-OBS for Europe [75], and regional datasets like REKIS [76] are commonly used [8]. PRISM, for example, provides high-resolution (e.g., 800 m or 4 km) daily temperature and precipitation data across the conterminous United States, incorporating physiographic influences like elevation and coastal proximity into its interpolation [73]. These datasets are invaluable for training models in a perfect-prognosis setup, where historical observations are used as the target.
Satellite-Derived Products: Satellite observations offer global or near-global coverage and are increasingly used as reference data. Notable examples include the Global Precipitation Measurement (GPM) mission’s Integrated Multi-satellitE Retrievals for GPM (IMERG) products for precipitation [77] and the Soil Moisture Active Passive (SMAP) mission for soil moisture [78]. GPM IMERG, for instance, provides precipitation estimates at resolutions like 0.1° and 30 min intervals, with various products (Early, Late, and Final Run) catering to different latency and accuracy requirements [77].
Regional Reanalyses or High-Resolution Simulations: In some cases, outputs from high-resolution regional reanalyses or dedicated RCM simulations (sometimes run specifically for the purpose of generating training data) are used as the “truth” data, especially when high-quality gridded observations are scarce [18].
FluxNet: For variables related to land surface processes and evapotranspiration, data from the FluxNet network of eddy covariance towers provide valuable site-level observational data for model validation [79]. These towers measure exchanges of carbon dioxide, water vapor, and energy between ecosystems and the atmosphere.

The choice between these predictor and target datasets is contingent on the specific downscaling objective (e.g., future projections versus historical analysis), data availability for the region of interest, and the variables being downscaled. While ERA5 and CMIP6 GCMs are standard choices for predictor data, the target data often comes from gridded observations or specialized high-resolution model runs.

6.3. Key Downscaled Variables

The primary focus of ML-based downscaling has historically been on the following:

Daily Precipitation and 2-m Temperature: These are the most commonly downscaled variables because of their direct relevance for impact studies (e.g., agriculture, hydrology, health). This includes mean, minimum, and maximum temperatures.
Multivariate Downscaling: There is a growing trend towards downscaling multiple climate variables simultaneously (e.g., temperature, precipitation, wind speed, solar radiation, humidity). This is important for ensuring physical consistency among the downscaled variables.
Spatial/Temporal Scales: Typical downscaling efforts aim to increase resolution from GCM/Reanalysis scales of 25–100 km to target resolutions of 1–10 km, predominantly at a daily temporal resolution.

6.4. Feature Engineering and Selection

The process of selecting and engineering input features is critical for the success of ML-based downscaling.

Static Predictors: High-resolution static geographical features such as topography (including elevation, slope, and aspect), land cover type, soil properties, and climatological averages are frequently incorporated as additional predictor variables. These features provide crucial local context that is often unresolved in coarse-scale GCM or reanalysis outputs. For instance, orography heavily influences local precipitation patterns and temperature lapse rates, while land cover affects surface energy balance and evapotranspiration [37,73]. The inclusion of these static predictors allows ML models to learn how large-scale atmospheric conditions interact with local surface characteristics to produce fine-scale climate variations.
Dynamic Predictors: For specific variables like soil moisture, dynamic predictors such as Land Surface Temperature (LST) and Vegetation Indices (e.g., NDVI, EVI) derived from satellite remote sensing are often used, as these variables capture short-term fluctuations related to surface energy and water balance [80].
Dimensionality Reduction and Collinearity: When dealing with many potential predictors, dimensionality reduction techniques like Principal Component Analysis (PCA) are sometimes employed to reduce the number of input features while retaining most of the variance. This can help to mitigate issues related to collinearity among predictors and reduce computational load. Regularization techniques (e.g., L1 or L2 regularization) embedded within many ML models also implicitly handle collinearity by penalizing large model weights.
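As a minimal sketch of the PCA step described above, the following NumPy snippet standardizes a set of collinear predictors and keeps the smallest number of principal components that reaches a chosen variance threshold. The function name `pca_reduce`, the 95% threshold, and the synthetic predictors are illustrative assumptions, not settings taken from any study cited here.

```python
import numpy as np

def pca_reduce(X, var_threshold=0.95):
    """Project standardized predictors onto the leading principal components.

    X has shape (n_samples, n_features). Returns (scores, k, variance_ratio).
    """
    # Standardize each predictor so that units do not dominate the decomposition.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized matrix; squared singular values are
    # proportional to the variance carried by each component.
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    ratio = s**2 / np.sum(s**2)
    # Smallest number of components whose cumulative variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(ratio), var_threshold) + 1)
    scores = Xs @ Vt[:k].T
    return scores, k, ratio

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
# Hypothetical predictor set: six variables built from two independent
# signals plus small noise, i.e., strongly collinear by construction.
X = np.hstack([base, base @ rng.normal(size=(2, 4))]) + 0.01 * rng.normal(size=(500, 6))
scores, k, ratio = pca_reduce(X)
print(k, np.round(ratio, 3))
```

The retained component scores are mutually uncorrelated, which is the property that mitigates collinearity; the trade-off is that individual components no longer correspond to a single physical predictor.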

The careful selection and engineering of features, particularly the integration of high-resolution static geographical information, significantly enhances the ability of ML models to capture local climate nuances. This suggests that the models are not merely learning statistical correlations from atmospheric variables alone but are also learning the complex interactions between these variables and fixed surface characteristics.

6.5. Data Preprocessing Challenges

Several challenges related to data preprocessing must be addressed to ensure the development of robust and reliable ML downscaling models.

Data-Scarce Areas: A significant hurdle is the availability of sufficient high-quality, high-resolution reference data for training and validation, especially in many parts of the developing world or in regions with complex terrain where observational networks are sparse [22]. In data-scarce regimes, deep learning models are prone to overfitting. However, this limitation is increasingly mitigated via Transfer Learning. Recent studies demonstrate that models pre-trained on data-rich domains (e.g., North America) or global reanalysis datasets (ERA5) learn universal atmospheric features—such as adiabatic lapse rates and frontal boundary structures—that are physically valid globally. By fine-tuning these pre-trained backbones with limited local data, robust performance can be achieved even in regions with sparse observational networks, effectively leveraging the “learned physics” from data-abundant regions.
Imbalanced Data for Extreme Events: Extreme climatic events (e.g., heavy precipitation, heatwaves) are, by definition, rare. This leads to imbalanced datasets where extreme values are underrepresented, potentially biasing ML models (trained with standard loss functions like MSE) to perform well on common conditions but poorly on these critical, high-impact events. This issue often hinders models from learning the specific characteristics of extremes.
Ensuring Domain Consistency: Predictor variables derived from GCM simulations typically exhibit systematic biases and different statistical properties (e.g., means, variances, distributions) compared to the reanalysis data (like ERA5) often used for model training. This mismatch, known as a domain or covariate shift, is a critical pre-processing consideration: even during historical periods, it violates the fundamental ML assumption that training and application data are independent and identically distributed (IID), resulting in performance degradation. Techniques such as bias correction of GCM predictors, working with anomalies (removing climatological means from both predictor and predictand data to focus on changes), or more advanced domain adaptation methods are employed to mitigate this issue and enhance consistency [81].
Quality Control and Gap-Filling: Observational and satellite-derived datasets frequently require substantial preprocessing steps, including quality control to remove erroneous data, and gap-filling techniques (e.g., interpolation) to handle missing values because of sensor malfunction or environmental conditions (like cloud cover for satellite imagery) [82].

The pervasive data-imbalance problem for extreme events highlights a potential mismatch between generic ML progress and climate-science needs: standard training objectives often fall short when faithfully capturing extremes is the priority, motivating domain-specific adjustments to architectures, losses, and/or data handling.

Domain shift mitigation: The above techniques help but have limits. Training on anomalies (departures from a baseline climatology) can reduce first-order biases and focus the model on changes, but it implicitly treats future change as a largely linear offset; shifts in variance or genuinely new extremes may not be captured by anomaly training alone. Simple bias correction of GCM inputs (e.g., mean/variance adjustments) can improve consistency over the training period, yet does not ensure robustness to structural differences in future climates (such as altered variability patterns or spatial error modes), since some biases and distribution shifts extend beyond the mean state. More advanced domain-adaptation approaches (e.g., transfer learning or adversarial methods) aim to bridge reanalysis and GCM domains more directly but require additional effort and show case-dependent success. Consequently, residual domain shift often persists even after bias correction and anomaly-based preprocessing, making rigorous validation under simulated “future” conditions essential. For example, pseudo-global warming (PGW) experiments, in which reanalysis fields are shifted toward future mean climates, can test whether anomaly-trained downscalers truly generalize; degradation under PGW indicates preprocessing has not fully addressed the shift. Overall, standardization, bias correction, and anomaly training are useful first-line defenses, not panaceas: robust transferability typically requires careful stress-testing and, when needed, fine-tuning across multiple climate inputs and/or dedicated domain-adaptation algorithms.
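The limitation of anomaly training noted above can be illustrated with synthetic data. In the sketch below, a hypothetical "GCM" series is given a +2 K warm bias and 1.5 times the day-to-day variability of a hypothetical "reanalysis" series; all numbers and names are assumptions chosen for illustration. Removing each dataset's own day-of-year climatology eliminates the mean offset but leaves the variance mismatch untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_days = 30, 365
seasonal = 10.0 * np.sin(2 * np.pi * np.arange(n_days) / n_days)

# Synthetic stand-ins (not real data): the "GCM" is 2 K warmer and more variable.
reanalysis = 15.0 + seasonal + rng.normal(0.0, 2.0, size=(n_years, n_days))
gcm = 17.0 + seasonal + rng.normal(0.0, 3.0, size=(n_years, n_days))

def to_anomaly(x):
    """Subtract the day-of-year climatology computed from the dataset itself."""
    climatology = x.mean(axis=0, keepdims=True)
    return x - climatology

rean_anom = to_anomaly(reanalysis)
gcm_anom = to_anomaly(gcm)

mean_gap_raw = gcm.mean() - reanalysis.mean()        # close to the imposed 2 K bias
mean_gap_anom = gcm_anom.mean() - rean_anom.mean()   # zero by construction
std_ratio_anom = gcm_anom.std() / rean_anom.std()    # variance mismatch survives

print(f"raw mean gap: {mean_gap_raw:.2f} K")
print(f"anomaly mean gap: {mean_gap_anom:.2e} K")
print(f"anomaly std ratio (GCM / reanalysis): {std_ratio_anom:.2f}")
```

A model trained on the reanalysis anomalies would therefore still see inputs with a broader distribution at inference time, which is one concrete form of the residual domain shift discussed in this section.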

6.6. Quantitative Benchmarks and Methodological Uncertainties in Preprocessing

The selection of preprocessing techniques is not merely a formatting step but a fundamental transformation that imposes structural priors on the downscaling model. Recent empirical benchmarks highlight that suboptimal choices in normalization and regridding can introduce systematic biases that architectural complexity cannot mitigate.

6.6.1. Normalization Sensitivity and Extremes

While range-based scaling is standard in computer vision, it can be methodologically problematic for heavy-tailed climate variables like precipitation. Huang [83] demonstrates that naïve pre-processing strategies for heavy-tailed variables can suppress gradients and that careful, often non-linear, pre-processing is critical for performance. To address the skewness of precipitation data, Choi et al. [84] demonstrate that applying weighted loss functions (e.g., Focal or Dice loss) significantly improves the recall of extreme events compared to unweighted baselines, offering a robust alternative to standard transforms. Failure to adopt such tail-aware strategies often leads to “poor performance” on extremes that is actually a preprocessing artifact rather than a model deficiency.
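To make the compression effect of range-based scaling concrete, the sketch below compares plain min-max scaling with min-max scaling applied after a `log1p` transform on synthetic heavy-tailed precipitation. The `log1p` transform is used here only as one common example of non-linear pre-processing; it is an assumption of this illustration and not necessarily the strategy evaluated in [83]. The gamma parameters and wet-day fraction are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic daily precipitation (mm/day): mostly dry days plus a heavy right tail.
wet = rng.random(10_000) < 0.3
precip = np.where(wet, rng.gamma(shape=0.5, scale=12.0, size=10_000), 0.0)

def minmax(x):
    """Scale values linearly onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

scaled_linear = minmax(precip)
scaled_log = minmax(np.log1p(precip))

# Position of the 99th percentile inside the [0, 1] input range.
p99_linear = np.quantile(scaled_linear, 0.99)
p99_log = np.quantile(scaled_log, 0.99)
print(f"p99 after min-max: {p99_linear:.3f}")
print(f"p99 after log1p + min-max: {p99_log:.3f}")

# The transform is invertible, so predictions can be mapped back to mm/day.
recovered = np.expm1(np.log1p(precip))
```

Under linear scaling, a single record event sets the upper bound and squeezes almost all samples into a narrow band near zero; the logarithmic variant spreads the bulk of the distribution over more of the input range.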

6.6.2. Regridding Artifacts and Representativeness

The interpolation method used to align coarse GCM predictors with fine-scale targets introduces distinct structural biases. As discussed by Lanzante et al. [7], representativeness issues and scale mismatches between predictors and predictands can lead to significant interpretability pitfalls. Standard bilinear interpolation often acts as a low-pass filter, effectively reducing the magnitude of high-percentile precipitation intensity relative to the original field. Conversely, conservative remapping preserves total water mass but can introduce block-like artifacts. Furthermore, Lanzante et al. [7] warn that these scale discrepancies create issues conceptually similar to an “error-in-variables” problem, where neglecting the representativeness error leads to overconfident projections.
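The contrast between the two remapping families can be sketched in one dimension. The example below takes a hypothetical row of coarse-cell precipitation values and refines it by a factor of four, once by linear interpolation between cell centres (a 1-D analogue of bilinear interpolation) and once by piecewise-constant replication (first-order conservative remapping onto nested, equal-size cells). The values and the nesting assumption are illustrative only.

```python
import numpy as np

factor = 4
# Hypothetical coarse-cell precipitation (mm/day) with one intense cell.
coarse = np.array([0.0, 0.0, 2.0, 20.0, 1.0, 0.0])
n_fine = coarse.size * factor

# Cell-centre coordinates on a common axis, in units of fine cells.
x_coarse = (np.arange(coarse.size) + 0.5) * factor
x_fine = np.arange(n_fine) + 0.5

# (a) Linear interpolation between coarse cell centres.
fine_linear = np.interp(x_fine, x_coarse, coarse)

# (b) First-order conservative remapping: replicate each coarse value.
fine_conservative = np.repeat(coarse, factor)

def block_mean(field, factor):
    """Aggregate a fine 1-D field back to the coarse grid by averaging nested cells."""
    return field.reshape(-1, factor).mean(axis=1)

print("peak, coarse input:      ", coarse.max())
print("peak, linear interp:     ", fine_linear.max())
print("peak, conservative remap:", fine_conservative.max())
print("re-aggregated linear:      ", block_mean(fine_linear, factor))
print("re-aggregated conservative:", block_mean(fine_conservative, factor))
```

In this toy case the interpolated field never reaches the coarse peak and does not return the original cell means when re-aggregated, whereas the conservative field does both but consists of flat blocks with sharp edges.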

6.6.3. The Bias Correction Paradox

Applying bias correction (e.g., Quantile Mapping) to GCM predictors prior to downscaling involves a trade-off between statistical fidelity and physical consistency. While pre-correction reduces the domain shift, Fallah et al. [85] warn that univariate correction can disrupt multivariate dependencies (e.g., the Clausius–Clapeyron relationship). This “preprocessing paradox” suggests that aggressive bias correction may improve historical validation metrics while degrading the physical plausibility of future projections.
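For readers unfamiliar with the mechanics, a minimal empirical quantile-mapping sketch is given below. It is a generic univariate formulation written for illustration (the function name `quantile_map`, the 101 quantile nodes, and the synthetic temperatures are assumptions), not the specific procedure examined in [85].

```python
import numpy as np

def quantile_map(model_hist, obs_hist, model_values, n_quantiles=101):
    """Map model values onto the observed distribution via matched empirical quantiles."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    model_q = np.quantile(model_hist, q)
    obs_q = np.quantile(obs_hist, q)
    # np.interp clamps inputs outside the calibration range to the end quantiles.
    return np.interp(model_values, model_q, obs_q)

rng = np.random.default_rng(3)
obs_hist = rng.normal(20.0, 3.0, size=5000)    # synthetic "observed" temperature
model_hist = rng.normal(22.0, 4.0, size=5000)  # synthetic biased "GCM" temperature

corrected = quantile_map(model_hist, obs_hist, model_hist)
print(f"obs mean/std:       {obs_hist.mean():.2f} / {obs_hist.std():.2f}")
print(f"corrected mean/std: {corrected.mean():.2f} / {corrected.std():.2f}")

# A value far outside the historical model range is mapped to the observed maximum.
clamped = quantile_map(model_hist, obs_hist, np.array([1000.0]))[0]
print(f"mapped out-of-range value: {clamped:.2f} (obs max {obs_hist.max():.2f})")
```

Two features of this sketch relate to the trade-off described above: each variable is corrected in isolation, so nothing in the procedure constrains relationships between variables, and values beyond the calibration range are clamped unless an explicit extrapolation rule is added.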

7. A Prescriptive Protocol for Model Evaluation

To move beyond inconsistent evaluation practices and enable robust intercomparison, we outline a prescriptive, multi-faceted evaluation protocol. A model’s true utility is not captured by a single metric; at minimum, evaluation must be tailored to the variable being downscaled and must jointly assess spatial structure, extremes, and (when applicable) probabilistic skill. Adhering to such a protocol is a prerequisite for operational readiness.

7.1. Variable-Specific Minimum Suites

7.1.1. Protocol for Precipitation Downscaling

Precipitation is intermittent, heavy-tailed, and spatially complex; evaluation must therefore prioritize diagnostics sensitive to spatial structure and extremes rather than relying solely on pixel-wise errors such as RMSE, which suffer from the “double penalty” and encourage blurring. A minimum suite should include the following:

RMSE (baseline): Report for average error and legacy comparison, while acknowledging that it can penalize realistic high-frequency variability.
FSS (primary spatial metric): Report the Fraction Skill Score (FSS) across multiple intensity thresholds and neighborhood sizes [86]. We recommend thresholds relevant to hydrological impacts (depending on local severity/return period), e.g., 1, 5, and 20 mm/day, and reporting FSS as a function of neighborhood size; 10, 20, 40, and 80 km can serve as illustrative scales, but should be chosen to identify the scale at which forecasts become skillful.
Extreme-value fidelity: Report bias or absolute error at a high quantile (e.g., the 99th or 99.5th percentile) to directly assess rare, intense event magnitude. Complementary indices such as Rx1day and R99p help confirm tail behavior.
PSD (spatial realism): Plot the 1D radially averaged power spectrum versus reference data. An overly steep slope indicates excessive smoothing; a shallow slope or high-frequency bumps can indicate unrealistic noise or GAN-induced artifacts.
Probabilistic calibration (generative models): Report CRPS and RPSS to assess whether predictive distributions encompass observed outcomes; for probabilistic ensembles (e.g., GAN/Diffusion), CRPS is the primary overall skill score [87].
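The FSS recommended above can be computed with a few lines of NumPy. The sketch below uses square neighborhoods of n × n grid cells with zero padding at the domain edge and a summed-area table for the moving-window sums; the function names and the toy displaced-storm example are assumptions of this illustration. Converting neighborhood sizes from grid cells to kilometres depends on the grid spacing.

```python
import numpy as np

def neighborhood_fraction(binary, n):
    """Fraction of exceedances in an n x n window centred on each cell (n odd, zero padding)."""
    pad = n // 2
    padded = np.pad(binary.astype(float), pad, mode="constant")
    # Summed-area table with a leading row and column of zeros.
    sat = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    sat[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = binary.shape
    window_sum = (sat[n:n + h, n:n + w] - sat[:h, n:n + w]
                  - sat[n:n + h, :w] + sat[:h, :w])
    return window_sum / (n * n)

def fss(pred, ref, threshold, n):
    """Fraction Skill Score for one intensity threshold and one neighborhood size."""
    pf = neighborhood_fraction(pred >= threshold, n)
    po = neighborhood_fraction(ref >= threshold, n)
    mse = np.mean((pf - po) ** 2)
    mse_ref = np.mean(pf ** 2) + np.mean(po ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan

# Toy case: a 4 x 4 "storm" of 10 mm/day, predicted five cells too far east.
ref = np.zeros((40, 40))
ref[10:14, 10:14] = 10.0
pred = np.roll(ref, 5, axis=1)

for n in (1, 5, 9, 15):
    print(n, round(fss(pred, ref, threshold=5.0, n=n), 3))
```

At the grid scale the displaced storm scores zero, exactly as a pixel-wise metric would suggest, while larger neighborhoods credit the correct intensity and extent; reporting the full curve identifies the scale at which the field becomes skillful.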

7.1.2. Protocol for Temperature Downscaling

Temperature is smoother and more continuous than precipitation, but evaluation must still capture bias, spatial variability, and (if probabilistic) calibration:

RMSE and Bias: Report RMSE and mean bias (downscaled minus reference) as standard accuracy/systematic-error metrics.
PSD: Verify realistic spatial variability and detect over-smoothing.
Distributional metrics (e.g., Wasserstein distance): Compare full distributions to capture shifts in shape and tails beyond mean/variance.
Reliability diagram (probabilistic models): Assess calibration against the 1:1 diagonal.
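To show why a distributional metric adds information beyond mean bias, the following sketch computes the 1-Wasserstein distance between two equal-size samples using the sorted-value formula. The equal-size restriction, the function name, and the synthetic temperatures are simplifying assumptions of this illustration.

```python
import numpy as np

def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size samples (sorted-value formula)."""
    a = np.sort(np.ravel(a))
    b = np.sort(np.ravel(b))
    if a.size != b.size:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(4)
ref = rng.normal(15.0, 5.0, size=20_000)  # synthetic reference temperatures (deg C)
# An over-smoothed "prediction": same mean, half the spread.
pred = ref.mean() + 0.5 * (ref - ref.mean())

bias = pred.mean() - ref.mean()
w1 = wasserstein_1d(pred, ref)
print(f"mean bias: {bias:.2e} deg C")
print(f"Wasserstein-1 distance: {w1:.2f} deg C")
```

The over-smoothed field is unbiased in the mean yet sits roughly 2 °C away from the reference in distributional terms, a deficiency that mean bias alone would not reveal.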

By adopting these variable-specific suites, researchers can provide richer, more comparable performance assessments.

7.2. Comparative Analysis and State of the Art

No single architecture is universally superior; the state-of-the-art is task-dependent:

Spatial structure and deterministic accuracy: U-Net and ResNet-based CNNs remain strong contenders, especially for smoother variables (e.g., temperature), because of their inductive bias toward local patterns and topographically induced variations [8].
Perceptual realism and sharp textures: GANs can be highly effective but require careful evaluation to avoid “hallucinated” features [88]. Sup3rCC provides an operational GAN-based renewable-energy application [53].
Probabilistic outputs and UQ: Diffusion models are emerging as state-of-the-art because of stable training and high-fidelity, diverse ensembles [9,31], often outperforming GANs on distributional metrics. As a simple, strong epistemic-UQ baseline, report deep ensembles [89] with CRPS and reliability diagnostics.
Transferability and zero-shot generalization: Transformer-based foundation models represent the cutting edge, enabling generalization to new resolutions/regions with minimal fine-tuning and improved operational scalability [15].

7.3. Validation Under Non-Stationarity

Statistical downscaling (including ML) is fundamentally challenged by climate non-stationarity under anthropogenic forcing: relationships learned from historical data may fail in a warmer world with altered dynamics and novel states. As Lanzante et al. [7] note, historical large-to-local relationships may not hold under future conditions. In our framing (see Section 8.5), this appears as near-certain covariate and concept drift: models learn period-specific $P(X)$ and $P(Y \mid X)$, and when the future $P(X)$ or $P(Y \mid X)$ changes, performance degrades, forming a basis for the “transferability crisis”. We thus treat non-stationarity as a time-evolving distributional shift (a specific class of OOD input) manifesting as covariate and concept drift.

7.3.1. Pseudo-Global Warming (PGW) Experiments

PGW modifies historical meteorological data to reflect future warming by adding climate-change signals (e.g., temperature anomalies, humidity changes) derived from GCMs onto observed historical patterns. Training/evaluating ML models on PGW datasets enables systematic assessment of extrapolation to physically plausible future conditions; studies report potential improvements when models are exposed to more future-representative conditions.
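In its simplest additive form, the PGW construction can be sketched as follows for a single temperature series. The monthly warming signal, the idealized 360-day calendar, and all names are hypothetical; operational PGW experiments perturb several fields consistently (e.g., humidity alongside temperature), which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(5)
# Idealized 360-day year: month index (1..12) for each day.
months = np.repeat(np.arange(1, 13), 30)
seasonal = 10.0 * np.sin(2 * np.pi * (np.arange(360) - 90) / 360)
hist = 15.0 + seasonal + rng.normal(0.0, 2.0, size=360)  # synthetic historical series

# Hypothetical GCM-derived warming signal per calendar month (K).
delta_monthly = np.array([2.5, 2.4, 2.2, 2.0, 1.9, 1.8, 1.8, 1.9, 2.0, 2.2, 2.4, 2.5])

def apply_pgw(series, delta_monthly, months):
    """Add a month-dependent climate-change signal to a historical daily series."""
    return series + delta_monthly[months - 1]

def monthly_anomaly(series, months):
    """Remove each calendar month's mean to isolate day-to-day variability."""
    means = np.array([series[months == m].mean() for m in range(1, 13)])
    return series - means[months - 1]

future = apply_pgw(hist, delta_monthly, months)
print(f"mean warming: {future.mean() - hist.mean():.2f} K")
print("day-to-day variability preserved:",
      np.allclose(monthly_anomaly(future, months), monthly_anomaly(hist, months)))
```

The mean state shifts while the historical sequence of weather is retained, which is what makes PGW a controlled test of extrapolation rather than a full projection of future dynamics.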

7.3.2. Transfer Learning and Domain Adaptation

Transfer learning/domain adaptation aims to leverage knowledge from a source domain to improve performance on a related target domain [8]. In downscaling, this typically involves two steps:

Pre-train on large, diverse datasets (e.g., multiple GCMs, long records) to learn general/invariant atmospheric features.
Fine-tune on smaller target datasets (e.g., a region, future period, or new GCM) [17], improving generalization and reducing target-data needs; careful validation is required to avoid importing source-domain biases. Prasad et al. [17] show that pre-training can enhance zero-shot transferability for some tasks, though fine-tuning often remains necessary for optimal performance on distinct targets such as different GCM outputs.

7.3.3. Process-Informed Architectures and Predictor Selection

To move beyond purely statistical pattern recognition, this direction seeks to embed physical understanding via the following:

Encoding known physical relationships into the architecture (e.g., layers/connections that mimic processes or constraints).
Using physically motivated predictors (e.g., potential temperature, specific humidity, circulation indices) instead of large, collinear, or causally weak predictor sets.

Implementation remains an active research area with limited widespread adoption to date.

7.3.4. Validation Strategies for Non-Stationary Conditions

Random train/test splits are insufficient when distributions shift. More robust strategies include the following:

Perfect Model Framework (Pseudo-Reality): Treat high-resolution GCM/RCM output as “truth” [7]; train on its coarsened version and reconstruct the original truth. This enables testing across different climate states (historical vs. future periods) with known truth, directly probing extrapolation.
Cross-GCM Validation: Train on a subset of GCMs and test on withheld GCMs to assess generalization across structural differences and biases.
Temporal Extrapolation (Out-of-sample): Train on earlier periods and test on the most recent record or distinct climatic periods (e.g., warmest historical years as future proxies) [8].
Process-Based Evaluation: Verify physically plausible inter-variable relationships (e.g., temperature–precipitation scaling, wind–pressure) and key processes (diurnal cycles, seasonal transitions, extremes) under different conditions; XAI can help assess whether mechanisms are physically sound.

Addressing non-stationarity requires moving beyond empirical pattern-matching toward models with physical grounding and/or validation explicitly targeting robustness under shifting climates. When applying these strategies, it is also crucial to account for the non-IID nature of climate data (Section 9.5): temporal extrapolation should use temporal blocked CV (not random CV) to avoid leakage, and Perfect Model comparisons across climate states should ensure spatial independence within each period via spatial blocking or LLO CV for a fair generalization test. Figure 5 summarizes how key validation techniques map to the challenges they address (from leakage under spatio-temporal autocorrelation to OOD generalization under future scenarios).
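A temporally blocked split of the kind recommended above might be generated as in the sketch below. Test folds are contiguous blocks of years, and years adjacent to each test block are dropped from training to limit leakage through temporal autocorrelation. The function name, the one-year buffer, and the 1981 to 2020 record are assumptions for illustration; spatial blocking would follow the same logic applied to location identifiers.

```python
import numpy as np

def blocked_time_splits(years, n_folds, gap_years=1):
    """Yield (train_idx, test_idx) pairs with contiguous test blocks and a buffer gap."""
    unique_years = np.unique(years)
    for block in np.array_split(unique_years, n_folds):
        test_mask = np.isin(years, block)
        # Drop the test block and its neighbouring years from the training set.
        near = (years >= block.min() - gap_years) & (years <= block.max() + gap_years)
        yield np.where(~near)[0], np.where(test_mask)[0]

# Hypothetical daily record: the year label of each sample, 1981 to 2020.
years = np.repeat(np.arange(1981, 2021), 365)

splits = list(blocked_time_splits(years, n_folds=5))
for train_idx, test_idx in splits:
    test_years = np.unique(years[test_idx])
    print(f"test {test_years.min()}-{test_years.max()}: "
          f"{train_idx.size} train samples, {test_idx.size} test samples")
```

Compared with a random split, each held-out block contains days that are not interleaved with training days, so skill estimates are less inflated by day-to-day persistence.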

7.4. A Multi-Faceted Toolkit for Model Evaluation

Evaluation metrics implicitly define downscaling objectives: over-reliance on pixel-wise RMSE can yield overly smooth fields that miss spatial variability and extremes crucial for impact studies. We therefore summarize a holistic toolkit in Table 1 and define a minimum viable operational standard via three tiers.

Uncertainty Baselines

For epistemic UQ, report deep ensembles [89] as a baseline, scored with strictly proper rules such as CRPS [87] and assessed via reliability diagrams.
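For a finite ensemble at a single location, CRPS can be estimated from the members directly. The sketch below uses the standard ensemble form, the mean absolute error of the members minus half the mean absolute difference between member pairs; the synthetic ensembles are hypothetical.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Ensemble CRPS estimate for a scalar observation (lower is better)."""
    members = np.asarray(members, dtype=float)
    # Accuracy term: mean absolute error of the members.
    term_error = np.mean(np.abs(members - obs))
    # Spread term: half the mean absolute difference between all member pairs.
    term_spread = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return float(term_error - term_spread)

rng = np.random.default_rng(6)
ensemble = rng.normal(0.0, 1.0, size=200)  # hypothetical ensemble centred on the truth
obs = 0.0

crps_centred = crps_ensemble(ensemble, obs)
crps_biased = crps_ensemble(ensemble + 3.0, obs)
crps_single = crps_ensemble([3.0], 5.0)  # one member: reduces to absolute error

print(f"centred ensemble: {crps_centred:.3f}")
print(f"biased ensemble:  {crps_biased:.3f}")
print(f"single member:    {crps_single:.3f}")
```

In practice the score is averaged over grid cells and time steps; the single-member case shows why CRPS is directly comparable with the MAE of a deterministic baseline.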

7.5. Tier 1: Mandatory Baseline Diagnostics

Baseline error (RMSE/MAE) and bias: Necessary for legacy comparison and diagnosing systematic wet/dry drifts, despite the “double penalty” effect.
Texture and spectral realism (PSD): Mandatory to detect “spectral drop-off” (blurring); models that miss the correct $k^{-5/3}$ (or similar) slope are physically deficient.
Distributional sanity checks (QQ plot): Required to detect “distributional collapse” (regression to the mean), especially in tails.
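The spectral diagnostic in this tier can be sketched with NumPy's FFT routines. The example assumes a square field on a regular grid and bins power into integer-wavenumber rings (in cycles per domain); the white-noise test field and the 3 × 3 moving average standing in for an over-smoothed model output are illustrative assumptions.

```python
import numpy as np

def radial_psd(field):
    """Radially averaged 2-D power spectrum of a square field."""
    n = field.shape[0]
    spectrum = np.fft.fftshift(np.fft.fft2(field - field.mean()))
    power = np.abs(spectrum) ** 2 / field.size
    # Integer radial wavenumber of every Fourier coefficient.
    ky, kx = np.indices(field.shape) - n // 2
    k = np.rint(np.hypot(kx, ky)).astype(int)
    # Average power within each ring, up to the Nyquist wavenumber.
    sums = np.bincount(k.ravel(), weights=power.ravel())
    counts = np.bincount(k.ravel())
    kmax = n // 2
    return np.arange(1, kmax + 1), sums[1:kmax + 1] / counts[1:kmax + 1]

rng = np.random.default_rng(7)
noise = rng.normal(size=(64, 64))
# Periodic 3 x 3 moving average as a stand-in for an over-smoothed output.
smooth = sum(np.roll(np.roll(noise, dy, axis=0), dx, axis=1)
             for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0

k, psd_raw = radial_psd(noise)
_, psd_smooth = radial_psd(smooth)
print("power ratio at lowest wavenumber: ", round(psd_smooth[0] / psd_raw[0], 3))
print("power ratio at highest wavenumber:", round(psd_smooth[-1] / psd_raw[-1], 3))
```

Large-scale power is essentially unchanged by the smoothing while power near the Nyquist wavenumber collapses, which is the spectral drop-off signature this diagnostic is meant to expose. A spectral slope can then be estimated by a linear fit in log-log space over the resolved range.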

7.6. Tier 2: Essential Operational Standards

Spatial verification (FSS): FSS decouples displacement from intensity error and prevents selecting smooth, unrealistic models [86].
Extreme-value fidelity: Report high-quantile bias (e.g., $P_{99}$ ) or tail-dependence metrics to ensure rare events are captured.

Table 1. A multi-faceted evaluation toolkit for ML-based downscaling ^a.

Category	Metric	Description and Use Case	When to Use	Key Refs
Pixel-wise Accuracy	RMSE/MAE	Root Mean Squared Error/Mean Absolute Error. Standard metrics for average error, but can be misleading for skewed distributions (e.g., precipitation) and penalize realistic high-frequency variations.	Standard baseline, but use with caution; supplement with other metrics.	[11]
Spatial Structure	Structural Similarity (SSIM)	Measures perceptual similarity between images based on luminance, contrast, and structure. Better than RMSE for assessing preservation of spatial patterns.	To evaluate preservation of spatial patterns and textures.	[46]
	Power Spectral Density (PSD)	Compares the variance at different spatial frequencies. Crucial for diagnosing overly smooth outputs (loss of high-frequency power) or GAN-induced artifacts (spurious power).	To diagnose smoothing or unrealistic high-frequency noise.	[88,90]
	Variogram Analysis	Geostatistical tool that quantifies spatial correlation as a function of distance. Comparing nugget, sill, and range diagnoses noise, variance suppression, and incorrect spatial correlation length.	To quantitatively assess spatial dependency structure and diagnose over-smoothing.	[91]
	Method for Object-based Diagnostic Evaluation (MODE)	Identifies and compares attributes (e.g., area, location, orientation, intensity) of distinct objects (e.g., storms). Provides diagnostic information on specific spatial biases beyond grid-point errors.	For detailed diagnostic evaluation of precipitation fields, avoiding the “double penalty” issue.	[92,93]
Temporal Coherence	Temporal Autocorrelation	Measures the correlation of a time series with itself at a given lag (e.g., lag-1 for daily data). Assesses the model’s ability to reproduce temporal persistence or “memory”.	To diagnose unrealistic temporal “flickering” or lack of persistence in time series.	[94,95]
Temporal Coherence	Wet/Dry Spell Characteristics	Quantifies the statistics of consecutive days above/below a threshold (e.g., 1 mm/day for precipitation). Key metrics include mean/max spell duration, frequency, and cumulative intensity.	Essential for impact studies related to droughts and floods; evaluates temporal clustering of events.	[96,97]
Extreme Events	Fraction Skill Score (FSS)	A neighborhood-based verification metric that assesses the skill of forecasting events exceeding a certain threshold across different spatial scales. Mitigates the “double penalty” issue.	Essential for verifying precipitation fields at specific thresholds.	[86,90]
	Quantile-based scores (e.g., 99th percentile error)	Directly evaluates the accuracy of specific quantiles (e.g., p95, p99), focusing on performance in the tails of the distribution.	To specifically quantify performance on rare, high-impact events.	[36]
	Return Level/Period Consistency	Compares the magnitude of extreme events for given return periods (e.g., the 1-in-100-year event) between the downscaled output and observations, often using Extreme Value Theory.	For climate impact studies where long-term risk from extremes is key.	[98]
Distributional Similarity	Wasserstein Distance (Earth Mover’s Distance)	Measures the “work” required to transform one probability distribution into another. A robust measure of similarity between the full distributions of the downscaled and reference data.	For a rigorous comparison of the entire statistical distribution.	[33,99]
	CRPS (Continuous Ranked Probability Score)	For probabilistic forecasts, measures the integrated squared difference between the predicted cumulative distribution function (CDF) and the step-function CDF of the observed value. A proper scoring rule that reduces to MAE for deterministic forecasts.	Gold standard for evaluating probabilistic/ensemble forecast skill.	[87,90]
	Perkins Skill Score (PSS)	Measures the common overlapping area between two probability density functions (PDFs). An intuitive, distribution-agnostic metric of overall distributional similarity.	To provide a robust, integrated score of distributional overlap, common in climate model evaluation.	[100]
Uncertainty Quantification (UQ)	Reliability Diagram	Plots observed frequencies against forecast probabilities for binned events to assess calibration. A perfectly calibrated model lies on the diagonal.	To assess if forecast probabilities are statistically reliable.	[90]
Uncertainty Quantification (UQ)	PIT Histogram	Probability Integral Transform. For a calibrated ensemble, the PIT values of the observations should be uniformly distributed. Deviations indicate biases or incorrect spread.	To diagnose issues with ensemble spread and bias.	[90]
Physical Consistency	Conservation Error	Directly measures the violation of a conservation law (e.g., mass, energy) by comparing the aggregated high-resolution output to the coarse-resolution input value.	When conservation of a quantity is a critical physical constraint.	[25]
	Multivariate Correlations	Assesses whether the physical relationships and correlations between different downscaled variables (e.g., temperature and humidity) are preserved realistically.	Essential for multi-variable downscaling to ensure physical coherence.	[9]
	Clausius–Clapeyron Scaling	Verifies if the intensity of extreme precipitation scales with temperature at the physically expected rate (approximately 7%/°C). Tests if the model has learned a fundamental thermodynamic relationship.	Critical for assessing the credibility of future projections of extremes under warming.	[12]

Note(s): ^a Recommended Minimum Evaluation Suite: For a robust evaluation, we recommend a core set of metrics. For spatial structure, report Variogram parameters (sill, range) and key MODE diagnostics (e.g., centroid error, area bias). For temporal coherence, report lag-1 autocorrelation and mean/max dry spell duration. For extremes, report FSS at relevant thresholds and the error at a high quantile (e.g., 99th). For distributional skill, report CRPS (if applicable) and Perkins Skill Score. For physical consistency, report Conservation Error and C-C Scaling rate.
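Among the metrics in the recommended minimum suite, the Perkins Skill Score is straightforward to compute from binned relative frequencies. The sketch below is one possible implementation; the choice of 50 shared bins, the function name, and the synthetic samples are assumptions for illustration.

```python
import numpy as np

def perkins_skill_score(pred, ref, bins=50):
    """Overlap of two empirical PDFs: sum of the bin-wise minimum relative frequency."""
    lo = min(pred.min(), ref.min())
    hi = max(pred.max(), ref.max())
    edges = np.linspace(lo, hi, bins + 1)  # shared bin edges for both samples
    p, _ = np.histogram(pred, bins=edges)
    r, _ = np.histogram(ref, bins=edges)
    return float(np.sum(np.minimum(p / p.sum(), r / r.sum())))

rng = np.random.default_rng(8)
ref = rng.normal(15.0, 5.0, size=10_000)
shifted = ref + 2.0                       # same shape, 2-degree warm bias
disjoint = rng.uniform(100.0, 101.0, size=10_000)

print(f"identical: {perkins_skill_score(ref, ref):.3f}")
print(f"shifted:   {perkins_skill_score(shifted, ref):.3f}")
print(f"disjoint:  {perkins_skill_score(disjoint, ref):.3f}")
```

The score is bounded between 0 (no overlap) and 1 (identical binned distributions); its value depends on the bin width, so the binning should be reported alongside the score.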

7.7. Tier 3: Advanced and Probabilistic Standards

Probabilistic calibration (CRPS): For generative models (Diffusion, GANs), deterministic evaluation is ill-posed; CRPS is a strictly proper scoring rule and gold standard for ensemble assessment [87].
Non-stationarity stress tests (PGW): PGW injects thermodynamic signals into historical dynamics to stress-test extrapolation (e.g., Clausius–Clapeyron scaling) prior to deployment.
Physical consistency: Require conservation checks (mass/energy budget closure) to detect fundamental physical violations.
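A basic conservation check compares the block average of the high-resolution output with the coarse input it was derived from. The sketch below assumes nested, equal-area cells and an integer refinement factor; on latitude-longitude grids an area-weighted average would be needed instead. Names and test fields are hypothetical.

```python
import numpy as np

def conservation_error(fine, coarse, factor):
    """Block-averaged fine field minus the coarse input, per coarse cell."""
    h, w = coarse.shape
    aggregated = fine.reshape(h, factor, w, factor).mean(axis=(1, 3))
    return aggregated - coarse

rng = np.random.default_rng(9)
factor = 4
coarse = rng.gamma(shape=2.0, scale=3.0, size=(8, 8))  # synthetic coarse precipitation

# A field that conserves the coarse-cell means exactly (simple replication) ...
fine_exact = np.kron(coarse, np.ones((factor, factor)))
# ... and one that is uniformly 10% too wet.
fine_wet = 1.1 * fine_exact

err_exact = conservation_error(fine_exact, coarse, factor)
err_wet = conservation_error(fine_wet, coarse, factor)
print(f"max |error|, exact field: {np.abs(err_exact).max():.2e}")
print(f"mean relative error, wet field: {(err_wet / coarse).mean():.3f}")
```

Reporting the error per coarse cell, rather than only a domain total, helps distinguish a uniform wet or dry drift from localized violations that cancel in the aggregate.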

7.8. Diagnostic Visualization Suite

A comprehensive model audit should go beyond aggregate scores and make error modes visible. The literature consistently highlights a small set of diagnostic visualizations that expose why a downscaling model succeeds or fails and whether improvements are robust or artifact-driven. Figure 6 and Figure 7 provide exemplary illustrations of these diagnostic approaches from Sha et al. [101], demonstrating parity plot analysis for distributional consistency and transferability assessment, as well as spatial fidelity comparisons that reveal structural differences between traditional and deep learning methods. We recommend reporting the following diagnostics (at least for representative regions/seasons and for both the training and independent test periods):

Parity (predicted vs. reference) plots: Reveal additive bias (intercept), multiplicative bias (slope), heteroscedasticity (error increasing with intensity), and tail distortion (systematic under-/over-estimation of extremes). When stratified by season, elevation/coast distance, or intensity bins, they also expose regime-dependent failures that are otherwise hidden by a single MAE/RMSE. Figure 6 demonstrates this approach, showing both in-domain and out-of-domain (transfer) performance to assess model generalization.
Confusion matrices for thresholded events: For impact-relevant exceedances (e.g., above P95/P99 or fixed hydrologic thresholds), a confusion matrix makes the trade-off between misses and false alarms explicit. Reporting derived skill measures (e.g., POD/recall, FAR, CSI, precision, and bias) alongside the matrix clarifies whether the model is conservative (many misses) or over-active (many false alarms).
Correlation heatmaps and dependence checks: Heatmaps of inter-variable correlations (and, when relevant, lag/lead correlations) help verify whether the model preserves physical coherence rather than matching marginal distributions only. Comparing correlation structure between the reference and predictions can reveal “physically implausible” dependence patterns even when pixel-wise errors are small.
Spatial error maps and structural diagnostics: Maps of mean bias, MAE/RMSE, and (for precipitation) event-based structure metrics (e.g., neighborhood/FSS-style summaries) localize systematic errors (orographic/coastal bands, convective hotspots, land–sea boundaries). This directly supports model development by tying failures to geography and process regimes. Figure 7 exemplifies this diagnostic approach, comparing spatial outputs and difference maps to reveal how deep learning methods recover the fine-scale topographic detail that traditional methods smooth out.

These diagnostics are complementary: parity plots diagnose distributional behavior, confusion matrices connect performance to decision thresholds, correlation checks test physical consistency, and spatial maps expose structured errors. Together they provide a transparent “audit trail” of model limitations and help prevent over-interpreting improvements that are confined to a narrow regime.
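The confusion-matrix diagnostic and its derived scores can be computed as in the sketch below, using the standard contingency-table definitions (POD as hit rate, FAR as false alarm ratio, CSI as critical success index, and frequency bias). The ten-value example and the 20 mm/day threshold are hypothetical, and the sketch assumes at least one observed and one predicted event so that no denominator is zero.

```python
import numpy as np

def event_scores(pred, ref, threshold):
    """Contingency-table counts and derived skill measures for threshold exceedance."""
    p = np.asarray(pred) >= threshold
    r = np.asarray(ref) >= threshold
    hits = int(np.sum(p & r))
    misses = int(np.sum(~p & r))
    false_alarms = int(np.sum(p & ~r))
    correct_negatives = int(np.sum(~p & ~r))
    return {
        "hits": hits,
        "misses": misses,
        "false_alarms": false_alarms,
        "correct_negatives": correct_negatives,
        "POD": hits / (hits + misses),                 # recall
        "FAR": false_alarms / (hits + false_alarms),   # 1 - precision
        "CSI": hits / (hits + misses + false_alarms),
        "bias": (hits + false_alarms) / (hits + misses),
    }

# Hypothetical daily precipitation (mm/day) at ten grid cells.
ref = np.array([25, 30, 22, 40, 5, 0, 3, 10, 1, 0], dtype=float)
pred = np.array([28, 21, 35, 12, 24, 0, 2, 26, 4, 0], dtype=float)

scores = event_scores(pred, ref, threshold=20.0)
for name, value in scores.items():
    print(f"{name}: {value}")
```

A frequency bias above 1 together with a non-trivial FAR, as in this toy case, indicates an over-active model, whereas a bias below 1 with many misses indicates a conservative one.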

7.9. Operational Relevance: Beyond Statistical Skill

A model’s utility for real-world deployment depends on practical considerations beyond its performance on a test set.

Computational Cost: Dynamical downscaling is exceptionally expensive, limiting its use for large ensembles. ML offers an alternative that is cheaper by orders of magnitude [9,63]. However, costs vary within ML: inference with CNNs is fast, while the iterative sampling of diffusion models is slower. Training large foundation models requires massive computational resources, but once trained, fine-tuning and inference can be efficient [16]. The hybrid dynamical–generative approach offers a compelling trade-off, drastically cutting the cost of the most expensive part of the physical simulation pipeline [9].
Interpretability: As discussed in Section 9.2.2, the “black-box” nature of deep learning is a major barrier to operational trust. The ability to use XAI tools to verify that a model is learning physically meaningful relationships, rather than spurious “shortcuts”, is crucial for deployment in high-stakes applications.
Robustness and Generalization: The single most important factor for operational relevance is a model’s ability to generalize to out-of-distribution (OOD) data, namely future climate scenarios. As detailed in Section 9.1, models that fail under covariate or concept drift are not operationally viable for climate projection. Therefore, rigorous OOD evaluation using techniques like cross-GCM validation and Pseudo-Global Warming (PGW) experiments is a prerequisite for deployment.
Baselines: Always include strong classical comparators (e.g., BCSD/quantile-mapping and LOCA) as default references alongside modern DL models; these remain common operational choices in hydrologic and climate-service pipelines [102,103]. Formal assessments and national products continue to operationalize statistical interfaces between GCMs and impacts—bias adjustment and empirical/statistical downscaling (e.g., LOCA2, STAR-ESDM)—as default pathways, which underscores why ML downscalers must demonstrate clear, application-relevant added value [104,105].

8. Critical Investigation of Model Performance and Rationale

A critical aspect of advancing ML-based downscaling involves understanding not only which models perform well, but why they do so, and conversely, what factors impede their learning and generalization capabilities. This requires a careful examination of the rationale behind model choices and a comparative analysis of their strengths and weaknesses in specific downscaling tasks.

8.1. Rationale for Model Choices

The selection of a particular ML architecture for climate downscaling is often guided by the inherent strengths of the architecture in handling specific types of data and learning particular kinds of patterns.

CNNs/U-Nets for Spatial Patterns: These architectures are predominantly chosen for their proficiency in learning hierarchical spatial features from gridded data. Convolutional layers are adept at identifying local patterns, while pooling layers capture broader contextual information. U-Nets, with their encoder–decoder structure and skip connections, are particularly favored for tasks requiring precise spatial localization and preservation of fine details, making them well-suited for downscaling variables like temperature and precipitation where spatial structure is paramount [8].
LSTMs/ConvLSTMs for Temporal Dependencies: When the temporal evolution of climate variables and their sequential dependencies are critical (e.g., for daily precipitation sequences or hydrological runoff forecasting), LSTMs and ConvLSTMs are preferred because of their recurrent nature and ability to capture long-range temporal patterns.
GANs/Diffusion Models for Realistic Outputs and Extremes: These generative models are selected when the objective is to produce downscaled fields that are not only statistically accurate but also perceptually realistic, with sharp gradients and a better representation of the full statistical distribution, including extreme events [8].
Transformers for Long-Range Dependencies: The increasing adoption of Transformer architectures is driven by their powerful self-attention mechanisms, which allow them to model global context and long-range dependencies in both spatial and temporal dimensions, a capability that can be beneficial for complex climate system dynamics [16,57].

The selection process reflects an understanding that different climate variables possess distinct characteristics—for example, the relatively smooth spatial continuity of temperature versus the highly intermittent and patchy nature of precipitation. This distinction often guides the choice towards architectures whose inductive biases align with these characteristics.

8.2. Strategic Framework for Architecture Selection

Selecting an ML architecture requires navigating trade-offs between computational cost, physical fidelity, and specific predictand properties. We propose the following decision logic to guide operational deployment:

1. Resource-Constrained/Mean-State Applications: If inference latency must be minimal, CNNs (U-Net, ResNet) are the most practical choice. Their deterministic, single-pass inference is fast, but they tend to smooth high-frequency variability relative to generative models.
2. Risk Assessment/Extremes: For applications requiring accurate heavy tails, diffusion models are the current state of the art. Tomasi et al. [31] demonstrate that latent diffusion models can mimic kilometer-scale dynamics with high fidelity, recovering the small-scale variance lost by standard CNNs.
3. Texture Synthesis/Energy Assessment: For renewable-energy applications where spectral realism is paramount, GANs offer a compromise. They generate sharp textures quickly, though they carry a risk of mode collapse.
4. Continental Domains/Teleconnections: For domains driven by remote teleconnections, spectral or operator-based architectures with global receptive fields, such as FourCastNet, capture global context better than local CNNs.

8.3. The Coherent Pipeline: Linking Loss, Architecture, and Validation

Model success is not determined by isolated choices (e.g., architecture alone) but by the coherence of the entire modeling pipeline. A mismatch between components leads to failure modes often misattributed to the architecture:

Loss–Architecture Alignment: Using a generative architecture (GAN) with a dominant MSE loss negates the generative benefit, collapsing the output back to the conditional mean. Conversely, a probabilistic metric like CRPS is only informative when paired with a stochastic architecture (Diffusion or Dropout-Ensemble); for a single deterministic output it reduces to the mean absolute error.
Predictor–Physics Alignment: The architecture must be supplied with predictors that carry the relevant physical signal. For example, a Vision Transformer designed to capture teleconnections will fail if the input domain is too small to contain the large-scale driver (predictor mismatch).
Validation–Goal Alignment: A pipeline designed for extremes (using Weighted Loss) will appear to fail if evaluated strictly on RMSE (which penalizes variance). It must be validated using FSS or Tail-Dependence metrics.
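The CRPS point in the loss–architecture item can be made concrete with a minimal sketch. It assumes the standard sample-based (energy-form) estimator of CRPS rather than any implementation from the reviewed studies, and the helper name `crps_ensemble` and the numbers are purely illustrative.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Sample-based CRPS for one scalar observation.

    Energy form: E|X - y| - 0.5 * E|X - X'|, with expectations taken
    over the ensemble members (illustrative helper, not a library API).
    """
    members = np.asarray(members, dtype=float)
    # Mean absolute difference over all ordered pairs of members
    spread = np.abs(members[:, None] - members[None, :]).mean()
    return np.abs(members - obs).mean() - 0.5 * spread

obs = 12.0
# A single deterministic output: the spread term vanishes, leaving |forecast - obs|
crps_deterministic = crps_ensemble([9.0], obs)
# An ensemble that brackets the observation is credited for its spread
crps_stochastic = crps_ensemble([9.0, 11.0, 13.0, 15.0], obs)
```

With one member the score is exactly the absolute error, which is why CRPS adds no information beyond MAE unless the model can produce multiple plausible realizations.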

8.4. Factors Contributing to Model Success

Several factors consistently contribute to the successful application of ML models in climate downscaling:

Appropriate Architectural Design: Matching the model architecture to the inherent characteristics of the data and the downscaling task is paramount. For instance, CNNs are well-suited for gridded spatial data, while LSTMs excel with time series. The incorporation of architectural enhancements such as residual connections and the skip connections characteristic of U-Nets has proven crucial for training deeper models and preserving fine-grained spatial detail.
Effective Feature Engineering: The performance of ML models is significantly boosted by the inclusion of relevant predictor variables. In particular, incorporating high-resolution static geographical features such as topography, land cover, and soil type provides essential local context that coarse-resolution GCMs or reanalysis products inherently lack. This allows the model to learn how large-scale atmospheric conditions are modulated by local surface characteristics.
Quality and Representativeness of Training Data: The availability of sufficient, high-quality, and representative training data is fundamental. Data augmentation techniques, such as rotation or flipping of input fields, can expand the training set and improve model generalization, especially for underrepresented phenomena like extreme events [21,106].
Appropriate Loss Functions: The choice of loss function used during model training significantly influences the characteristics of the downscaled output. While standard loss functions like MSE are common, they can lead to overly smooth predictions and poor representation of extremes. Tailoring loss functions to the specific task—for example, using quantile loss, Bernoulli-Gamma loss for precipitation (which models occurrence and intensity separately), Dice loss for imbalanced data, or the adversarial loss in GANs for perceptual quality—can lead to substantial improvements in capturing critical aspects of the climate variable’s distribution [8]. Studies show that L1 and L2 loss functions perform differently depending on data balance, with L2 often being better for imbalanced data like precipitation [83].
Rigorous Validation Frameworks: The use of robust validation strategies, including out-of-sample testing and standardized evaluation metrics beyond simple error scores (e.g., the VALUE framework [107]), is crucial for assessing true model skill and generalizability.
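The effect of the loss function on what a model converges to can be illustrated with a small numerical sketch. It assumes a synthetic, gamma-distributed stand-in for daily precipitation and a brute-force search over constant predictions; the names `pinball_loss`, `best_mse`, and `best_q95` are hypothetical and not drawn from the cited studies.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss for quantile level tau in (0, 1)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    # Under-prediction is weighted by tau, over-prediction by (1 - tau)
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

rng = np.random.default_rng(0)
# Skewed synthetic "precipitation" sample (assumption: gamma-distributed, mean 4)
precip = rng.gamma(shape=0.5, scale=8.0, size=5000)

# Find the best constant prediction under each loss by grid search
candidates = np.linspace(0.0, 40.0, 401)
best_mse = candidates[np.argmin([np.mean((precip - c) ** 2) for c in candidates])]
best_q95 = candidates[np.argmin([pinball_loss(precip, c, 0.95) for c in candidates])]
```

The MSE-optimal constant sits at the sample mean, whereas the pinball loss at level 0.95 is minimized near the 95th percentile, which is why quantile-type losses are better suited to tail-focused objectives.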

8.5. Factors Hindering Model Learning

Comparative Susceptibility to Physical Inconsistency

Different architectures exhibit distinct “signatures” of physical violation that practitioners must mitigate:

CNNs (Spectral Smoothing): Purely statistical CNNs minimize pixel-wise error, encouraging averaging. Physically, this manifests as a violation of energy conservation at small scales—the Power Spectral Density (PSD) drops off rapidly at high wavenumbers.
GANs (Structural Hallucination): While GANs restore PSD, they are prone to structural hallucinations. Annau et al. [88] explicitly document how GANs can generate realistic-looking wind features in physically inconsistent locations to satisfy the discriminator, leading to artifacts in geophysical fields.
Transformers (Boundary Artifacts): Models that process data in patches, such as certain Vision Transformers, face the risk of boundary artifacts. As analyzed by Pérez et al. [108], tiling approaches can lead to performance degradation or discontinuities at patch borders if not managed with careful overlap strategies.
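The spectral-smoothing signature can be checked with a radially averaged power spectrum. The sketch below is a simplified diagnostic under stated assumptions: a square, periodic white-noise field stands in for the high-resolution truth, and repeated 3×3 averaging stands in for an over-smoothed CNN output. The helpers `radial_psd` and `box_smooth` are hypothetical names.

```python
import numpy as np

def radial_psd(field):
    """Radially averaged power spectrum of a square 2-D field."""
    n = field.shape[0]
    power = np.abs(np.fft.fftshift(np.fft.fft2(field - field.mean()))) ** 2
    ky, kx = np.indices(power.shape)
    # Integer radial wavenumber bin; zero frequency sits at index n // 2 after fftshift
    k = np.hypot(kx - n // 2, ky - n // 2).astype(int)
    sums = np.bincount(k.ravel(), weights=power.ravel())
    counts = np.bincount(k.ravel())
    return (sums / counts)[1 : n // 2]  # drop k = 0, keep bins below Nyquist

def box_smooth(field, passes=2):
    """Crude stand-in for an over-smoothed output: repeated periodic 3x3 averaging."""
    out = field.copy()
    for _ in range(passes):
        out = sum(np.roll(np.roll(out, dy, 0), dx, 1)
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
    return out

rng = np.random.default_rng(1)
truth = rng.standard_normal((64, 64))
smooth = box_smooth(truth)

psd_truth, psd_smooth = radial_psd(truth), radial_psd(smooth)
high_k_ratio = psd_smooth[-8:].mean() / psd_truth[-8:].mean()  # finest scales
low_k_ratio = psd_smooth[:4].mean() / psd_truth[:4].mean()     # largest scales
```

A `high_k_ratio` far below `low_k_ratio` is the drop-off at high wavenumbers described above; applying the same comparison to a GAN output would test whether fine-scale power has been restored.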

Despite their successes, ML models can be hindered by several factors that impede their ability to learn effectively or generalize reliably:

Overfitting: Models may learn noise, dataset-specific artifacts, or spurious correlations that appear predictive in-sample but fail under domain shift (e.g., new regions, new GCMs, or future climates), especially with highly flexible DL models and limited or non-diverse training data. Current ML practice addresses this critique through the following measures:

(i) regularization and capacity control (weight decay, dropout, spectral/weight normalization where appropriate, and conservative architecture sizing), (ii) early stopping and robust training protocols (learning-rate schedules, checkpoint selection on out-of-sample validation, and monitoring of extreme-focused diagnostics rather than loss alone), (iii) data strategies (augmentation, regime-balanced sampling for rare extremes, and careful handling of leakage in spatio-temporal splits), (iv) validation designed to expose overfit (spatial blocking, temporal blocking, cross-region tests, and cross-GCM/PGW-style tests when the model is intended for future projection), and (v) uncertainty-aware stabilization (ensembles or MC-dropout-style approximations) to reduce variance and improve calibration. In combination, these practices respond directly to the standard overfitting critique in high-dimensional climate settings by making generalization a first-class design constraint rather than an afterthought.

Poor Generalization (The “Transferability Crisis”), Covariate Shift, Concept Drift, and Shortcut Learning: A major and persistent challenge is the failure of models to extrapolate reliably to conditions significantly different from those encountered during training. This “transferability crisis” is the core of the “performance paradox” and is rooted in the violation of the stationarity assumption. It can be rigorously framed using established machine learning concepts:
− Covariate Shift: This occurs when the distribution of input data, $P(X)$, changes between training and deployment, while the underlying relationship $P(Y | X)$ remains the same [109]. In downscaling, this is guaranteed when applying a model trained on historical reanalysis (e.g., ERA5) to the outputs of a GCM, which has its own systematic biases and statistical properties. It also occurs when projecting into a future climate where the statistical distributions of atmospheric predictors (e.g., mean temperature, storm frequency) have shifted.
− Concept Drift: This is a more fundamental challenge where the relationship between predictors and the target variable, $P(Y | X)$, itself changes [109]. Under climate change, the physical processes linking large-scale drivers to local outcomes might be altered (e.g., changes in atmospheric stability could alter lapse rates). A mapping learned from historical data may therefore become invalid.
− Shortcut Learning: This phenomenon offers a mechanism for why models are so vulnerable to these shifts [110]. Models often learn “shortcuts” (simple, non-robust decision rules that exploit spurious correlations in the training data) instead of the true underlying physical mechanisms. For example, a model might learn to associate a specific GCM’s known regional cold bias with a certain type of downscaled precipitation pattern. This shortcut works well for that GCM but fails when applied to a different, unbiased GCM or to the real world, leading to poor out-of-distribution (OOD) performance. The finding by González-Abad et al. [65] that models may rely on spurious teleconnections is a prime example of shortcut learning in this domain.
Lack of Physical Constraints: Purely data-driven ML models, optimized solely for statistical accuracy, can produce outputs that are physically implausible or inconsistent (e.g., violating conservation laws). This lack of physical grounding can severely limit the trustworthiness and utility of downscaled projections.
Data Limitations: Insufficient training data, particularly for rare or extreme events, remains a significant bottleneck. Data scarcity in certain geographical regions also poses a challenge for developing globally applicable models. Furthermore, the lack of training data that adequately represents the full range of potential future climate states can hinder a model’s ability to project future changes accurately.
Inappropriate Model Complexity: Choosing an inappropriate level of model complexity can be detrimental. Models that are too simple may underfit the data, failing to capture complex relationships. Conversely, overly complex models are prone to overfitting, may be more difficult to train, and can be computationally prohibitive.
Training Difficulties (e.g., Vanishing/Exploding Gradients): In very deep neural networks, especially plain CNNs without architectural aids like residual connections, the gradients used for updating model weights can become infinitesimally small (vanishing) or excessively large (exploding), hindering the learning process.
Input Data Biases and Inconsistencies: Systematic biases in GCM outputs, or inconsistencies between the statistical characteristics of training data (e.g., reanalysis) and application data (e.g., GCM outputs from a different model or future period), can significantly degrade downscaling performance. Such inconsistencies are a form of the covariate shift discussed above. Preprocessing steps, such as bias correction of predictors or working with anomalies by removing climatology, are often crucial for mitigating these issues [7].
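Temporal blocking, listed among the validation strategies designed to expose overfitting, can be sketched as follows. The example assumes daily samples labeled by calendar year and holds out the most recent years as one contiguous block with a buffer year in between. The function name `blocked_year_split` and its defaults are illustrative choices, not a prescribed protocol.

```python
import numpy as np

def blocked_year_split(years, test_fraction=0.2, gap_years=1):
    """Hold out the most recent years as one contiguous test block.

    A buffer of `gap_years` between train and test limits leakage from
    temporal autocorrelation (illustrative helper and default values).
    """
    unique = np.unique(years)
    n_test = max(1, int(round(test_fraction * unique.size)))
    test_years = unique[-n_test:]
    train_years = unique[: unique.size - n_test - gap_years]
    return np.isin(years, train_years), np.isin(years, test_years)

# 40 synthetic years of daily samples (365 days each, leap days ignored)
years = np.repeat(np.arange(1981, 2021), 365)
train_mask, test_mask = blocked_year_split(years)
```

Unlike a random shuffle, this split never places adjacent days on both sides of the train/test boundary; the same idea extends to spatial blocking by grouping on region identifiers instead of years.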

The difficulty in generalizing to out-of-sample conditions, often linked to models learning superficial statistical correlations rather than robust physical mechanisms, represents a core impediment. This suggests that high performance on historical test data does not automatically translate to reliability for future climate projections, necessitating specific strategies to enhance model robustness and physical understanding.

8.6. Comparative Analysis of ML Approaches

To contextualize the performance of modern architectures, we benchmark against established shallow learning techniques. As demonstrated in Vandal et al. [1], Support Vector Regression (PCASVR) often struggles with the high-dimensional spatial dependencies inherent in gridded climate data. Their analysis (Figure 8a) reveals that PCASVR yields a median daily RMSE nearly double that of Autoencoder networks (AE). Furthermore, qualitative analysis of spatial bias (Figure 8b) indicates that SVM-based approaches fail to capture regional coherence, producing scattered noise compared to the physically consistent fields generated by deep learning architectures.

As one illustrative case, an AI-assisted ERA5 precipitation product learned a mapping from ERA5 atmospheric proxies (and satellite estimates) to high-resolution observed precipitation, effectively performing bias correction and downscaling in one step [111]. Solved: a practical route to operational products with good transfer to unseen regions. Open: explicit uncertainty quantification and physical conservation.

A comparative understanding of different ML architectures is essential for selecting appropriate methods for specific downscaling tasks. Table 2 provides a synthesis of dominant ML architectures used in climate downscaling, outlining their key mechanisms, strengths, limitations, typical applications, and resolutions.

The development and application of these diverse architectures indicate that the field is moving towards more tailored solutions. The “best” model is not a fixed entity but depends on a nuanced understanding of the problem at hand—the specific climate variable, the geographical context, the importance of physical consistency versus perceptual realism, and the need to capture extreme events. This highlights a crucial understanding: model performance is an emergent property of the entire downscaling pipeline, encompassing not just the architecture but also the quality of input features, the appropriateness of the loss function, and the rigor of the validation strategy. A sophisticated architecture can falter if other components of this pipeline are suboptimal, particularly for challenging tasks like accurately downscaling extreme precipitation or ensuring physical consistency under future climate scenarios.

9. Overarching Challenges in ML-Based Climate Downscaling

Despite the significant advancements and demonstrated potential of machine learning in climate downscaling, several overarching challenges persist. These challenges often interrelate and collectively hinder the operational deployment and full realization of ML’s capabilities in providing robust and trustworthy high-resolution climate information. Key among these are issues of transferability and domain adaptation, ensuring physical consistency and interpretability, effectively representing extreme events, quantifying uncertainties, and addressing practical aspects like reproducibility and data limitations.

Evidence snapshot: Several of the challenges summarized in this section are supported by quantitative evidence in recent benchmarking and out-of-distribution tests. For example, transferability failures have been measured directly: Hernanz et al. [120] reported that a CNN downscaling model’s error increased sharply (approximately doubling in their out-of-sample experiment) when evaluated on a future-climate scenario outside the training distribution, illustrating the severity of domain shift in practice. Similarly, Legasa et al. [18] found that projected precipitation-change signals can exhibit substantially larger biases when an ML downscaler is applied to a different driving GCM than the one used for training, reinforcing that cross-GCM generalization is a central bottleneck. In contrast, explicitly physics-constrained approaches remain comparatively rare in the downscaling literature; however, where applied, they provide concrete gains in physical realism (e.g., hard-constrained conservation in CNN-based downscaling [27]). Finally, while uncertainty quantification (UQ) methods exist (e.g., statistical and ensemble-based approaches [69,121]), deterministic reporting still dominates many ML downscaling studies, motivating the need for broader, calibrated UQ adoption [55].

Conceptual framework: It is useful to distinguish between the root causes of these limitations and their downstream manifestations. Fundamentally, issues of non-stationarity and distributional shift in climate data act as root causes that propagate through the modeling pipeline. For example, the inevitable shifts in input distributions (covariate and concept drift) are a primary driver behind downstream failures such as degraded cross-GCM transferability and breakdowns under future climate scenarios. Likewise, data limitations (e.g., sparse observations or biased training datasets) are a root cause that manifests as poor extreme-event representation and biased predictions in underserved regions. By mapping each observed challenge to an underlying mechanism—e.g., non-stationarity → loss of generalization, or lack of physical constraints → physically inconsistent outputs—we impose a hierarchy: the field’s grand challenges (Section 10 and Section 11) often stem from a few fundamental causes. This framing clarifies that tackling root issues like distributional shifts or missing physics will alleviate multiple downstream symptoms (such as performance paradox, extreme-event errors), thereby bringing greater scientific rigor to the analysis of ML downscaling limitations.

Framework—Root Causes vs. Manifestations: Many of these challenges stem from a few fundamental root causes (such as climate non-stationarity and distributional shifts), which then lead to downstream manifestations (like degraded cross-GCM transferability or poor extreme-event representation). For example, the violation of stationarity under climate change is a root cause that directly drives transferability failures and underperformance for extremes. By mapping each observed issue to an underlying mechanism—e.g., covariate and concept drift (root causes) leading to model generalization breakdowns and extreme-value biases (manifestations)—we clarify their relationships. In the discussion below, we explicitly distinguish these levels: addressing a root cause (say, making models robust to distributional change) can alleviate multiple manifestations at once (improving cross-domain generalization and extreme-event fidelity, in this case). This conceptual hierarchy adds rigor to our analysis of the challenges.

9.1. Transferability and Domain Adaptation: The Achilles’ Heel

A central and frequently cited limitation of ML-based downscaling—especially deep learning—is poor transferability/generalizability to data distributions different from those seen in training [7]. Recent intercomparison efforts (e.g., Legasa et al. [18]) now assess transferability directly by testing how well ML methods (including CNNs and Random Forests) preserve the climate-change signals projected by the driving GCM [7].

The root cause often traces to violations of the stationarity assumption, compounded by models learning statistical shortcuts rather than robust physical relationships. As discussed in Section 8.5, these failures can be framed as covariate shift and concept drift. Covariate shift occurs when the predictor distribution $P(X)$ changes between training and deployment, which is nearly inevitable when moving from historical to future projections or from reanalysis products to GCM outputs. For example, future GCM climates present different statistical distributions of large-scale predictors (e.g., shifted mean temperatures and altered variability) than the historical training data. Likewise, models trained on reanalysis such as ERA5 ($X_{train}$) often face covariate shift when applied to GCM outputs ($X_{test}$), which exhibit systematic biases and distinct statistical properties even for the same historical period; this connects to “Ensuring Domain Consistency” (Section 6.5) and “Input Data Biases and Inconsistencies” (Section 8.5).

Concept drift, a change in the mapping $P(Y | X)$ between predictors and predictands, is also likely under climate change because the physical processes linking large-scale drivers (X) to local variables (Y) may evolve (e.g., via altered atmospheric stability or land-surface feedbacks), so a historical mapping $f : X \to Y$ may become suboptimal or invalid. Moreover, different GCMs (with distinct physical parameterizations) can represent $P(Y | X)$ differently; thus, applying a model trained on one GCM’s representation to another can induce concept drift even if the large-scale predictors X are perfectly harmonized.

The “performance paradox”—excellent skill on historical in-distribution tests yet catastrophic failure under these shifts—follows when models encounter covariate and concept shifts without robustness, often because they have learned shortcuts that hold only in the training distribution.

These models learn relationships $P(Y | X)$, together with the predictor distribution $P(X)$ over which those relationships are sampled, that are specific to the historical period. When future $P(X)$ changes (covariate shift) or future $P(Y | X)$ changes (concept drift), the model’s performance degrades. This provides a more fundamental explanation for the “transferability crisis” than merely stating that models learn “statistical shortcuts”: they learn shortcuts that are valid only for the training distribution and are not robust to these inevitable distributional shifts. This “Achilles’ heel” manifests in several ways:

Extrapolation to Future Climates: Models trained exclusively on historical climate data often struggle to perform reliably when applied to future climate scenarios characterized by significantly different mean states, altered atmospheric dynamics, or novel patterns of variability. Hernanz et al. [120] reported sharp drops in CNN performance when the model was applied to future projections or to GCMs not included in the training set. The models may learn statistical relationships that are valid for the historical period but do not hold under substantial climate change.
Cross-GCM/RCM Transfer: Because of inherent differences in model physics, parameterizations, resolutions, and systematic biases, ML models trained to downscale the output of one GCM or RCM often exhibit degraded performance when applied to outputs from other climate models. This limits the ability to readily apply a single trained downscaling model across a multi-model ensemble.
Spatial Transferability: A model developed and trained for a specific geographical region may not transfer effectively to other regions with different climatological characteristics, topographic complexities, or land cover types. Local adaptations are often necessary, which can be data-intensive.

The fundamental root cause of these transferability issues often lies in the stationarity assumption implicitly made by many ML models. These models learn statistical correlations from the training data, and if these correlations are spurious or specific only to the training period’s climate regime, they will not generalize to non-stationary future conditions where the underlying physical relationships ($P(Y | X)$) may change.

Several mitigation efforts and active research frontiers are addressing this challenge:

Domain Adaptation Techniques: These methods aim to explicitly adapt a model trained on a “source” domain (e.g., historical data from one GCM) to perform well on a “target” domain (e.g., future data from a different GCM) where labeled high-resolution data may be scarce or unavailable [8]. A closely related strategy, transfer learning, exploits the hierarchical feature extraction of CNNs. The initial layers of a deep network typically learn fundamental, domain-agnostic primitives such as edges, gradients, and textural patterns. By “freezing” the weights of these lower layers and retraining only the final, task-specific layers on the target dataset, the number of trainable parameters, and hence the dimensionality of the optimization problem, is drastically reduced. This allows the model to adapt to new high-dimensional datasets without the large sample sizes typically required to learn fundamental feature representations from scratch.
Training on Diverse Data: A common strategy is to pre-train ML models on a wide array of data encompassing multiple GCMs, varied historical periods, and diverse geographical regions. The hypothesis is that exposure to greater variability will help the model learn more robust and invariant features that generalize better. For instance, Prasad et al. [17] found that training on diverse datasets (ERA5, MERRA2, NOAA CFSR) led to good zero-shot transferability for some tasks, though fine-tuning was still necessary for others, such as the two-simulation transfer involving NorESM data.
Pseudo-Global Warming (PGW) Experiments: This approach involves training or evaluating models using historical data that has been perturbed to mimic certain aspects of future climate conditions (e.g., by adding a GCM-projected warming signal). This allows for a more systematic assessment of a model’s extrapolation capabilities under changed climatic states.
Causal Machine Learning: There is growing interest in developing ML approaches that aim to learn underlying causal physical processes rather than just statistical correlations. Such models are hypothesized to be inherently more robust to distributional shifts.
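The pseudo-global-warming idea in the list above can be sketched in its simplest form: a month-dependent warming signal is added to a historical series. This is only a schematic stand-in, since full PGW experiments perturb the multivariate boundary conditions of a regional model. The function `pgw_perturb`, the synthetic temperature series, and the delta values are all hypothetical.

```python
import numpy as np

def pgw_perturb(historical, months, monthly_delta):
    """Add a month-dependent climate-change delta to a historical daily series.

    `months` holds calendar months (1-12) for each sample and `monthly_delta`
    holds 12 values, e.g., a GCM-projected mean warming per month.
    """
    deltas = np.asarray(monthly_delta, dtype=float)[np.asarray(months) - 1]
    return np.asarray(historical, dtype=float) + deltas

rng = np.random.default_rng(2)
# Ten synthetic 360-day years (12 months of 30 days each)
months = np.tile(np.repeat(np.arange(1, 13), 30), 10)
hist_t = 15.0 + 10.0 * np.sin(2 * np.pi * (months - 4) / 12) + rng.normal(0, 2, months.size)

# Hypothetical warming signal in °C, varying smoothly across the months
delta = np.linspace(2.0, 3.0, 12)
future_t = pgw_perturb(hist_t, months, delta)
```

Because the day-to-day variability of `hist_t` is preserved, any degradation in a downscaler evaluated on `future_t` can be attributed to the shifted mean state rather than to a change in weather sequences.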

The challenge of transferability is that strong performance on a historical test split does not guarantee reliability under covariate shift (e.g., different forcings, observational networks, or model regimes). Accordingly, studies often recommend reporting distribution-focused diagnostics in both the source and target domains (e.g., parity or quantile–quantile summaries, stratified by season and intensity) to verify that bias, variance, and tail behavior remain stable after transfer. Figure 6 demonstrates this diagnostic approach, comparing model performance between training and transfer domains to reveal distributional consistency. The key practical criteria are a near-unity relationship between predictions and references, residuals that do not drift across regimes, and no systematic degradation in the extremes; otherwise, additional calibration or domain-adaptation steps are needed before projections are used for impact assessment.
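The practical criteria in the preceding paragraph can be summarized numerically. The following sketch assumes paired prediction and reference samples in a source and a target domain, and uses a least-squares slope plus quantile differences as simple proxies for the parity and quantile–quantile summaries. The data are synthetic, and the damped target-domain emulator is a hypothetical failure case rather than a result from any study.

```python
import numpy as np

def transfer_diagnostics(pred, ref, probs=(0.5, 0.9, 0.99)):
    """Return the parity slope and predicted-minus-reference quantile differences."""
    slope = np.polyfit(ref, pred, 1)[0]  # near 1 indicates a near-unity relationship
    quantile_bias = np.quantile(pred, probs) - np.quantile(ref, probs)
    return slope, quantile_bias

rng = np.random.default_rng(5)
# Source domain: predictions track the reference with small random error
ref_src = rng.gamma(2.0, 3.0, 4000)
pred_src = ref_src + rng.normal(0, 0.5, 4000)
# Target domain: a hypothetical emulator that damps intensities after transfer
ref_tgt = rng.gamma(2.0, 4.0, 4000)
pred_tgt = 0.8 * ref_tgt + rng.normal(0, 0.5, 4000)

slope_src, qbias_src = transfer_diagnostics(pred_src, ref_src)
slope_tgt, qbias_tgt = transfer_diagnostics(pred_tgt, ref_tgt)
```

In this constructed case the source-domain slope stays near one while the target-domain slope falls to about 0.8 and the 99th-percentile bias turns strongly negative, which is the pattern that would call for recalibration or domain adaptation before use in impact assessment.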

Quantitative Case Studies

Cross-model transfer (temperature UNet emulator). In a pseudo-reality experiment, daily RMSE for a UNet emulator rose from ∼0.9 °C when evaluated on the same driving model used for training (UPRCM) to ∼2–2.5 °C when applied to unseen ESMs; for warm extremes (99th percentile) under future climate, biases were mostly within $[-0.5, 2]$ °C but reached up to 5 °C in some locations, and were larger than a linear baseline [120].
GAN downscaling artifacts (near-surface winds). Deterministic GAN super-resolution exhibited systematic low-variance (low-power) bias at fine scales and, under some partial frequency-separation settings, isolated high-power spikes at intermediate wavenumbers; allowing the adversarial loss to act across all frequencies restored fine-scale variance, but it also raised pixelwise errors via the double-penalty effect [88].
Classical SD variability and bias pitfalls (VALUE intercomparison). In a 50+ method cross-validation over Europe, several linear-regression SD variants showed very large precipitation biases—sometimes worse than raw model outputs—while some MOS techniques systematically under- or over-estimated variability (e.g., ISI-MIP under, DBS over), underscoring that method class alone does not guarantee robustness [4].
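The double-penalty effect mentioned in the GAN case study can be reproduced on a toy grid. The sketch below assumes a single idealized high-intensity feature and compares two imperfect predictions: one with the correct feature displaced by a few cells, and one that is featureless but has the correct domain mean. Field sizes and values are arbitrary.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two fields."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Reference field with one sharp 4x4 feature of intensity 10
truth = np.zeros((32, 32))
truth[10:14, 10:14] = 10.0

# Prediction A: identical feature, shifted 6 cells along the second axis
displaced = np.roll(truth, 6, axis=1)
# Prediction B: no feature at all, only the correct domain mean
blurred = np.full_like(truth, truth.mean())

rmse_displaced = rmse(displaced, truth)  # penalized twice: a miss and a false alarm
rmse_blurred = rmse(blurred, truth)
```

The displaced prediction retains the full variance of the reference yet scores worse on RMSE than the featureless field, which illustrates why pixelwise errors can rise when an adversarial loss restores fine-scale variance.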

9.2. Physical Consistency and Interpretability

Beyond statistical accuracy, two crucial aspects for the trustworthiness and scientific utility of ML-downscaled climate data are their physical consistency and the interpretability of the models that produce them.

9.2.1. Ensuring Physically Plausible Outputs

A significant concern with purely data-driven ML models is their potential to produce outputs that violate fundamental physical laws, such as the conservation of mass or energy, or exhibit thermodynamically inconsistent relationships between variables. For example, downscaled precipitation fields might not conserve water mass when aggregated back to the coarse scale, or predicted temperature and humidity fields might imply unrealistic atmospheric stability.

To move beyond performance evaluation on in-distribution test sets and rigorously assess a model’s generalization capabilities for climate projection, the following minimum set of OOD tests is recommended:

Cross-Model Generalization: Train the downscaling model on data from one climate data source (e.g., ERA5 reanalysis) and test its performance on an entirely different, unseen source (e.g., historical simulations from a CMIP6 model). This tests robustness to systematic biases and different statistical properties (covariate shift).
Future Climate Extrapolation (PGW): Evaluate the trained model on a Pseudo-Global Warming (PGW) dataset. PGW experiments modify historical data to represent future warmed conditions, providing a controlled test of the model’s ability to extrapolate to novel climate states.
Cross-Region Transfer: For models intended for broad applicability, train on one or more geographic regions and test on a held-out region with distinct climatological or topographical characteristics. This assesses the model’s ability to learn generalizable physical relationships rather than region-specific correlations.
Covariate Shift Detection and Adaptation: Before applying the model, quantify the distributional shift between the training predictors and the target application predictors using a metric like Maximum Mean Discrepancy (MMD) or energy distance [122,123]. In climate settings, Wasserstein distance is also used for model/field comparison [99]. Try lightweight adaptations—target-domain re-normalization, spatially aware cross-validation, and (where feasible) drift-aware fine-tuning—and reassess performance; see general drift/adaptation guidance [109] and transferability caveats in downscaling [120].
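The shift-detection step in the last item can be sketched with a kernel two-sample statistic. The example assumes a Gaussian (RBF) kernel with a fixed bandwidth and the biased (V-statistic) estimator of squared MMD; in practice the bandwidth is often set by a median heuristic, and predictors would be standardized feature vectors rather than the synthetic samples used here. The name `rbf_mmd2` is hypothetical.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth):
    """Biased estimate of squared MMD between two samples (rows are samples)."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, then a Gaussian kernel
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

rng = np.random.default_rng(3)
train = rng.normal(0.0, 1.0, size=(300, 2))    # training predictors
same = rng.normal(0.0, 1.0, size=(300, 2))     # new sample from the same distribution
shifted = rng.normal(1.0, 1.2, size=(300, 2))  # warmer, more variable predictors

mmd_same = rbf_mmd2(train, same, bandwidth=1.0)
mmd_shifted = rbf_mmd2(train, shifted, bandwidth=1.0)
```

A value of `mmd_shifted` well above `mmd_same` flags a predictor distribution that the model has not seen; comparing against the same-distribution baseline gives a rough sense of what counts as a meaningful shift for a given sample size.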

Within the body of work we surveyed, explicit enforcement of physical laws (e.g., conservation of mass or energy) is relatively uncommon compared to purely data-driven approaches. While we did not attempt to quantify this precisely, our reading of the literature indicates that physics-aware methods remain underrepresented. This lack of physical grounding can lead to scientifically questionable or misleading results, undermining confidence in ML-based downscaling. Efforts to address this challenge fall broadly into two categories:

Physics-Informed Neural Networks (PINNs) and Constrained Learning:
− Soft Constraints: This approach involves incorporating penalty terms into the model’s loss function that discourage violations of known physical laws. The total loss becomes a weighted sum of a data-fidelity term and a physics-based regularization term (e.g., $L_{total} = L_{data} + \lambda_{physics} L_{physics}$). Physics-informed loss functions have been explored to guide models towards more physically realistic solutions. While soft constraints can reduce the frequency and magnitude of physical violations, they may not eliminate them entirely and can introduce a trade-off between statistical accuracy and physical consistency [25].
− Hard Constraints: These methods aim to strictly enforce physical laws by design, either by modifying the neural network architecture itself or by adding specialized output layers that ensure the predictions satisfy the constraints. Harder et al. [27] introduced additive, multiplicative, and softmax-based constraint layers that can guarantee, for example, mass conservation between low-resolution inputs and high-resolution outputs. Such hard-constrained approaches have been shown to not only ensure physical consistency but also, in some cases, improve predictive performance and generalization [27]. The rationale for PINNs includes reducing the dependency on large datasets and enhancing model robustness by ensuring physical consistency, especially in data-sparse regions or for out-of-sample predictions [26]. Recent work explores Attention-Enhanced Quantum PINNs (AQ-PINNs) for climate modeling applications like fluid dynamics, aiming for improved accuracy and computational efficiency [124].
Hybrid Dynamical–Statistical Models: Another avenue is to combine the strengths of ML with traditional physics-based dynamical models (RCMs). This can involve using ML to emulate computationally expensive components of RCMs, to statistically post-process RCM outputs (e.g., for bias correction or further downscaling), or to develop hybrid frameworks where ML and dynamical components interact [8,18]. For example, “dynamical–generative downscaling” approaches combine an initial stage of dynamical downscaling with an RCM to an intermediate resolution, followed by a generative AI model (like a diffusion model) to further refine the resolution to the target scale. This leverages the physical consistency of RCMs and the efficiency and generative capabilities of AI [9]. Such hybrid models aim to achieve a balance between computational feasibility, physical realism, and statistical skill.
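The difference between soft and hard constraints can be made concrete with a small sketch. It assumes that "conservation" means each block of high-resolution cells must average to the corresponding low-resolution value, and it implements a multiplicative rescaling in the spirit of the constraint layers described above; it is not a reproduction of any published implementation, and all function names are hypothetical.

```python
import numpy as np

def block_mean(hr, factor):
    """Average a 2-D high-resolution field over non-overlapping factor x factor blocks."""
    rows, cols = hr.shape
    return hr.reshape(rows // factor, factor, cols // factor, factor).mean(axis=(1, 3))

def soft_constrained_loss(hr_pred, hr_true, lr, factor, lam):
    """L_total = L_data + lam * L_physics, where L_physics penalizes block-mean mismatch."""
    data_term = np.mean((hr_pred - hr_true) ** 2)
    physics_term = np.mean((block_mean(hr_pred, factor) - lr) ** 2)
    return float(data_term + lam * physics_term)

def multiplicative_constraint(hr_raw, lr, factor, eps=1e-12):
    """Rescale each block of a non-negative field so that its mean equals the coarse value."""
    scale = lr / np.maximum(block_mean(hr_raw, factor), eps)
    return hr_raw * np.kron(scale, np.ones((factor, factor)))

rng = np.random.default_rng(1)
lr = rng.uniform(1.0, 5.0, size=(3, 3))       # synthetic coarse field
hr_raw = rng.uniform(0.1, 2.0, size=(6, 6))   # synthetic unconstrained network output
hr_constrained = multiplicative_constraint(hr_raw, lr, factor=2)

print("max block-mean violation before:", np.abs(block_mean(hr_raw, 2) - lr).max())
print("max block-mean violation after: ", np.abs(block_mean(hr_constrained, 2) - lr).max())
```

The soft penalty only discourages violations in proportion to `lam`, whereas the rescaling step removes them by construction, which mirrors the trade-off discussed above.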

The integration of physical knowledge, whether through soft or hard constraints or via hybrid modeling, is increasingly recognized as crucial for developing ML downscaling methods that are not only accurate but also scientifically credible and reliable for climate change applications.

9.2.2. Explainable AI (XAI): Unmasking the “Black Box”

Deep learning models are often criticized as “black boxes” because their complex internal mechanisms obscure the reasoning behind predictions [8]. In climate science, where process understanding and trust in projections are paramount, this opacity is a major barrier. XAI methods aim to illuminate how models arrive at decisions; however, we observed that relatively few reviewed studies incorporate XAI.

The Need for Interpretability

XAI is important for the following:

Model validation and debugging: Identifying which input features the model relies on helps determine whether it learned scientifically meaningful relationships or is exploiting spurious correlations/artifacts in the training data—a shortcut-learning failure mode where models can appear “right for the wrong reasons”.
Scientific discovery: Highlighting unexpected learned relationships may reveal new insights into climate processes.
Building trust: Models whose decision logic aligns with physical understanding are more likely to be trusted by domain scientists and policymakers.
Identifying biases: XAI can expose hidden biases in the model or in the training data.

Common XAI Techniques Applied to Downscaling

Saliency maps and feature attribution: Integrated Gradients, DeepLIFT, and Layer-Wise Relevance Propagation (LRP) attribute an output (e.g., a high-resolution pixel) back to input features (e.g., coarse predictor fields), revealing influential regions or variables [8]. González-Abad et al. [65] proposed aggregated saliency maps for CNN-based downscaling and found models may rely on spurious teleconnections or ignore important physical predictors. LRP has also been adapted for climate semantic-segmentation tasks (e.g., tropical cyclone and atmospheric river detection) to test whether CNNs use physically plausible input patterns [125].
Grad-CAM: Gradient-weighted Class Activation Mapping produces coarse localization maps of input regions important for predicting a class (or, when adapted, a specific regression output) [126]. While useful for visualization, Grad-CAM may not clearly differentiate between input variables [125].
SHAP: SHapley Additive exPlanations quantify each feature’s contribution to a prediction using cooperative game theory [127]. SHAP can reveal features that degrade forecast accuracy, though in some contexts it may inaccurately rank important features [128].
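To show what a gradient-based attribution computes, the sketch below approximates Integrated Gradients with a midpoint Riemann sum. The quadratic toy model and its analytic gradient are hypothetical stand-ins; for a trained downscaler the gradient would normally come from automatic differentiation.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients along a straight path."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x, dtype=float)
    for alpha in alphas:
        avg_grad += grad_fn(baseline + alpha * (x - baseline))
    avg_grad /= steps
    return (x - baseline) * avg_grad

# Hypothetical toy "downscaler": y = sum(w * x**2) over three coarse predictors.
weights = np.array([0.5, -1.0, 2.0])

def model(x):
    return float(np.sum(weights * x ** 2))

def model_grad(x):
    return 2.0 * weights * x

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)
attributions = integrated_gradients(model_grad, x, baseline)

print("attributions:", attributions)
print("sum of attributions:", attributions.sum())
print("model(x) - model(baseline):", model(x) - model(baseline))
```

The attributions sum to the difference between the prediction and the baseline prediction (the completeness property), which gives a simple sanity check before attribution maps are interpreted physically.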

Challenges in XAI for Climate Downscaling

Faithfulness and Plausibility: Ensuring that explanations truly reflect the model’s internal decision-making process (faithfulness) and are consistent with physical understanding (plausibility) is challenging [129]. Different XAI methods can yield different, sometimes conflicting, explanations for the same prediction [130].
Relating Attributions to Physical Processes: While methods like integrated gradients are mathematically sound, the resulting attribution maps can be difficult to directly relate to specific, understandable physical processes or mechanisms.
Standardization: Methodologies and reporting standards for XAI in climate downscaling remain inconsistent, making comparisons across studies difficult. There is also no consensus on benchmark metrics, which hinders systematic evaluation [129].
Beyond Post Hoc Explanations: Current XAI often provides post hoc explanations. There is a growing call to move towards building inherently interpretable models or to integrate interpretability considerations into the model design process itself, drawing lessons from how dynamical climate models are understood at a component level. This involves striving for “component-level understanding” where model behaviors can be attributed to specific architectural components or learned representations.

The development and integration of robust and meaningful XAI techniques are vital for transforming ML downscaling models from “black boxes” into transparent and trustworthy scientific tools.

9.3. Representation of Extreme Events

Climate extremes, such as heavy precipitation, droughts, heatwaves, and cold spells, often have the most significant societal and environmental impacts. Therefore, the ability of downscaling methods to accurately represent the characteristics of these extreme events (e.g., frequency, intensity, duration, spatial extent) is of paramount importance [37].

9.3.1. The Challenge

Standard ML models, particularly those trained with common loss functions like MSE, tend to perform poorly in capturing extreme events. This is often because of the following:

Data Imbalance: Extreme events are rare by definition, leading to their under-representation in training datasets—an issue long recognized in extreme value analysis [98]. Models optimized to minimize average error across all data points may thus prioritize fitting common, non-extreme values, effectively “smoothing over” or underestimating extremes. In precipitation downscaling, tail-aware training (e.g., quantile losses) has been used precisely to counter this tendency [36]; empirical studies also note that standard DL architectures can underestimate heavy precipitation and smooth spatial variability in extremes [37,120].
Loss Function Bias: MSE loss, for example, penalizes large errors quadratically, which might seem beneficial for extremes. However, because extremes are infrequent, their contribution to the total loss can be small, and the model may learn to predict values closer to the mean to minimize overall MSE, thereby underpredicting the magnitude of extremes. This regression-to-the-mean behavior under quadratic criteria is well documented in hydrologic error decompositions [131]; tail-focused alternatives such as quantile (pinball) losses offer a direct mitigation [36].
Failure to Capture Compound Extremes: Models may also struggle to capture the co-occurrence of multiple extreme conditions (e.g., concurrent heat and drought), which requires learning cross-variable dependence structures. Reviews of compound events highlight the prevalence and impacts of such co-occurrences and the difficulty for standard single-target pipelines to reproduce them [132,133]; see also evidence on changing risks of concurrent heat–drought in the U.S. [134].
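The pull toward the mean under quadratic loss, and the way a quantile (pinball) loss counters it, can be seen with a constant predictor on a skewed sample. This is a minimal sketch on synthetic gamma-distributed "precipitation"; the distribution parameters are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

def mse(y, q):
    """Mean squared error of a constant prediction q."""
    return float(np.mean((y - q) ** 2))

def pinball_loss(y, q, tau):
    """Quantile (pinball) loss of a constant prediction q at quantile level tau."""
    err = y - q
    return float(np.mean(np.maximum(tau * err, (tau - 1.0) * err)))

rng = np.random.default_rng(42)
precip = rng.gamma(shape=0.5, scale=8.0, size=5000)   # synthetic, strongly skewed sample

# Search for the best constant prediction under each loss.
candidates = np.linspace(0.0, precip.max(), 2001)
best_mse = candidates[np.argmin([mse(precip, c) for c in candidates])]
best_q95 = candidates[np.argmin([pinball_loss(precip, c, 0.95) for c in candidates])]

print(f"MSE-optimal constant:     {best_mse:.2f} (sample mean {precip.mean():.2f})")
print(f"pinball-optimal constant: {best_q95:.2f} (sample q95 {np.quantile(precip, 0.95):.2f})")
```

The MSE-optimal value sits at the sample mean, well below the upper tail, while the pinball-optimal value tracks the 95th percentile. The same mechanism operates per grid cell when a network is trained with either loss.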

9.3.2. Specialized Approaches for Extremes

Recognizing these limitations, researchers have developed and applied various specialized techniques:

Tailored Loss Functions: Using loss functions that give more weight to extreme values or are specifically designed for tail distributions. Examples include the following:
−
Weighted Loss Functions: Assigning higher weights to errors associated with extreme events (e.g., the $L_{extreme}$ term in Equation (1) of [84]).
−
Quantile Regression: Quantile Regression (QR) offers a powerful approach by directly modeling specific quantiles of a variable’s conditional distribution, which inherently allows for a detailed focus on the distribution’s tails and thus on extreme values. For instance, Quantile Regression Neural Networks (QRNNs), as implemented by Cannon [36], provide a flexible, nonparametric, and non-linear method. This approach avoids restrictive assumptions about the data’s underlying distribution shape, a significant advantage for complex climate variables like precipitation where parametric forms are often inadequate. A key feature of the QRNN presented is its ability to handle mixed discrete-continuous variables, such as precipitation amounts (which include zero values alongside a skewed distribution of positive amounts). This is achieved through censored quantile regression, making the model adept at representing both the occurrence and varying intensities of precipitation, including extremes.
Cannon [36] notes this was the first implementation of a censored quantile regression model that is non-linear in its parameters. Furthermore, the methodology allows for the full predictive probability density function (pdf) to be derived from the set of modeled quantiles. This enables more comprehensive probabilistic assessments, such as estimating arbitrary prediction intervals, calculating exceedance probabilities for critical thresholds (i.e., performing extreme value analysis), and evaluating risks associated with different outcomes. To enhance model robustness and mitigate overfitting, especially when data for extremes might be sparse, Cannon [36] incorporates techniques like weight penalty regularization and bootstrap aggregation (bagging). The practical relevance to downscaling is demonstrated through an application to a precipitation downscaling task, where the QRNN model showed improved skill over linear quantile regression and climatological forecasts. Importantly, the paper also suggests that QRNNs could be a “viable alternative to parametric ANN models for non-stationary extremes”, a crucial consideration for climate change impact studies where the characteristics of extreme events are expected to evolve. The Quantile-Regression-Ensemble (QRE) algorithm trains members on distinct subsets of precipitation observations corresponding to specific intensity levels, showing improved accuracy for extreme precipitation [121].
−
Bernoulli-Gamma or Tweedie Distributions: For precipitation, which has a mixed discrete-continuous distribution (zero vs. non-zero amounts, and varying intensity), loss functions based on these distributions (e.g., minimizing Negative Log-Likelihood—NLL) can better model both occurrence and intensity, including extremes [121].
−
Dice Loss and Focal Loss: These are explored for handling sample imbalance in heavy precipitation forecasts, with Dice Loss showing similarity to threat scores and effectively suppressing false alarms while improving hits for heavy precipitation [84].
Generative Models (GANs and Diffusion Models): These models, by learning the underlying data distribution, can be better at generating realistic extreme events compared to deterministic regression models [1]. Diffusion models, in particular, have shown promise in capturing the fine spatial features of extreme precipitation and reproducing intensity distributions more accurately than GANs or CNNs [135].
Data Augmentation: Techniques to artificially increase the representation of extreme events in the training data, as used in the SRDRN model [21].
Architectural Modifications: Designing model architectures or components specifically to handle extremes, such as the gradient-guided attention model for discontinuous precipitation by Xiang et al. [70] or multi-scale gradient processing in GANs. Beyond tailored loss functions and data augmentation, the architectural choices within generative frameworks and other advanced models are also pivotal for addressing the severe class imbalance inherent in extreme events and for capturing their unique characteristics. For instance, some GAN variants, such as evtGAN, integrate Extreme Value Theory to better model the tails of distributions associated with rare events. Other architectural improvements, like the use of multi-scale gradients in MSG-GAN-SD, aim for more stable training dynamics, which is a general challenge in GANs [48,114]. Diffusion models, while noted for their stable training and ability to capture fine spatial details of extremes such as precipitation [18], might inherently be better at representing multimodal distributions and capturing tail behavior because of their iterative refinement process. This could make them less prone to the averaging effects that often cause simpler architectures to underestimate extremes. Similarly, attention mechanisms in Transformers, if appropriately designed, could learn to focus on subtle precursors or localized features indicative of rare, high-impact events, thereby complementing specialized loss functions in a synergistic manner. Effectively tackling extreme events thus necessitates a holistic approach where the model architecture itself is capable of learning and representing the complex, often subtle, features that characterize these rare phenomena, rather than relying solely on adjustments to the loss function or data handling.
Extreme Value Theory (EVT) Integration: Combining ML with EVT provides a statistical framework for modeling the tails of distributions. For instance, evtGAN combines GANs with EVT to model spatial dependencies in temperature and precipitation extremes [114]. Models using Generalized Pareto Distribution (GPD) for tails can incorporate covariates from climate models to improve estimates [98].
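Among the tailored losses listed above, the Bernoulli-Gamma negative log-likelihood is compact enough to write out. The sketch assumes a common parameterization (rain probability `p`, gamma shape `alpha`, gamma scale `beta`); in a network these three values would be predicted per grid cell, and the parameter names here are illustrative.

```python
import math

def bernoulli_gamma_nll(y, p, alpha, beta):
    """Negative log-likelihood of one precipitation amount y under a Bernoulli-Gamma model.

    p     : probability that precipitation occurs (0 < p < 1)
    alpha : gamma shape parameter for wet-day amounts
    beta  : gamma scale parameter for wet-day amounts
    """
    if y <= 0.0:
        # Dry case: only the occurrence term contributes.
        return -math.log(1.0 - p)
    log_gamma_pdf = ((alpha - 1.0) * math.log(y) - y / beta
                     - alpha * math.log(beta) - math.lgamma(alpha))
    return -(math.log(p) + log_gamma_pdf)

def mean_nll(amounts, p, alpha, beta):
    """Average NLL over a sequence of observations sharing the same parameters."""
    return sum(bernoulli_gamma_nll(y, p, alpha, beta) for y in amounts) / len(amounts)

observations = [0.0, 0.0, 1.2, 0.0, 14.5]   # synthetic daily amounts
print("mean NLL:", round(mean_nll(observations, p=0.4, alpha=0.8, beta=6.0), 4))
```

Because occurrence and intensity enter the likelihood separately, zero and non-zero days are both scored consistently, which is the property that motivates these losses for precipitation.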

The accurate representation of extreme events remains an active and critical research area. The limitations of standard ML approaches in this regard highlight the necessity for domain-specific adaptations and the integration of statistical theories of extremes to ensure that downscaled projections are useful for risk assessment and adaptation planning.

9.4. Uncertainty Quantification (UQ)

Climate projections are inherently uncertain, arising from multiple sources including future emission scenarios, GCM structural differences, internal climate variability, and the downscaling process itself. Quantifying these uncertainties is essential for robust decision-making. However, explicit modeling of predictive uncertainty beyond a single deterministic output remains relatively uncommon in the surveyed literature. While we did not quantify them, uncertainty-aware methods clearly represent a minority of existing approaches, underscoring the need for broader adoption and careful calibration [55].

Sources of Uncertainty

Aleatoric Uncertainty: Represents inherent randomness or noise in the data and the process being modeled (e.g., unpredictable small-scale atmospheric fluctuations).
Epistemic Uncertainty: Arises from limitations in model knowledge, including model structure, parameter choices, and limited training data. This uncertainty is, in principle, reducible with more data or better models.
Scenario Uncertainty: Uncertainty in future greenhouse gas emissions and other anthropogenic forcings.
GCM Uncertainty: Structural differences among GCMs lead to a spread in projections even for the same scenario.
Downscaling Model Uncertainty: The statistical downscaling model itself introduces uncertainty.

UQ Approaches in ML Downscaling

Ensemble Methods:
−
Deep Ensembles: Training multiple instances of the same DL model with different random initializations (and potentially variations in training data via bootstrap sampling) and then combining their predictions to estimate both the mean and the spread (uncertainty) [69,136]. DeepESD [10] is an example of a CNN ensemble framework that quantifies inter-model spread from multiple GCM inputs and internal model variability. Deep ensembles can improve UQ, especially for future periods, by providing confidence intervals [69]. The optimal ensemble size for improving both the mean prediction and its uncertainty estimate is often found to be around 3–6 members [69].
−
Multi-Model Ensembles (MMEs): Applying a downscaling model to outputs from multiple GCMs to capture inter-GCM uncertainty.
Bayesian Neural Networks (BNNs): These models learn a probability distribution over their weights, rather than point estimates. By sampling from this posterior distribution, BNNs can provide probabilistic predictions that inherently quantify both aleatoric and epistemic uncertainty [137]. Techniques like Monte Carlo dropout are often used as a practical approximation to Bayesian inference in deep networks [137]. Bayesian AIG-Transformer and Precipitation CNN (PCNN) are examples of models incorporating these techniques for downscaling wind and precipitation [136,138].
Strengths: Provide a principled way to decompose uncertainty into aleatoric and epistemic components.
Weaknesses: Can be computationally more expensive to train and sample from compared to deterministic models or simple ensembles.
Generative Models for Probabilistic Output: GANs and Diffusion Models can, in principle, learn the conditional probability distribution $P(Y_{HR} | X_{LR})$ and generate multiple plausible high-resolution realizations for a given low-resolution input, thus providing a form of ensemble for UQ. Diffusion models, in particular, are noted for their ability to model complex distributions effectively [1].
Quantile Regression: As mentioned for extremes, models that predict quantiles of the distribution (e.g., Quantile Regression Neural Networks [36]) directly provide information about the range of possible outcomes.
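Whichever approach produces the members (deep ensembles, Monte Carlo dropout samples, or generative realizations), the summary step is similar. The following minimal sketch assumes the member predictions are already stacked along the first axis; the function name and returned keys are illustrative.

```python
import numpy as np

def summarize_ensemble(member_preds, lower=0.05, upper=0.95):
    """Summarize predictions of shape (n_members, ...) into mean, spread, and an interval."""
    member_preds = np.asarray(member_preds, dtype=float)
    return {
        "mean": member_preds.mean(axis=0),
        "spread": member_preds.std(axis=0, ddof=1),            # sample standard deviation
        "interval": np.quantile(member_preds, [lower, upper], axis=0),
    }

# Three synthetic members predicting two grid cells.
members = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
summary = summarize_ensemble(members)
print("mean:", summary["mean"])
print("spread:", summary["spread"])
print("5-95% interval:\n", summary["interval"])
```

With only a handful of members the empirical interval is coarse, so the spread and interval are better read as indicative than as calibrated probabilities unless reliability has been checked separately.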

Challenges in UQ

Computational Cost: Probabilistic methods like BNNs and large ensembles can be computationally intensive.
Validation of Uncertainty: Validating the reliability of uncertainty estimates, especially for future projections where ground truth is unavailable, is a significant challenge. Pseudo-reality experiments are often used for this [69].
Communication of Uncertainty: Effectively communicating complex, multi-faceted uncertainty information to end-users and policymakers is crucial but non-trivial.
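Where reference data exist (for example in a pseudo-reality experiment), uncertainty estimates can be checked with simple scores. The sketch below gives an ensemble estimator of the CRPS for a single observation and an empirical interval-coverage rate; it is a minimal illustration, and the function names are assumptions.

```python
import numpy as np

def ensemble_crps(members, obs):
    """Empirical CRPS of an ensemble for one scalar observation (lower is better)."""
    members = np.asarray(members, dtype=float)
    term_obs = np.mean(np.abs(members - obs))
    term_spread = np.mean(np.abs(members[:, None] - members[None, :]))
    return float(term_obs - 0.5 * term_spread)

def interval_coverage(lower, upper, obs):
    """Fraction of observations that fall inside their predicted interval."""
    lower, upper, obs = (np.asarray(a, dtype=float) for a in (lower, upper, obs))
    return float(np.mean((obs >= lower) & (obs <= upper)))

print("CRPS, two-member ensemble:", ensemble_crps([0.0, 2.0], 1.0))
print("CRPS, single member (equals absolute error):", ensemble_crps([3.0], 1.0))
print("coverage:", interval_coverage([0, 0, 0, 0], [1, 1, 1, 1], [0.5, 2.0, 0.2, -1.0]))
```

A nominal 90% interval whose empirical coverage is far from 0.9 signals over- or under-confidence even when the CRPS looks competitive.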

Developing and implementing robust UQ methods is essential for building trust in ML-based downscaling and for providing actionable climate information that reflects the inherent uncertainties in climate projections.

9.5. Reproducibility, Data Handling, and Methodological Rigor

Beyond the core challenges of model performance and physical realism, several practical and methodological issues affect the reliability and advancement of the field.

Reproducibility: Ensuring that research findings can be independently verified is a cornerstone of scientific progress. In ML-based downscaling, this involves the following:
−
Public Code and Data: Sharing model code, training data (or clear pointers to standard datasets), and pre-trained model weights [8].
−
Containerization and Deterministic Environments: Using tools like Docker to create reproducible software environments and ensuring deterministic operations in model training and inference where possible [139].
−
Well-Defined Train/Test Splits and Evaluation Protocols: Clearly documenting how data are split for training, validation, and testing, and using standardized evaluation protocols (like VALUE [107]) to facilitate fair comparisons across studies.
Baselines. The seven-method study by Vandal et al. [1] justifies using strong linear/bias-correction baselines (BCSD, Elastic-Net, hybrid BC + ML) alongside modern DL.
Spectral/structure metrics. Following Harris et al. [90] and Annau et al. [88], include power spectra/structure functions, fraction skill scores, and spatial-coherence diagnostics to detect texture hallucinations and scale mismatch.
Uncertainty metrics. For probabilistic models (GAN/VAEs/diffusion), report CRPS, reliability diagrams/PIT, and quantile/interval coverage (as in [36,90]).
Tail-aware metrics. Report quantile-oriented scores (e.g., QVSS), return-level/return-period consistency, and extreme-event FSS where relevant (cf. [115]).
Warming/OOD tests. Explicitly include warming/OOD tests (e.g., pseudo-global-warming or future-slice validation). Rampal et al. [19] show intensity-aware losses and residual two-stage designs can improve robustness for extremes under warming.
−
Active Frontiers: As noted in recent papers (e.g., Quesada-Chacón et al. [8]), while reproducibility advances are being made through such efforts, consistent adoption of best practices across the community is still needed to ensure the robustness and verifiability of research findings.
Data Handling Issues:
−
Collinearity: High correlation among predictors (e.g., physically related fields such as temperature, humidity, pressure, and winds) can inflate coefficient variance, destabilize feature attribution, and make sensitivity analyses misleading. Practical diagnostics include (i) pairwise correlation heatmaps, (ii) variance inflation factors (VIF), and (iii) condition indices; however, VIF thresholds should be interpreted in context rather than treated as universal cutoffs [140,141]. Mitigations include predictor grouping (ablate correlated predictors as a set), regularization, dimensionality reduction (e.g., PCA/PLS), and domain-informed variable selection [128,140].
−
Feature Ablation and Importance Checks: To ensure predictors are genuinely contributing (and not acting as proxies for correlated fields), use structured ablations (drop-one/drop-group), permutation importance, and stability checks across regimes (e.g., wet vs. dry seasons, coastal vs. inland subregions). Report performance deltas (e.g., $Δ$ RMSE, $Δ$ FSS, $Δ$ tailMAE) rather than only “top features”, and interpret importance jointly with collinearity diagnostics [128].
−
Sensitivity to Random Initialization (Seed Variance): Neural downscalers can exhibit non-trivial run-to-run variability because of random initialization, data shuffling, and nondeterministic GPU kernels. To avoid over-interpreting single-run results, report the distribution of scores across multiple seeds (e.g., mean ± std over S runs) and, when feasible, select models based on expected validation performance rather than “best seed luck” [142,143]. When probabilistic inference is needed, approaches such as MC-dropout can reflect parameter uncertainty at test time [137].
−
Suppressor Variables: Suppressors are predictors that may appear marginally weak (or even negatively correlated) yet improve performance when combined with other predictors by removing irrelevant variance or resolving confounding. A practical diagnostic is a sign-flip check: compare marginal association (univariate) versus partial association (multivariate) and flag predictors whose effect changes sign or whose coefficient magnitude increases markedly after controlling for correlated variables. Use hierarchical modeling/ablation to verify that improvements persist out-of-sample [144].

Actionable diagnostic template (recommended reporting standard). To shift these issues from general commentary to workflow guidance, we recommend that ML downscaling studies explicitly report a minimal set of diagnostics (Table 3): (i) ablation outcomes (drop-one and drop-group with clear $Δ$ skill), (ii) seed sensitivity (score distributions across S random seeds), and (iii) collinearity/suppressor screening (correlation/VIF plus sign-flip or partial-association checks). This makes feature conclusions auditable, improves reproducibility, and helps distinguish genuinely informative predictors from correlated proxies.
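Two of these diagnostics, seed sensitivity and collinearity/suppressor screening, can be sketched in a few lines. The example uses synthetic predictors constructed so that one acts as a suppressor; the helper names are hypothetical, and a linear regression stands in for the partial-association check.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per predictor, read from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def seed_summary(scores):
    """Mean and sample standard deviation of a score across random seeds."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=1))

def sign_flip_flags(X, y):
    """Flag predictors whose marginal and multivariate (partial) associations differ in sign."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    marginal = Xs.T @ yc / len(yc)                      # univariate association
    partial, *_ = np.linalg.lstsq(Xs, yc, rcond=None)   # multivariate coefficients
    return np.sign(marginal) != np.sign(partial)

# Synthetic predictors: column 1 mostly carries the noise that contaminates column 0.
rng = np.random.default_rng(7)
signal, noise, other = rng.normal(size=(3, 2000))
X = np.column_stack([signal + noise, noise + 0.2 * signal, other])
y = signal + 0.5 * other

print("VIF:", np.round(variance_inflation_factors(X), 2))
print("sign flips:", sign_flip_flags(X, y))
print("RMSE across seeds (mean, std):", seed_summary([1.0, 2.0, 3.0]))
```

In this construction the second predictor is weakly positively related to the target on its own but receives a negative coefficient once the first predictor is included, which is the sign-flip pattern the screening is meant to surface.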

Methodological Rigor in Evaluation:
−
Beyond Standard Metrics: While RMSE is a common metric, it may not capture all relevant aspects of downscaling performance, especially for spatial patterns, temporal coherence, or extreme events. A broader suite of metrics is needed, including the following:
*
Spatial correlation, structural similarity index (SSIM) [46].
*
Metrics for extremes (e.g., precision, recall, critical success index for precipitation thresholds; metrics from Extreme Value Theory like GPD parameters or return levels) [8].
*
Metrics for distributional similarity (e.g., Earth Mover’s Distance, Kullback–Leibler divergence) [145].
*
Metrics for temporal coherence and spatial consistency (e.g., spectral analysis, variogram analysis, or specific metrics like Restoration Rate and Consistency Degree from TemDeep [6]). The Modified Kling–Gupta Efficiency (KGE) decomposes performance into correlation, bias, and variability [131,146].
−
Out-of-Sample Validation: A key pitfall in ML-based downscaling is that standard random k-fold cross-validation is generally invalid for spatio-temporal climate data, because spatial and temporal autocorrelation causes information leakage between train and test folds. This leakage can substantially inflate apparent skill and produce overly optimistic uncertainty estimates. Therefore, we recommend the following minimum standard: (i) never use purely random folds for gridded fields or station networks with spatial dependence; (ii) use blocked (spatial and/or temporal) splits that reflect the intended deployment setting; and (iii) explicitly report the split design (block definition, block size, any buffers, number of folds, and whether the test fold is out-of-time and/or out-of-region). These practices are well supported in the broader spatio-temporal modeling literature [147,148]. Practical validation options include the following:
*
Spatial blocked k-fold CV: Partition the domain into contiguous spatial blocks and hold out entire blocks for testing. Blocks should be larger than the effective spatial dependence scale of the target variable/covariates to reduce leakage. This is the default choice for gridded downscaling evaluation when the deployment involves new locations within the same climate regime [147,148].
*
Leave-Location-Out (LLO) CV: Hold out entire stations, catchments, or geographically distinct subregions. This provides a stringent test of spatial generalization to unseen locations and is especially appropriate for station-based or sparse-site downscaling [22,147].
*
Buffered spatial CV: Add a buffer zone around the held-out test region and exclude training samples within that buffer to further minimize leakage because of spatial proximity. As a practical guideline, the buffer distance should be on the order of the estimated decorrelation length (or larger) for the relevant fields [147,149].
*
Temporal blocked CV/forward-chaining: Hold out contiguous time periods (e.g., the most recent years) rather than random days/months. This is essential when models will be applied across time or under evolving climate conditions; it tests robustness to temporal regime shifts and reduces leakage from serial dependence [147].
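Of the metrics listed above, the modified Kling–Gupta Efficiency is easy to compute and report component by component. The sketch assumes the commonly used form with a correlation term r, a bias ratio β (ratio of means), and a variability ratio γ (ratio of coefficients of variation); it requires a non-zero observed mean.

```python
import numpy as np

def modified_kge(sim, obs):
    """Modified KGE and its components: correlation r, bias ratio beta, variability ratio gamma."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]
    beta = sim.mean() / obs.mean()
    gamma = (sim.std() / sim.mean()) / (obs.std() / obs.mean())
    kge = 1.0 - np.sqrt((r - 1.0) ** 2 + (beta - 1.0) ** 2 + (gamma - 1.0) ** 2)
    return {"kge": float(kge), "r": float(r), "beta": float(beta), "gamma": float(gamma)}

obs = np.array([1.0, 3.0, 2.0, 6.0, 4.0])   # synthetic observations
print("perfect simulation:", modified_kge(obs, obs))
print("doubled simulation:", modified_kge(2.0 * obs, obs))
```

Reporting r, β, and γ alongside the aggregate score shows whether a deficit comes from timing, mean bias, or misrepresented variability; in the doubled case only β departs from 1.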

Selecting a validation scheme: decision criteria. Because the appropriate CV design depends on the intended deployment, we recommend making the selection explicit using the following rules of thumb (summarized in Table 4): (i) use LLO when the scientific question is generalization to new sites/locations (especially for station-based or sparse-site downscaling); (ii) use spatial blocked CV when the target is a continuous gridded field and you need independence from nearby training pixels; (iii) add a buffer when leakage risk is high because of strong spatial dependence; and (iv) when making future-projection claims, include an out-of-time “warm test” and, when feasible, a scenario-based stress test such as PGW [150]. These recommendations follow established guidance for structured (spatio-temporal) CV [147,148].

Buffer distance (practical guidance). When buffered CV is used, the buffer distance should be tied to the data’s dependence scale rather than a universal number: estimate an effective decorrelation length (e.g., using an empirical correlogram/variogram for the target variable and key predictors), then set $d_{buffer} ≳ L_{decorr}$ (and preferably based on the largest relevant dependence scale among predictors and the target). If $L_{decorr}$ is not estimated, treat buffers as a sensitivity parameter and report results for multiple plausible values (e.g., small/medium/large buffers) [147,148].
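A buffered spatial split can be sketched as follows. The example assumes planar coordinates in arbitrary units, square blocks, and a user-supplied `d_buffer`; estimating the decorrelation length itself is outside the sketch, and the function names are hypothetical.

```python
import numpy as np

def spatial_block_ids(x, y, block_size):
    """Assign every point to a square block of side `block_size` and return block labels."""
    bx = np.floor((x - x.min()) / block_size).astype(int)
    by = np.floor((y - y.min()) / block_size).astype(int)
    return bx * (by.max() + 1) + by

def buffered_split(x, y, block_ids, test_block, d_buffer):
    """Boolean train/test masks; training points within d_buffer of any test point are dropped."""
    coords = np.column_stack([x, y])
    test = block_ids == test_block
    gaps = coords[:, None, :] - coords[test][None, :, :]
    dist_to_test = np.linalg.norm(gaps, axis=-1).min(axis=1)
    train = ~test & (dist_to_test > d_buffer)
    return train, test

# Synthetic 10 x 10 grid with unit spacing, split into four 5 x 5 blocks.
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
x, y = gx.ravel(), gy.ravel()
blocks = spatial_block_ids(x, y, block_size=5.0)
train, test = buffered_split(x, y, blocks, test_block=0, d_buffer=2.0)

print("test points:", int(test.sum()))
print("training points kept:", int(train.sum()))
print("points discarded in the buffer:", int((~test & ~train).sum()))
```

Repeating the split for several values of `d_buffer` and reporting the resulting skill is one way to carry out the sensitivity analysis suggested when the decorrelation length has not been estimated.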

Challenges in Benchmarking and Inter-Comparison

While initiatives like the VALUE framework [107] and the CORDEX ML Task Force [62] aim to foster standardized evaluation, robust benchmarking and inter-comparison in ML-based climate downscaling remain fraught with challenges [37]. A primary hurdle is the lack of universally accepted, standardized benchmark datasets that comprehensively cover diverse climate regimes, a wide array of variables, and various downscaling tasks, particularly for complex phenomena like extreme events or compound events [151]. This scarcity makes it difficult to perform equitable comparisons of model performance across different studies. Compounding this issue is the wide variability in evaluation metrics employed across the literature. While some metrics are common (e.g., RMSE), the lack of consensus on a comprehensive suite of metrics that captures different aspects of performance (e.g., spatial structure, temporal coherence, extreme value statistics, physical consistency) hinders direct and meaningful comparisons of different ML architectures and approaches. Furthermore, it is often difficult to attribute performance differences solely to the ML architecture versus other crucial choices in the downscaling pipeline, such as predictor selection, the intricacies of data preprocessing, bias correction techniques, or specific hyperparameter tuning strategies. The performance of an ML model is an emergent property of this entire chain, making isolated architectural comparisons challenging without strict experimental controls. Finally, the computational burden associated with running comprehensive benchmarks across multiple models, various datasets, and extended simulation periods can be substantial. Training and evaluating numerous complex deep learning models, especially generative or foundation models, requires significant computational resources, which may not be available to all research groups, potentially limiting participation in large-scale inter-comparison efforts.
These challenges underscore the continued need for community-driven efforts towards developing accessible, comprehensive benchmark datasets and standardized, multi-faceted evaluation protocols to foster more transparent and rigorous assessment of ML downscaling methods, as called for in Section 10.7.

Table 4. Decision guide for selecting validation strategies in ML-based climate downscaling.

Downscaling Configuration/Claim	Main Risk	Recommended Validation Design
Station-based or sparse-site PP downscaling (generalize to unseen sites)	Spatial leakage across nearby stations	Leave-Location-Out (LLO): hold out entire stations/regions; report site-wise skill distributions and worst-case sites [147].
Gridded-field downscaling within same climate regime (new areas inside domain)	Nearby pixels leak into test folds	Spatial blocked k-fold: hold out contiguous blocks; prefer fewer, larger blocks. If dependence is strong, add buffered CV [148].
Need strict spatial independence (high autocorrelation; sharp gradients like coast/orography)	Inflated skill from proximity	Buffered spatial CV: exclude training data within $d_{buffer}$ of test blocks; choose $d_{buffer}$ from estimated decorrelation length(s), and report sensitivity to buffer choice [147].
Historical → future projection (non-stationarity claim)	Train/test mismatch over time; concept drift	Temporal blocking/forward-chaining: train on earlier period(s), test on later period(s). For projection claims, include a warm test as a minimum requirement, and when feasible include scenario-stress tests such as PGW [147,150].
Cross-GCM deployment (apply to a different driving GCM)	Cross-model distribution shift	Leave-One-GCM-Out (LOGO): train on multiple GCMs, hold out one GCM for testing; report cross-GCM variance and failure cases (recommendation aligns with domain-shift evaluation principles) [147].
Future + cross-GCM (strongest generalization claim)	Compound OOD shift	Combine LOGO with out-of-time splits (and PGW/scenario stress tests where feasible): “unseen GCM” and “unseen future period” simultaneously; treat this as the most conservative validation design.
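
The split designs in Table 4 can be made concrete with a small sketch. The example below is illustrative only: it assumes each sample is tagged with a driving-GCM name and a year, and that locations are planar coordinates so that Euclidean distance is meaningful (real grids would need great-circle distance). The function names, the sample layout, and the toy values are hypothetical and do not come from any cited study.

```python
from math import hypot


def logo_out_of_time_splits(samples, split_year):
    """Yield (held_out_gcm, train_idx, test_idx) for LOGO combined with an out-of-time split.

    samples: list of (gcm_name, year) pairs, one per sample (hypothetical layout).
    """
    for held_out in sorted({gcm for gcm, _ in samples}):
        # Train only on the other GCMs AND on years before the split.
        train = [i for i, (g, y) in enumerate(samples) if g != held_out and y < split_year]
        # Test only on the unseen GCM AND the unseen later period.
        test = [i for i, (g, y) in enumerate(samples) if g == held_out and y >= split_year]
        yield held_out, train, test


def buffered_train_indices(coords, test_idx, d_buffer):
    """Return indices of non-test points farther than d_buffer from every test point."""
    test_set = set(test_idx)
    test_pts = [coords[i] for i in test_idx]
    return [
        i for i, (x, y) in enumerate(coords)
        if i not in test_set
        and all(hypot(x - tx, y - ty) > d_buffer for tx, ty in test_pts)
    ]


# Toy example with placeholder GCM names and years.
samples = [("gcm_a", 1990), ("gcm_a", 2050), ("gcm_b", 1990),
           ("gcm_b", 2050), ("gcm_c", 1990), ("gcm_c", 2050)]
splits = list(logo_out_of_time_splits(samples, split_year=2020))

# Toy planar coordinates; point 1 falls inside the buffer around test point 0.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
train_idx = buffered_train_indices(coords, test_idx=[0], d_buffer=1.5)
```

In the combined split, samples from the held-out GCM's historical period and from other GCMs' future period are used in neither fold, which is what makes this design conservative. Repeating the buffered split for several values of `d_buffer` gives the sensitivity report that Table 4 asks for.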

Addressing these methodological aspects is vital for building a robust and reliable evidence base for the utility of ML in climate downscaling. The collective impact of these challenges—transferability, physical consistency, interpretability, extreme event representation, UQ, and methodological rigor—points to a field that, while having made enormous strides in leveraging ML for complex pattern recognition, still requires significant development to mature into a fully trusted operational tool for climate change impact assessment. The “performance paradox” noted in this work—excellent in-sample results but often poor extrapolation—is a direct consequence of these intertwined challenges.

10. Future Trajectories: Grand Challenges and Open Questions

To move beyond the “performance paradox” and resolve the “trust deficit”, the ML downscaling community must shift its focus from incremental improvements on in-distribution benchmarks to tackling fundamental scientific and technical challenges. Drawing inspiration from community-wide initiatives like the WCRP Grand Challenges [152], we identify four interconnected grand challenges that will define the next decade of research. Addressing these is critical for developing ML downscaling into a robust, trustworthy tool for climate science and adaptation planning.

Conceptual boundaries and interdependencies. Although the grand challenges below are highly interdependent, they are not synonymous; each targets a distinct failure mode in the end-to-end downscaling problem. (i) Non-stationarity refers to the temporal drift in the joint distribution $p(x, y)$ induced by climate change and evolving observing/modeling systems, which can invalidate relationships learned from historical data when extrapolating to future periods; (ii) Transferability (generalization) refers to robustness across domain axes such as driving GCMs, scenarios, reanalysis vs. GCM inputs, and geographic regions. Transfer failures can occur even without strong temporal drift (e.g., cross-GCM generalization) and therefore should be treated as a broader umbrella than time-only non-stationarity; (iii) Causal/mechanism-aware modeling is a strategy (not a separate deployment axis) that aims to learn stable, physically meaningful relationships and invariants. Mechanism-aware models are expected to improve both non-stationarity robustness and cross-domain transfer, but they do so by changing what the model learns (structure and inductive bias), not by altering the evaluation scenario itself. Table 5 summarizes these conceptual boundaries to clarify what each challenge uniquely contributes to the research agenda.

With these distinctions, the remainder of Section 10 is organized to (a) identify which deployment axis is stressed (time, domain, region), and (b) highlight which modeling strategies (physics constraints, causal structure, domain adaptation, foundation models, UQ, and benchmarking) are likely to mitigate multiple failure modes simultaneously.

10.1. Grand Challenge 1: Overcoming Non-Stationarity

The Challenge: The core scientific challenge is developing models that can reliably generalize to future, out-of-distribution climate states. As established, purely statistical models trained on historical data often fail under the covariate and concept drift induced by climate change.

Promising Approaches:

Causal/Mechanism-Aware ML: Learning physical/causal structure rather than surface correlations—for example, physics-informed or analytically constrained neural networks that enforce governing laws and invariants [26,67].
Foundation Models: Large, pretrained backbones learned from massive, diverse earth-system data (e.g., multiple GCMs or reanalyses) that provide broad, reusable priors; usable zero-/few-shot or with fine-tuning [16].
Domain Adaptation and Transfer Learning: Methods to adapt models from a source distribution to a target distribution (e.g., historical → future, reanalysis → GCM, region A → B), including fine-tuning foundation models or smaller models and explicit shift-handling techniques [17].
Rigorous OOD Testing: Systematically using Pseudo-Global Warming (PGW) experiments and holding out entire GCMs or future time periods for validation to stress-test and quantify extrapolation capabilities [19].

Open Research Questions:

How can we formally verify that a model has learned a causal physical process rather than a spurious shortcut?
What are the theoretical limits of generalization for a given model architecture and training data diversity?
Can online learning systems be developed to allow models to adapt continuously as the climate evolves, mitigating concept drift in near-real-time applications?

10.2. Grand Challenge 2: Achieving Verifiable Physical Consistency

The Challenge: Ensuring that ML-downscaled outputs are not just statistically plausible but rigorously adhere to the fundamental laws of physics (e.g., conservation of mass, energy, momentum). This is a prerequisite for scientific credibility and reliable coupling with downstream impact models.

Promising Approaches:

Physics-Informed Machine Learning (PIML): The systematic integration of physical constraints, either as soft constraints in the loss function [25] or as hard constraints embedded in the model architecture [27], is the most direct path.
Hybrid Dynamical-Statistical Models: Frameworks like dynamical–generative downscaling leverage a physical model to provide a consistent foundation, which an ML model then refines. This approach strategically outsources the enforcement of complex physics to a trusted dynamical core [9].
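
To illustrate the difference between soft and hard constraints, the sketch below uses one simple conservation rule: the mean of the fine-resolution pixels inside a coarse cell should equal the coarse-cell value. This is an assumption chosen for illustration, not the specific formulation of [25] or [27]. The soft version adds a weighted penalty to an MSE loss; the hard version rescales the prediction multiplicatively so that the rule holds exactly, which presumes a non-negative field such as precipitation. All function names are hypothetical.

```python
def conservation_penalty(fine_patch, coarse_value):
    """Squared mismatch between the fine-patch mean and the coarse-cell value."""
    mean_fine = sum(fine_patch) / len(fine_patch)
    return (mean_fine - coarse_value) ** 2


def soft_constrained_loss(pred_patch, target_patch, coarse_value, weight=1.0):
    """Soft constraint: MSE plus a weighted conservation penalty (violations are discouraged, not forbidden)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred_patch, target_patch)) / len(pred_patch)
    return mse + weight * conservation_penalty(pred_patch, coarse_value)


def rescale_to_conserve(fine_patch, coarse_value, eps=1e-12):
    """Hard constraint: rescale so the patch mean matches the coarse value.

    Assumes a non-negative field; an all-zero patch falls back to a uniform fill.
    """
    mean_fine = sum(fine_patch) / len(fine_patch)
    if mean_fine < eps:
        return [coarse_value] * len(fine_patch)
    scale = coarse_value / mean_fine
    return [v * scale for v in fine_patch]


# Toy 2x2 patch flattened to a list; the coarse cell says the mean should be 4.0.
patch = [1.0, 3.0, 2.0, 2.0]
penalty_before = conservation_penalty(patch, 4.0)   # (2.0 - 4.0)**2
constrained = rescale_to_conserve(patch, 4.0)       # spatial pattern kept, mean fixed
penalty_after = conservation_penalty(constrained, 4.0)
```

The rescaling step preserves the relative spatial pattern of the prediction while removing the budget violation, whereas the soft penalty only reduces violations to the extent that `weight` outweighs the data-fit term.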

Open Research Questions:

How can we design computationally tractable physical loss terms for complex, non-differentiable processes like cloud microphysics or radiative transfer?
What is the optimal trade-off between the flexibility of soft constraints and the guarantees of hard constraints for multi-variable downscaling?
Can we develop methods to automatically discover relevant physical constraints from data, rather than relying solely on pre-defined equations?

Interdependencies: Physical consistency is not only a standalone objective; it is also an enabler for other grand challenges. Enforcing conservation and plausibility constraints can improve the credibility of uncertainty estimates by reducing unphysical degrees of freedom (and thereby narrowing epistemic uncertainty to scientifically meaningful modes). It also supports extreme-event fidelity by preventing pathological tails (e.g., implausible intensity spikes or event structures) that can arise when models optimize purely statistical losses. Finally, physical diagnostics provide an interpretable “sanity layer” that complements XAI: even when explanations are available, physics-based checks offer a direct way to validate whether the model’s attributions and predicted patterns remain physically coherent.

10.3. Grand Challenge 3: Reliable and Interpretable Uncertainty Quantification (UQ)

The Challenge: Moving beyond deterministic predictions to provide reliable, well-calibrated, and understandable estimates of uncertainty. This involves quantifying uncertainty from all sources (GCM, downscaling model, internal variability) and making the model’s decision-making process transparent.

Promising Approaches:

Probabilistic Generative Models: Diffusion models, in particular, are the state of the art for generating high-fidelity ensembles from which to derive probabilistic forecasts and quantify uncertainty [31,32].
Deep Ensembles and Bayesian Neural Networks (BNNs): These established techniques provide principled frameworks for estimating epistemic (model) uncertainty [69,137].
Explainable AI (XAI): Using domain-specific XAI techniques to ensure that model predictions and their associated uncertainties are based on physically meaningful precursors, thus building trust in the UQ estimates [65,129].

Open Research Questions:

How can we effectively validate UQ for far-future projections where no ground truth exists?
How can we decompose total uncertainty into its constituent sources in a computationally efficient manner for deep learning models?
How can we best communicate complex, multi-dimensional uncertainty information to non-expert stakeholders to support robust decision-making?

Interdependencies: Reliable UQ is tightly coupled to generalization and physical realism. Under distribution shift (e.g., cross-GCM, cross-region, or out-of-time deployment), uncertainty should increase in a calibrated way—otherwise the model risks being confidently wrong in precisely the scenarios where stakeholders need caution. Conversely, physically inconsistent predictions can corrupt uncertainty estimates: if a model violates known constraints, its probabilistic outputs may be well-calibrated statistically yet scientifically misleading. UQ also complements explainability: explanations without uncertainty can overstate confidence, while uncertainty without explanation provides limited actionable insight for diagnosing why and where a model fails.

10.4. Grand Challenge 4: Skillful Prediction of Climate Extremes

The Challenge: Accurately representing the statistics (frequency, intensity, duration) of high-impact, rare extreme events. Standard ML models trained with MSE-like losses often underestimate extremes because of data imbalance, a critical failure for risk assessment.

Promising Approaches:

Tailored Loss Functions: Employing loss functions designed for imbalanced data or tail behavior, such as Quantile Loss, Bernoulli-Gamma loss for precipitation, or Wasserstein-based penalties [30,33,36].
Generative Models: GANs and diffusion models learn the full conditional distribution rather than only the conditional mean, and are therefore better suited to generating realistic extremes than mean-predicting models [29,32].
Integration with Extreme Value Theory (EVT): Hybrid models that combine ML with statistical EVT offer a principled way to model the extreme tails of climate distributions [114].
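
As a concrete instance of a tail-oriented loss, the quantile (pinball) loss can be sketched in a few lines. The example is a minimal illustration, not the implementation used in any cited study, and the toy numbers are placeholders. For a high quantile level `tau`, under-prediction is penalized far more heavily than over-prediction, which counteracts the tendency of MSE-trained models to underestimate extremes.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean quantile (pinball) loss for quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # diff > 0 (under-prediction) costs tau * diff;
        # diff < 0 (over-prediction) costs (1 - tau) * |diff|.
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(y_true)


# Placeholder values: the same 5-unit error in either direction at tau = 0.95.
under = pinball_loss([10.0], [5.0], tau=0.95)   # predicted too low
over = pinball_loss([10.0], [15.0], tau=0.95)   # predicted too high
```

Here `under` is roughly 19 times larger than `over`, reflecting the ratio tau / (1 - tau) at the 0.95 level.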

Open Research Questions:

How do we ensure that generative models produce extremes that are not only statistically realistic but also physically plausible in their spatio-temporal evolution?
How will the statistics of compound extremes (e.g., concurrent heat and drought) change, and can ML models capture their evolving joint probabilities?
Can we develop models that explicitly predict changes in the parameters of EVT distributions (e.g., GPD parameters) as a function of large-scale climate drivers?

Interdependencies: Extreme-event skill is especially sensitive to the same mechanisms that drive the other challenges. In practice, capturing tails robustly depends on (i) physical constraints that prevent implausible event structures and maintain coherent water/energy budgets, and (ii) uncertainty-aware modeling that signals when extremes are extrapolations beyond training support. Moreover, extreme events are a natural stress-test for explainability: credible XAI should link predicted extremes to physically meaningful drivers (e.g., moisture transport, convection-favorable environments) rather than spurious proxies. Finally, progress on extremes is difficult to compare without shared benchmarks; as such, community benchmarking and standardized evaluation protocols are prerequisites for claiming reliable improvements in rare-event prediction.

10.5. Benchmarkable Objectives for Measuring Progress

Because this section frames grand challenges, it is helpful to define testable benchmarks that make progress operational and comparable across studies. We therefore propose a minimal evaluation framework with measurable objectives aligned to the challenges in Section 10.1, Section 10.2, Section 10.3 and Section 10.4: (i) generalization stress tests (out-of-time, cross-region, cross-GCM), (ii) physical consistency checks (explicit conservation/constraint diagnostics), (iii) probabilistic verification for uncertainty-aware models (calibration + proper scoring rules), and (iv) tail-focused extreme evaluation (quantiles and return levels, not only mean error). These measurable objectives are summarized in Table 6.

Recommended benchmark protocol (minimal standard). For any method claiming progress on future projections or climate-risk applications, we recommend the following: (1) include at least one out-of-time test and one domain-holdout test (cross-region or cross-GCM); (2) report both central and tail metrics; (3) if the method is probabilistic, report calibration + proper scores; and (4) include a concise physical-consistency scorecard. This transforms broad research directions into criteria that can be objectively validated and compared.
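
For item (3) of the protocol, a minimal sketch of one proper score and one calibration check is given below. It uses the standard ensemble estimator of the continuous ranked probability score, CRPS = E|X - y| - 0.5 E|X - X'|, and the empirical coverage of a prediction interval. The function names and toy numbers are hypothetical, and the choice of these two diagnostics is an assumption rather than a prescription from Table 6.

```python
def crps_ensemble(members, obs):
    """Ensemble estimate of CRPS for one observation: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    accuracy = sum(abs(x - obs) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2.0 * m * m)
    return accuracy - spread


def interval_coverage(lowers, uppers, obs):
    """Fraction of observations that fall inside their prediction interval."""
    hits = sum(1 for lo, hi, y in zip(lowers, uppers, obs) if lo <= y <= hi)
    return hits / len(obs)


# Placeholder values for illustration only.
score = crps_ensemble([0.0, 2.0], obs=1.0)
coverage = interval_coverage(
    lowers=[0.0, 0.0, 0.0, 0.0],
    uppers=[1.0, 1.0, 1.0, 1.0],
    obs=[0.5, 2.0, 0.1, 0.9],
)
```

With a single ensemble member the CRPS reduces to the absolute error, so deterministic and probabilistic methods can be compared on the same scale. For calibration, the empirical coverage of a nominal 90% interval should be close to 0.9 on held-out data, and it is most informative when reported separately for in-distribution and out-of-distribution test sets.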

10.6. Current State Assessment

Our comprehensive review reveals a field characterized by a “performance paradox” and a “trust deficit”. To keep the focus tight and non-redundant, we summarize the five recurring gaps succinctly below:

Performance Paradox: ML models, particularly deep learning architectures like CNNs, U-Nets, and GANs, often demonstrate excellent performance on in-sample test data or when downscaling historical reanalysis products. They excel at learning complex spatial patterns and non-linear relationships, leading to visually compelling high-resolution outputs. However, this strong in-sample performance frequently does not translate to robust extrapolation on out-of-distribution data (e.g., future climate scenarios from different GCMs or entirely new regions)—a critical limitation given that downscaling is intended to inform future projections.
Trust Deficit: The limited transparency of many deep learning models, together with historically sparse uncertainty quantification, constrains end-user confidence and practical uptake. Without clear reasoning traces and robust uncertainty estimates, the utility of ML-downscaled products for decision-making remains limited.
Physical Inconsistency: Many current ML downscaling methods do not inherently enforce fundamental physical laws (e.g., conservation of mass/energy, thermodynamic constraints). Resulting fields can be statistically plausible yet physically unrealistic, undermining scientific interpretability and downstream use.
Challenges with Extreme Events: Accurately capturing the frequency, intensity, and spatial characteristics of extremes remains difficult. Class imbalance and commonly used loss functions tend to underestimate magnitudes and misplace patterns of high-impact events; specialized targets, data curation, and evaluation for extremes are required.
Data Limitations and Methodological Gaps: The scarcity of high-quality, high-resolution reference data in many regions, together with inconsistent metrics, validation protocols, and reporting standards, impedes apples-to-apples comparison and cumulative progress. Recent work emphasizes that computational repeatability is essential for building trust and enabling rigorous comparison across methods [8].

Answers to the Research Questions

RQ1 (Evolution of Methodologies): ML downscaling has progressed from CNN/U-Net baselines to generative models—GANs and diffusion—that better represent distributions and extremes, and to transformer/foundation models that support cross-resolution/region transfer and multi-task adaptation for downscaling [15,16,17,20]. This trajectory is reflected in studies on transferability and resolution-agnostic generalization, as well as in early reproducibility baselines [15,17,18].

RQ2 (Persistent Challenges): Despite architectural progress, core gaps persist in out-of-distribution robustness/extrapolation (especially under warming), physics consistency and diagnostics, extreme fidelity, and reproducibility/benchmarking. Recent work highlights systematic OOD stress testing and extrapolation limits, and underscores the need for stronger evaluation protocols [17,18,19].

RQ3 (Emerging Solutions/Trajectories): Promising directions include standardized physics–ML interfaces and tests, principled probabilistic modeling and UQ (with diffusion as a natural vehicle), rigorous OOD protocols, and careful adaptation of foundation/transformer models to downscaling tasks (cross-grid/region transfer with validation safeguards) [15,16,17,20].

10.7. Priority Research Directions

Based on this critical assessment, several priority research directions emerge as essential for advancing the field of ML-based climate downscaling towards greater reliability, trustworthiness, and operational utility.

Robust Extrapolation and Generalization Frameworks (Addressing RQ2):
- Systematic Evaluation Protocols: Develop and adopt standardized protocols and benchmark datasets specifically designed to test model transferability across different climate states (historical, near-future, far-future), different GCMs/RCMs, and diverse geographical regions. This includes rigorous out-of-sample testing beyond simple hold-out validation.
- Metrics for Generalization: Establish and use metrics that explicitly quantify generalization and extrapolation capability, rather than relying solely on traditional skill scores computed on in-distribution test data.
- Understanding Failure Modes: Conduct systematic analyses of why and when ML models fail to extrapolate, linking failures to model architecture, training data characteristics, or violations of physical assumptions.
Physics–ML Integration and Hybrid Modeling Standards (Addressing RQ2):
- Standardized PIML Approaches: Develop and disseminate standardized methods and libraries for incorporating physical constraints (both hard and soft) into common ML architectures used for downscaling. This includes guidance on formulating physics-based loss terms and designing constraint-aware layers.
- Validation Suites for Physical Consistency: Create benchmark validation suites that explicitly test for adherence to key physical laws (e.g., conservation principles, thermodynamic consistency, realistic spatial gradients and inter-variable relationships).
- Advancing Hybrid Models: Foster research into hybrid models that effectively combine the strengths of process-based dynamical models with the efficiency and pattern-recognition capabilities of ML, including RCM emulators and generative AI approaches for refining RCM outputs [31].
Operational Uncertainty Quantification (Addressing RQ3):
- Beyond Point Estimates: Shift the focus from deterministic (single-value) predictions to probabilistic projections that provide a comprehensive assessment of uncertainty.
- Efficient UQ Methods: Develop and promote computationally efficient UQ methods suitable for high-dimensional DL models, such as scalable deep ensembles, practical Bayesian deep learning techniques (e.g., with improved variational inference or MC dropout strategies), and generative models capable of producing reliable ensembles [32,69,137].
- Decomposition and Attribution of Uncertainty: Advance methods to decompose total uncertainty into its constituent sources (e.g., GCM uncertainty, downscaling model uncertainty, internal variability) and attribute uncertainty to specific model components or assumptions.
- User-Oriented Uncertainty Communication: Develop effective tools and protocols for communicating complex uncertainty information to diverse end-users in an understandable and actionable manner.
Explainable and Interpretable Climate AI (Addressing RQ3): Why this interacts with other challenges: Explainability is not merely a usability feature; it is a diagnostic tool for transferability, physical consistency, UQ, and extremes. In particular, XAI can help detect when a model’s apparent skill is driven by spurious correlates that may fail under non-stationarity or cross-domain deployment, and it can reveal whether predicted extreme events are triggered by physically meaningful drivers. Explanations are most decision-relevant when paired with calibrated uncertainty, enabling stakeholders to interpret not only what the model relies on but also how confident it is in regimes where data support is limited.
- Domain-Specific XAI Metrics: Establish XAI metrics and methodologies that are specifically relevant to climate science, moving beyond generic XAI techniques to those that can provide physically meaningful insights.
- Linking ML Decisions to Physical Processes: Develop XAI techniques that can causally link ML model decisions and internal representations to known climate processes and drivers, rather than just highlighting input feature importance [129].
- Standards for Model Documentation and Interpretation: Promote standards for documenting ML model architectures, training procedures, and the results of interpretability analyses to enhance transparency and facilitate critical assessment by the broader scientific community [129].
Community Infrastructure and Benchmarking (Addressing all RQs): Why benchmarking must be coupled: Benchmarking is the mechanism that makes progress on the other challenges measurable and comparable. Critically, benchmark suites should jointly evaluate (i) accuracy under structured OOD tests (time/domain/region), (ii) physical-consistency diagnostics, (iii) probabilistic verification for uncertainty-aware models, and (iv) tail-/event-focused extreme metrics. Evaluating any one dimension in isolation can hide trade-offs (e.g., improved mean RMSE with degraded extremes, or sharper predictions with miscalibrated uncertainty). Therefore, shared benchmarks should report a compact “scorecard” across these linked dimensions to reflect the true operational readiness of a downscaling method.
- Shared Evaluation Frameworks: Expand and support the community-driven evaluation frameworks (e.g., extending the VALUE initiative [107]) to facilitate systematic intercomparison of ML downscaling methods using standardized datasets and metrics.
- Reproducible Benchmark Datasets: Curate and maintain open, high-quality benchmark datasets specifically designed for training and evaluating ML downscaling models across various regions, variables, and climate conditions. These should include data for testing transferability and extreme event representation.
- Open-Source Implementations: Encourage and support the development and dissemination of open-source software implementations of key ML downscaling methods and PIML components to lower the barrier to entry and promote reproducibility.
- Collaborative Platforms: Foster collaborative platforms and initiatives (e.g., CORDEX Task Forces on ML [62]) for sharing knowledge, best practices, model components, and downscaled datasets.

Addressing these research priorities requires a concerted, interdisciplinary effort involving climate scientists, ML researchers, statisticians, and computational scientists. The focus must shift from solely optimizing statistical performance metrics towards developing ML downscaling methods that are robust, physically consistent, interpretable, and uncertainty-aware, thereby building the trust necessary for their widespread adoption in climate change impact assessment and adaptation planning.

11. Ethical Considerations, Responsible Development, and Governance in ML-Based Climate Downscaling

As ML-based climate downscaling matures and increasingly informs climate services, infrastructure planning, public risk communication, and adaptation policy, ethics cannot be treated as an add-on: here, technical failure modes translate directly into societal risks. Distribution shift and non-stationarity can produce spatially structured errors that propagate into maladaptation; poor tail behavior can distort estimates of high-impact events; and uncalibrated uncertainty can create overconfidence precisely where decision-makers most need caution (Section 9.1, Section 9.3 and Section 9.4). Moreover, unequal observational coverage and institutional capacity can advantage data-rich regions while degrading performance in the most vulnerable contexts (Section 6.2), turning technical limitations into equity harms.

11.1. Ethical Foundations for ML Downscaling: From Technical Failure Modes to Normative Obligations

We organize these concerns as a risk chain: data and capacity constraints → model design and validation choices → downscaled products → decisions → real-world impacts. This clarifies why ethics arise specifically in ML downscaling: scientific validity is fragile under domain shift and non-stationarity; users often treat downscaled fields as authoritative inputs to high-stakes planning; and data/expertise are distributed inequitably across regions and sectors. Drawing on trustworthy-AI principles and risk-management guidance [155,156] and climate-services ethics emphasizing integrity, transparency, humility, and collaboration [157], we frame responsible practice around four coupled obligations:

Reliability under shift (do no harm through invalid extrapolation): Anticipate non-stationarity and explicitly stress-test transferability (Section 9.1).
Transparency and contestability (make limitations legible): Document data provenance, modeling assumptions, and known failure regimes; pair explanations with uncertainty to avoid false confidence (Section 9.4 and Section 9.5).
Equity and inclusion (avoid systematically disadvantaging the vulnerable): Treat uneven observations and capacity as a first-order risk, not a nuisance variable (Section 6.2).
Accountability and governance (assign responsibility across the pipeline): Adopt auditable validation and monitoring standards appropriate to climate-service deployment [156,157]. This responsibility chain, outlining duties from GCM developers to end-user communities, is illustrated in Figure 9.

Uncertainty communication is therefore a normative requirement in climate services: it should be presented in calibrated, decision-relevant ways consistent with established climate-assessment guidance [158].

The remainder of Section 11 operationalizes this framework (summarized in Table 7) across fairness, transparency/accountability, accessibility, uncertainty communication, data governance, and best-practice governance workflows, while explicitly treating technical limitations (shift, extremes, and UQ) as the proximate causes of ethical risk in climate-service settings.

Figure 9. The responsibility chain in ML-based downscaling. This framework outlines the ethical and governance duties for each stakeholder group, from the initial GCM data provision to the final community-level impact. Fulfilling these responsibilities is critical for building trust and ensuring the equitable and effective use of downscaled climate information. Key references by stakeholder group: GCM developers—CORDEX [159], Sørland et al. [160], Diez-Sierra et al. [161]; ML modelers—Gutiérrez et al. [4], Hernanz et al. [120], Harder et al. [27]; Distribution/portals—Maraun et al. [107], Boateng and Mutz [71]; End-users/policymakers—Hawkins and Sutton [162], Hawkins and Sutton [163], Maraun et al. [3]; Communities—Bhardwaj [164], Jacob et al. [165].

11.2. Algorithmic Bias, Fairness, and Equity

11.2.1. Auditable Fairness Metrics for Climate-Relevant Strata

To move from conceptual fairness to auditable practice, we recommend reporting fairness diagnostics as stratified performance across climate-relevant groups (rather than demographic groups), e.g., coastal vs. inland, orographic vs. lowland, wet vs. dry seasons, and data-rich vs. data-sparse regions. At minimum, downscaling studies should report the following: (i) group-wise MAE/RMSE and mean bias, (ii) worst-group performance, and (iii) disparity statistics (e.g., max gap across groups) [166]. For extremes, the same stratification should be applied to threshold-based scores (e.g., CSI/FSS at P95/P99) so that “fairness” is explicitly tied to high-impact outcomes (flood-risk relevant tails) rather than only mean behavior.
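
The three minimum diagnostics above can be computed with a short stratified report. The sketch below is illustrative: the stratum labels, function names, and numbers are placeholders, and the maximum gap is only one possible disparity statistic.

```python
from collections import defaultdict
from math import sqrt


def stratified_report(y_true, y_pred, groups):
    """Per-group sample count, MAE, RMSE, and mean bias (prediction minus truth)."""
    errors = defaultdict(list)
    for t, p, g in zip(y_true, y_pred, groups):
        errors[g].append(p - t)
    report = {}
    for g, e in errors.items():
        n = len(e)
        report[g] = {
            "n": n,
            "mae": sum(abs(v) for v in e) / n,
            "rmse": sqrt(sum(v * v for v in e) / n),
            "bias": sum(e) / n,
        }
    return report


def disparity(report, metric="rmse"):
    """Return (worst_group, worst_value, max_gap) for an error metric where larger is worse."""
    values = {g: r[metric] for g, r in report.items()}
    worst = max(values, key=values.get)
    return worst, values[worst], values[worst] - min(values.values())


# Placeholder data: two climate-relevant strata with two samples each.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 3.0, 3.0, 8.0]
groups = ["coastal", "coastal", "inland", "inland"]

report = stratified_report(y_true, y_pred, groups)
worst_group, worst_mae, mae_gap = disparity(report, metric="mae")
```

Reporting the sample count `n` alongside each stratum matters because a small, data-sparse group can show a large gap simply from sampling noise.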

A significant ethical concern is the potential for biases present in training data to be learned and amplified by ML models [167]. Observational networks may have uneven spatial or temporal coverage, historical data can reflect past inequities, and GCMs themselves possess inherent biases. If not carefully addressed, ML models can perpetuate these biases, leading to downscaled projections that disproportionately affect vulnerable communities or regions [168]. This potential for algorithmic bias is not merely an abstract concern but a direct consequence of the technical issues discussed earlier. The underrepresentation of extreme events in training data (Section 9.3) and the geographical biases in high-quality observational networks (Section 6.5) can lead to models that perform poorly for the most vulnerable regions and conditions, resulting in an inequitable distribution of risk and a misallocation of adaptation resources. Fairness considerations also arise if models perform differently across diverse geographical areas or demographic groups because of data imbalances or the learning of spurious correlations that do not generalize equitably [169].

11.2.2. Why This Matters for Downscaling (Tied to Prior Sections)

Biases in training data and model design manifest most acutely when transferring across regions and regimes with sparse or heterogeneous observations. For ML downscaling, this links directly to data sparsity and pre-processing (Section 6.5), transferability and domain shift (Section 9.1), and the representation of extremes (Section 9.3). Concretely, uneven gauge/satellite coverage and quality control gaps can skew learned relationships, producing underestimation of heavy tails in underserved regions and overconfident outputs (see also UQ, Section 9.4).

11.2.3. Practitioner Checklist

Report data coverage maps and per-region sample counts used in training and validation (ties to Section 6.5); stratify metrics by data-rich vs. data-scarce subregions.
Use shift-robust training/evaluation (Section 9.1): e.g., held-out regions, time-split OOD tests, and stress tests on atypical synoptic regimes.
Track extreme-aware metrics (Section 9.3): CSI/POD at multi-thresholds, tail-MAE, quantile errors, and bias of return levels.
Quantify epistemic uncertainty (Section 9.4) and suppress overconfident deployment in regions with low training support; communicate abstentions or wide intervals as a feature, not a bug.
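
The threshold-based scores named in the checklist follow from a contingency table of hits, misses, and false alarms. A minimal sketch is shown below, assuming paired observed and predicted values at the same locations and times; the function name, threshold, and data are placeholders.

```python
def contingency_scores(obs, pred, threshold):
    """POD and CSI for exceedance of a threshold, from hits, misses, and false alarms."""
    hits = misses = false_alarms = 0
    for o, p in zip(obs, pred):
        observed, predicted = o >= threshold, p >= threshold
        if observed and predicted:
            hits += 1
        elif observed:
            misses += 1
        elif predicted:
            false_alarms += 1
    nan = float("nan")
    # POD: share of observed events that were predicted.
    pod = hits / (hits + misses) if (hits + misses) else nan
    # CSI: hits relative to all cases where an event was observed or predicted.
    denom = hits + misses + false_alarms
    csi = hits / denom if denom else nan
    return {"hits": hits, "misses": misses, "false_alarms": false_alarms,
            "pod": pod, "csi": csi}


# Placeholder daily precipitation values (mm) and an illustrative threshold.
obs = [0.0, 5.0, 12.0, 30.0, 50.0]
pred = [1.0, 11.0, 8.0, 25.0, 60.0]
scores = contingency_scores(obs, pred, threshold=10.0)
```

Calling the function over several thresholds (for example, the P95 and P99 values of the observed record) and within each data-rich or data-scarce subregion combines the extreme-aware and stratified items of the checklist.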

11.2.4. Mini-Case (Sparse Data → Biased Extremes → Policy Risk)

Under sparse or uneven observations, ML downscalers can underestimate heavy tails and generalize poorly across products and ESMs [1,17,120]. For end-of-century extreme rainfall, deterministic baselines can miss warming-driven increases, whereas GANs recover a much larger fraction [19]. Governance therefore prioritizes (a) extreme-aware metrics and region-stratified reporting (Section 9.3); (b) shift-robust validation (Section 9.1); and (c) uncertainty and abstention policies (Section 9.4) to avoid misallocation of adaptation resources.

11.3. Transparency, Accountability, and Liability

11.3.1. Minimum Transparency Artifacts (Auditable)

For operational relevance, transparency must be tied to concrete release artifacts. We recommend that ML downscaling studies adopt documentation practices analogous to Model Cards and Datasheets [170,171], adapted to the climate domain. Concretely, a release should include the following: (i) a short “model card” specifying intended use, non-intended use, training data provenance, and benchmark results under both in-domain and out-of-domain tests; (ii) a “datasheet” summarizing observational coverage, known gaps/biases, preprocessing, and licensing/usage constraints; and (iii) a changelog describing updates (data revisions, retraining, hyperparameter changes) so that downstream decisions can trace which version produced which product.

While XAI techniques (discussed in Section 9.2.2) aim to improve transparency in model workings, broader accountability structures are essential. A key question is who bears responsibility if flawed ML-downscaled projections contribute to maladaptation, negative socioeconomic outcomes, or environmental damage. Establishing clear lines of accountability within the complex chain from GCM development through ML downscaling to end-user application is a significant challenge. This involves not only technical transparency but also governance mechanisms that define roles, responsibilities, and potential liabilities [172].

11.3.2. Make Transparency Operational

Link transparency to the evaluation protocol (Section 7) and model choice rationale (Section 8.1). Provide a minimal “model card for downscaling” that includes training domains, variable lists, pre-processing steps, physics constraints (if any), and the full metric suite from Section 7, stratified by regime and region. Use XAI only insofar as it illuminates failure modes relevant to transfer and extremes (Section 9.2.2).
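As a minimal sketch of what a machine-readable "model card for downscaling" could look like, the Python below encodes the fields named above in a dataclass and serializes them to JSON. The class name, field names, example values, and the `problems` check are all assumptions made for illustration; they do not correspond to an established schema, and the metric numbers are placeholders rather than results.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

# Fields that may legitimately be empty ("physics constraints (if any)").
OPTIONAL_FIELDS = {"physics_constraints"}


@dataclass
class DownscalingModelCard:
    model_version: str
    intended_use: str
    non_intended_use: str
    training_domains: List[str]
    variables: List[str]
    preprocessing: List[str]
    physics_constraints: List[str] = field(default_factory=list)
    # metrics[test_type][stratum][metric_name] -> value
    metrics: Dict[str, Dict[str, Dict[str, float]]] = field(default_factory=dict)
    changelog: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List release blockers: empty required fields or a missing test type."""
        issues = [f"empty field: {name}" for name, value in asdict(self).items()
                  if not value and name not in OPTIONAL_FIELDS]
        for test_type in ("in_domain", "out_of_domain"):
            if test_type not in self.metrics:
                issues.append(f"missing metrics: {test_type}")
        return issues


card = DownscalingModelCard(
    model_version="0.1.0",
    intended_use="Daily precipitation downscaling for regional screening studies",
    non_intended_use="Site-level engineering design without local validation",
    training_domains=["region_A", "region_B"],
    variables=["pr"],
    preprocessing=["regrid to common grid", "log1p transform"],
    # Placeholder numbers for layout only; they are not results.
    metrics={"in_domain": {"region_A": {"rmse": 1.0, "csi_20mm": 0.5}}},
    changelog=["0.1.0: initial release"],
)

print(card.problems())  # the out-of-domain metrics block is still missing
print(json.dumps(asdict(card), indent=2))
```

Storing the card as structured data rather than free text lets a release pipeline refuse to publish a product whose out-of-domain metrics are absent, which ties the transparency artifact directly to the evaluation protocol.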

11.4. Accessibility, Inclusivity, and the Digital Divide

Capacity for climate services remains uneven globally. The WMO’s State of Climate Services 2024 documents persistent capability gaps across National Meteorological and Hydrological Services (NMHSs), with only about one third delivering climate services at an “essential” level and roughly another third at “advanced” or “full” levels, highlighting regional disparities in access and delivery [173].

The substantial computational costs (see Section 4, Section 5.4.1 and Section 11.2), extensive data requirements, and specialized expertise that ML downscaling demands can limit participation from researchers, institutions, and communities in less-resourced regions, particularly in the Global South [164]. This can lead to a concentration of development capacity in wealthier nations and institutions, potentially tailoring solutions to their specific contexts and data availability, rather than addressing global needs equitably [164]. Promoting equitable access to tools, open datasets, computational resources, and capacity-building initiatives is crucial for ensuring global participation and benefit from these technologies [165].

11.4.1. Tie to Data-Scarce Regions and Model Bias

The “digital divide” here is not generic: it maps to concrete risks identified earlier, namely training set sparsity (Section 6.5) and domain shift (Section 9.1). Where observational density is low, models tend to underestimate intensity and misplace extremes (Section 9.3).

11.4.2. Actionable Steps

Publish downscaling baselines and trained weights under permissive licenses; provide lightweight inference paths for low-resource agencies.
Release region-stratified diagnostics and data coverage artifacts so local stakeholders can judge fitness-for-use.
Prioritize augmentation/sampling schemes that upweight underrepresented regimes and seasons, with ablation evidence (links to Section 7). Any augmentation must be limited to transformations under which the underlying climate fields are genuinely invariant.
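One simple way to upweight underrepresented regimes is inverse-frequency weighting, sketched below. This is an illustrative choice, not a scheme prescribed by the reviewed studies; the regime labels, the function name `balanced_weights`, and the batch size are hypothetical.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def balanced_weights(labels: Sequence[Hashable]) -> List[float]:
    """Inverse-frequency weights: every regime receives the same total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[label]) for label in labels]


# Hypothetical regime labels for ten training days.
labels = ["dry"] * 6 + ["wet"] * 3 + ["extreme"]
weights = balanced_weights(labels)

# Weighted resampling of training indices (seeded for reproducibility).
rng = random.Random(0)
batch = rng.choices(range(len(labels)), weights=weights, k=8)

for label, weight in sorted(set(zip(labels, weights))):
    print(f"{label}: weight per sample = {weight:.3f}")
```

The weights sum to the number of samples, so the effective dataset size is unchanged while the single "extreme" day carries as much total weight as the six "dry" days. Any such scheme should be accompanied by the ablation evidence called for above, since aggressive upweighting of a handful of samples can increase variance.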

11.5. Misinterpretation, Misuse, and Communication of Uncertainty

11.5.1. Quantitative Uncertainty Requirements (Calibration and Skill)

To reduce misuse, uncertainty quantification (UQ) should be reported with numbers that can be audited, not only qualitative narratives. At minimum, probabilistic downscaling should report the following: (i) calibration/coverage (e.g., empirical coverage of nominal 50%/90% intervals), (ii) a proper scoring rule such as CRPS [87], and (iii) tail-focused reliability (e.g., calibration conditioned on exceedances). For public-facing climate services, uncertainty communication should follow calibrated language and traceable assumptions consistent with IPCC-style guidance [174].
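The first two reporting items can be computed directly from an ensemble of downscaled realizations. The sketch below is a minimal illustration under stated assumptions: the ensembles and observations are made-up numbers, the quantile function uses simple linear interpolation, and the CRPS is the standard ensemble estimator (mean absolute error of the members minus half the mean absolute difference between members), not necessarily the variant used in the cited work.

```python
from typing import List, Sequence


def quantile(sorted_vals: Sequence[float], q: float) -> float:
    """Linearly interpolated quantile of an already sorted sample."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def empirical_coverage(ensembles: Sequence[Sequence[float]],
                       observations: Sequence[float], nominal: float) -> float:
    """Fraction of observations inside the central `nominal` ensemble interval."""
    alpha = (1 - nominal) / 2
    inside = 0
    for members, y in zip(ensembles, observations):
        s = sorted(members)
        if quantile(s, alpha) <= y <= quantile(s, 1 - alpha):
            inside += 1
    return inside / len(observations)


def crps_ensemble(members: Sequence[float], y: float) -> float:
    """Ensemble CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    term1 = sum(abs(x - y) for x in members) / m
    term2 = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term1 - term2


# Hypothetical five-member ensembles for four cases, with matching observations.
ensembles: List[List[float]] = [[0.0, 1.0, 2.0, 3.0, 4.0]] * 4
observations = [2.0, 3.5, 0.5, 1.0]

for nominal in (0.5, 0.9):
    cov = empirical_coverage(ensembles, observations, nominal)
    print(f"nominal {nominal:.0%} interval: empirical coverage {cov:.0%}")

mean_crps = sum(crps_ensemble(e, y) for e, y in zip(ensembles, observations)) / 4
print(f"mean CRPS = {mean_crps:.3f}")
```

A well-calibrated system shows empirical coverage close to the nominal level; large gaps in either direction indicate over- or under-dispersion. For a single-member ensemble the CRPS estimator reduces to the absolute error, which is why it can be compared against deterministic baselines.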

11.5.2. Communicating Uncertainty for Decisions

Given the low signal-to-noise ratio of regional precipitation projections over decadal horizons [163], we require calibrated predictive distributions and abstention rules (Section 9.4). Use deep ensembles or calibrated likelihoods to report coverage and reliability stratified by regime/region [69,89], and explicitly flag data-sparse areas where epistemic uncertainty is high (Section 6.5 and Section 9.1).
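An abstention rule of the kind referred to above can be sketched as follows. The function name, the thresholds (`min_support`, `max_rel_spread`), and the `floor` used to avoid dividing by near-zero means are all hypothetical; in practice they would be set by the decision stakeholders and validated against held-out data.

```python
import statistics
from typing import Dict, Sequence, Union


def release_decision(members: Sequence[float], n_train: int,
                     min_support: int = 30, max_rel_spread: float = 0.5,
                     floor: float = 1.0) -> Dict[str, Union[str, float]]:
    """Report the ensemble summary, or abstain when support or agreement is low."""
    mean = statistics.fmean(members)
    spread = statistics.pstdev(members)
    # Relative spread, with a floor so that dry cases do not divide by ~0.
    rel_spread = spread / max(abs(mean), floor)
    if n_train < min_support:
        return {"status": "abstain", "reason": "low training support"}
    if rel_spread > max_rel_spread:
        return {"status": "abstain", "reason": "high ensemble spread"}
    return {"status": "report", "mean": mean, "spread": spread}


# Hypothetical ensemble outputs (mm/day) and training sample counts per cell.
print(release_decision([10.0, 11.0, 9.0, 10.0], n_train=100))
print(release_decision([10.0, 11.0, 9.0, 10.0], n_train=5))
print(release_decision([1.0, 20.0, 5.0, 40.0], n_train=100))
```

Returning an explicit reason alongside the abstention makes the behavior auditable and lets a climate service communicate why no value is shown for a given cell.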

There is a considerable risk that high-resolution ML-downscaled outputs might be perceived by non-expert users as overly precise or certain, especially if Uncertainty Quantification (UQ, see Section 9.4) is inadequate or if its communication is poor. This can lead to the misinterpretation of projections and their potential misuse in decision-making processes where inherent uncertainties are not fully appreciated or integrated. Effective communication strategies for conveying the nuances of uncertainty, model limitations, and appropriate use cases to diverse stakeholders are vital to prevent maladaptation and ensure responsible application. The spread of misinformation or disinformation based on misconstrued climate data also poses a threat [175].

11.6. Data Governance, Privacy, and Ownership

The vast datasets used for training foundation models and other large-scale ML applications in climate downscaling necessitate robust data governance frameworks. Although many climate datasets are open, questions around the ownership, stewardship, and accessibility of derived data products, model outputs, and the underlying data commons are emerging. Ensuring data quality, integrity, interoperability, and security is critical [169]. Direct privacy concerns might seem less prominent than in fields like healthcare, but they could arise in highly localized human impact studies or when integrating socio-economic data [172]. Ethical data handling, clear licensing, and transparent data management practices are essential.

Downscaling-Specific Governance

Document data licenses and redistribution constraints alongside the evaluation artifacts (Section 7). Where privacy or licensing prevents data release, release surrogate evaluation kits: masked test loaders, synthetic but structurally matched benchmarks, or server-side evaluation that still produces the full metric protocol including extremes and UQ summaries.

11.7. The Need for Governance Frameworks and Best Practices

11.7.1. An Auditable Governance Workflow for ML Downscaling

We recommend treating operational ML downscaling as a socio-technical system governed across the full lifecycle, aligning with established AI risk frameworks and management-system standards [156,176,177]. In climate-service settings (often coordinated via National Frameworks for Climate Services), governance must define explicit roles for data stewardship, model development, independent evaluation, and decision support [178,179].

11.7.2. Suggested Roles (Minimum)

Data Steward: Documents observational provenance/coverage and approves preprocessing and updates.
Model Developer: Trains the model and produces reproducible artifacts (code, configs, seeds, environment).
Independent Evaluator/Auditor: Runs prespecified benchmarks (including OOD tests) and signs off on release.
Service Operator: Deploys, monitors drift, and manages incident response and user communication.
Decision Stakeholders: Define acceptable risk thresholds and fitness-for-purpose criteria.

These roles and their associated auditable minimum standards are detailed in Table 8.

We note that public–private delivery chains for weather/climate services increasingly emphasize explicit codes of conduct and accountability; governance for ML downscaling should be consistent with such principles [180].

Addressing the multifaceted ethical challenges outlined above calls for the proactive development and adoption of community-driven ethical guidelines, codes of conduct, and overarching governance frameworks specifically tailored to the responsible development and application of ML in climate downscaling and broader climate services [172]. Such frameworks should promote fairness, accountability, transparency, and sustainability throughout the AI lifecycle. This includes establishing clear protocols for model validation, bias detection and mitigation, uncertainty communication, stakeholder engagement, and assessing the full lifecycle impacts of AI systems, including their environmental footprint [181].

11.7.3. A Minimal, Testable Governance Bundle (Linked to Prior Sections)

In line with the risks evidenced by transfer/extrapolation limits [7,17,120] and extremes [1,19], we recommend the following: (i) Model card for downscaling: training domains, coverage maps, variables, pre-processing, physics constraints, and exact metric suite (Section 7). (ii) Shift-robust validation: held-out regions, product/variable shifts, and future-like pseudo-reality tests (Section 6.5 and Section 9.1). (iii) Extreme-first reporting: CSI/POD at multiple thresholds, tail-MAE/quantile errors, and bias of return levels (Section 9.3). (iv) Uncertainty and abstention: deep ensembles or calibrated distributions, region-/regime-stratified reliability and explicit abstention where training support is low (Section 9.4). (v) Open diagnostics: release region-stratified metrics and coverage artifacts (or privacy-preserving surrogate kits) to enable local fitness-for-use (Section 7).
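The shift-robust validation in item (ii) amounts to choosing splits that mimic the intended transfer. The sketch below shows two such splits, leave-one-region-out and a time split, on a toy sample list. The record layout (`region`, `year`), the region names, and the split year are assumptions for illustration only.

```python
from typing import Dict, Iterator, List, Tuple

Sample = Dict[str, object]


def leave_one_region_out(
        samples: List[Sample]) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_region, train, test) with the test region unseen in training."""
    regions = sorted({str(s["region"]) for s in samples})
    for held_out in regions:
        train = [s for s in samples if s["region"] != held_out]
        test = [s for s in samples if s["region"] == held_out]
        yield held_out, train, test


def time_split(samples: List[Sample],
               first_test_year: int) -> Tuple[List[Sample], List[Sample]]:
    """Train on years before `first_test_year`, test on that year and later."""
    train = [s for s in samples if int(s["year"]) < first_test_year]
    test = [s for s in samples if int(s["year"]) >= first_test_year]
    return train, test


# Hypothetical sample index: one record per (region, year).
samples: List[Sample] = [
    {"region": r, "year": y}
    for r in ("coastal", "mountain", "plains")
    for y in range(2000, 2006)
]

for region, train, test in leave_one_region_out(samples):
    print(f"held out {region}: {len(train)} train / {len(test)} test")

train, test = time_split(samples, first_test_year=2004)
print(f"time split: {len(train)} train / {len(test)} test")
```

Random shuffling across space and time would leak near-duplicate neighbors into the test set and overstate skill; both splits above avoid that by construction. A pseudo-reality test follows the same pattern, with the "future" period drawn from a climate model simulation rather than from observations.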

12. Future Outlook: The Next Decade of ML in Climate Downscaling

The field of machine learning for climate downscaling is poised for continued rapid evolution in the coming decade. Building on the current momentum and the research priorities identified above, we expect several new paradigms to emerge, and sustained progress will depend on a small set of critical success factors.

12.1. Emerging Paradigms

12.1.1. Foundation Models for Climate Downscaling

Inspired by the success of large pre-trained foundation models in natural language processing (NLP) and computer vision, a similar paradigm is emerging in climate science [61]. These models, such as Prithvi WxC [16], FourCastNet [23], and ORBIT-2 [60], are trained on massive, diverse climate datasets (e.g., decades of reanalysis data like MERRA-2 or ERA5).

Potential Benefits:
- Enhanced Transfer Learning: These models could provide powerful, pre-trained representations of atmospheric and Earth system dynamics, enabling effective transfer learning to specific downscaling tasks across various regions, variables, and GCMs with significantly reduced data requirements for fine-tuning [61].
- Multi-Task Capabilities: Foundation models can be designed for multiple downstream tasks, including forecasting, downscaling, and parameterization learning, offering a versatile tool for climate modeling.
- Implicit Physical Knowledge: Through pre-training on vast datasets governed by physical laws, these models might implicitly learn and encode some degree of physical consistency, although explicit PIML techniques will likely still be necessary to guarantee it.
Challenges: Developing and training these massive models requires substantial computational resources and large, curated datasets. Ensuring their generalizability and avoiding the propagation of biases learned during pre-training are also critical research areas.

12.1.2. Hybrid Hierarchical and Multi-Scale Approaches

Future downscaling systems are likely to involve more sophisticated hierarchical and multi-scale modeling chains, combining the strengths of different approaches. This could involve the following:

Global ML models or foundation models providing coarse, bias-corrected boundary conditions.
Regional physics-informed ML models or RCM emulators operating at intermediate scales, incorporating more detailed regional physics.
Local stochastic generators or specialized ML models (e.g., for extreme events or specific microclimates) providing the final layer of high-resolution detail and variability.

This approach acknowledges that different processes dominate at different scales, and a single monolithic model may not be optimal for all aspects of downscaling.

12.1.3. Online Learning and Continuous Model Adaptation

Current ML downscaling models are typically trained offline on a fixed dataset. Future systems may incorporate online learning capabilities, allowing them to continuously update and adapt as new observational data become available or as the climate itself evolves.

Benefits: This could help mitigate the stationarity assumption by allowing models to learn changing relationships over time and improve their performance for ongoing or near-real-time downscaling applications.
Challenges: Ensuring model stability, avoiding catastrophic forgetting (where learning new data degrades performance on old data), and managing the computational demands of continuous retraining are significant hurdles.

12.1.4. Deep Integration of Causal Inference and Process Understanding

There will likely be a stronger push towards ML models that not only predict accurately but also provide insights into the causal mechanisms driving local climate phenomena. This involves developing techniques that can infer causal relationships from data and building models whose internal structures reflect known physical processes, moving beyond correlative relationships. This aligns with the need for more robust generalization and interpretability.

12.2. Critical Success Factors

Realizing the full potential of these emerging paradigms and advancing the field of ML-based climate downscaling will depend on several critical success factors:

Interdisciplinary Collaboration: Sustained and deep collaboration between climate scientists, ML researchers, statisticians, computational scientists, and domain experts from impact sectors is essential. Climate scientists bring crucial domain knowledge about physical processes and data characteristics, while ML experts provide algorithmic innovation.
Open Science Practices: The continued adoption of open science principles—including the sharing of code, datasets, model weights, and standardized evaluation frameworks—is vital for reproducibility, transparency, and accelerating collective progress [8]. Initiatives like CORDEX and CMIP6, which foster data sharing and model intercomparison, provide valuable models for the ML downscaling community [182,183].
Deep Stakeholder Engagement and Co-production Throughout the Lifecycle: Stakeholder engagement and co-design deserve particular emphasis, not merely as a desirable component but as an essential element integrated throughout the entire research, development, and deployment lifecycle of ML-based downscaling tools and climate services. True co-production goes beyond consultation: it involves iterative, sustained processes of relationship building, shared understanding, and joint output development with end-users and affected communities.
Actively involving end-users from diverse sectors (e.g., agriculture, water resource management, urban planning, public health, indigenous communities) from the very outset of ML downscaling projects offers profound benefits [165]:
- Ensuring Relevance and Actionability: Co-production helps ensure that ML downscaling efforts are targeted towards producing genuinely useful, context-specific, and actionable information that meets the actual needs of decision-makers rather than being solely technology-driven.
- Defining User-Relevant Evaluation Metrics: Collaboration with users can help define evaluation metrics and performance targets that reflect their specific decision contexts and thresholds of concern, moving beyond purely statistical measures to those that indicate practical utility.
- Building Trust and Facilitating Uptake: A transparent, demand-driven, and participatory development process fosters trust in the ML models and their outputs. When users are part of the creation process, they gain a better understanding of the model’s capabilities and limitations, which facilitates the responsible uptake and integration of ML-derived products into their decision-making frameworks.
- Addressing the “Trust Deficit”: By fostering a collaborative environment, co-production directly addresses the “trust deficit”. It allows for a two-way dialogue where the complexities, uncertainties, and assumptions inherent in ML downscaling are openly discussed and understood by both developers and users, leading to more realistic expectations and appropriate applications.
- Incorporating Local and Indigenous Knowledge: Participatory approaches can facilitate the integration of valuable local and indigenous knowledge systems with scientific data, leading to more holistic and effective adaptation strategies [184].

This deep engagement transforms the development of ML downscaling from a purely technical exercise into a collaborative endeavor aimed at producing societal value and supporting equitable climate resilience [165].

The next decade promises further exciting advancements in ML-based climate downscaling. By focusing on overcoming current limitations related to physical consistency, transferability, and uncertainty, and by embracing collaborative and open research practices, the field can move towards providing increasingly reliable and actionable high-resolution climate information to support societal adaptation to a changing climate.

13. Conclusions

This review synthesizes the rapid evolution of ML in climate downscaling, leading to the following key conclusions:

Methodological Maturation: The field has progressed from simple super-resolution CNNs to sophisticated generative architectures (GANs, diffusion models) that effectively resolve the texture-smoothing problems of earlier deterministic models.
The Physics Gap: Physical consistency remains the primary challenge. Purely data-driven models frequently violate conservation laws, necessitating the adoption of Physics-Informed Machine Learning (PIML) frameworks.
Generalization Risk: Non-stationarity under climate change poses a fundamental risk to statistical validity. Robust stress-testing using “perfect model” frameworks is essential to quantify extrapolation errors.
Operational Requirements: Widespread adoption requires moving beyond RMSE. Operational confidence demands robust Uncertainty Quantification (UQ) and Explainable AI (XAI) to transparently communicate limitations to decision-makers.

13.1. Quantitative Synthesis and Implications for Practice

Across recent multi-model evaluations, generalization failures under distribution shift are not merely qualitative: several studies report large degradations when models are transferred across GCMs or applied under out-of-distribution future scenarios, including substantially increased bias in projected change and near-doubling of error in some settings [18,120]. At the same time, the literature indicates that explicitly constraining models with physical structure remains comparatively rare, despite early demonstrations that hard constraints can improve physical consistency (e.g., conservation-related behavior) [27]. These quantitative patterns reinforce a central conclusion of this review: robust adoption requires (i) explicit OOD benchmarking, (ii) tail-aware evaluation beyond mean metrics, and (iii) uncertainty reporting that is calibrated and decision-relevant.
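To illustrate what a hard conservation constraint means in practice, the sketch below rescales a high-resolution field so that the mean over each block of fine cells equals the corresponding coarse-cell value. This multiplicative rescaling is one simple form of hard constraint, shown here as an assumption-laden illustration rather than as the exact method of the cited work; it assumes a non-negative variable such as precipitation, an integer upsampling factor, and equal-area grid cells.

```python
from typing import List

Grid = List[List[float]]


def enforce_block_mean(high_res: Grid, low_res: Grid, factor: int) -> Grid:
    """Rescale each factor x factor block so its mean matches the coarse cell."""
    out = [row[:] for row in high_res]
    for i, coarse_row in enumerate(low_res):
        for j, target in enumerate(coarse_row):
            rows = range(i * factor, (i + 1) * factor)
            cols = range(j * factor, (j + 1) * factor)
            block_mean = sum(high_res[r][c] for r in rows for c in cols) / factor ** 2
            for r in rows:
                for c in cols:
                    if block_mean > 0:
                        # Preserve the predicted spatial pattern, fix the total.
                        out[r][c] = high_res[r][c] * target / block_mean
                    else:
                        # All-zero block: fall back to a uniform field.
                        out[r][c] = target
    return out


# Hypothetical raw network output (2 x 4) and coarse input (1 x 2), factor 2.
raw = [[1.0, 3.0, 0.0, 0.0],
       [2.0, 2.0, 0.0, 0.0]]
coarse = [[4.0, 1.0]]

constrained = enforce_block_mean(raw, coarse, factor=2)
for row in constrained:
    print(row)
```

After the rescaling, the block means are 4.0 and 1.0 by construction, so the downscaled field cannot create or destroy mass relative to its coarse input, regardless of what the network predicted. A soft penalty in the loss function only encourages this property, whereas a constraint of this kind guarantees it.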

13.2. Roadmap: Minimum Validation Standards for Climate Services

To translate research priorities into implementable practice, we propose the following minimum diagnostic protocol (Table 9) for any ML downscaling product intended for climate-service or decision-support use:

14. Future Scope and Outstanding Challenges

The future trajectory of ML downscaling lies in the transition from bespoke regional models to global foundation models. The emerging generation of large-scale, pre-trained transformers (e.g., for weather forecasting) offers a path toward “zero-shot” downscaling capabilities. However, significant challenges remain in ensuring these “black boxes” are robust to the covariate shifts induced by climate change. Future research must prioritize the development of causal learning frameworks that disentangle correlation from causation, and the standardization of benchmarking protocols to facilitate transparent comparison. Ultimately, the integration of these ML components into operational digital twins represents the final frontier in closing the gap between global climate science and local adaptation action.

Funding

This research was funded by National Science Foundation grant numbers 2331908 and 2417849. The APC was funded by the National Science Foundation.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

During the preparation of this manuscript, the authors used generative AI tools (e.g., Gemini 2.5 and Undermind AI) to identify and search the relevant literature, suggest the structure of the manuscript, refine LaTeX formatting for tables and figures, and assist with paraphrasing and English-language editing. All claims derived from AI outputs were independently verified against the original sources. The authors reviewed and edited the AI-assisted content and assume full responsibility for the integrity and accuracy of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Vandal, T.; Kodra, E.; Ganguly, A.R. Intercomparison of machine learning methods for statistical downscaling: The case of daily and extreme precipitation. Theor. Appl. Climatol. 2019, 137, 557–576.
Rampal, N.; Hobeichi, S.; Gibson, P.B.; Baño-Medina, J.; Abramowitz, G.; Beucler, T.; González-Abad, J.; Chapman, W.; Harder, P.; Gutiérrez, J.M. Enhancing regional climate downscaling through advances in machine learning. Artif. Intell. Earth Syst. 2024, 3, 230066.
Maraun, D.; Wetterhall, F.; Ireson, A.M.; Chandler, R.E.; Kendon, E.J.; Widmann, M.; Brienen, S.; Rust, H.W.; Sauter, T.; Themeßl, M.; et al. Precipitation downscaling under climate change: Recent developments to bridge the gap between dynamical models and the end user. Rev. Geophys. 2010, 48, RG3003.
Gutiérrez, J.M.; Maraun, D.; Widmann, M.; Huth, R.; Hertig, E.; Benestad, R.; Roessler, O.; Wibig, J.; Wilcke, R.; Kotlarski, S.; et al. An intercomparison of a large ensemble of statistical downscaling methods over Europe: Results from the VALUE perfect predictor cross-validation experiment. Int. J. Climatol. 2019, 39, 3750–3785.
Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, F. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
Wang, L.; Li, Q.; Lv, Q.; Peng, X.; You, W. TemDeep: A self-supervised framework for temporal downscaling of atmospheric fields at arbitrary time resolutions. Geosci. Model Dev. 2025, 18, 2427–2442.
Lanzante, J.R.; Dixon, K.W.; Nath, M.J.; Whitlock, C.E.; Adams-Smith, D. Some Pitfalls in Statistical Downscaling of Future Climate. Bull. Am. Meteorol. Soc. 2018, 99, 791–803.
Quesada-Chacón, D.; Stöger, J.; Güntner, A.; Bernhofer, C. Repeatable high-resolution statistical downscaling through deep learning. Geosci. Model Dev. 2022, 15, 7353–7370.
Lopez-Gomez, I.; Wan, Z.Y.; Zepeda-Núñez, L.; Schneider, T.; Anderson, J.; Sha, F. Dynamical-generative downscaling of climate model ensembles. Proc. Natl. Acad. Sci. USA 2025, 122, e2420288122.
Baño Medina, J.; Manzanas, R.; Cimadevilla, E.; Fernández, J.; González-Abad, J.; Cofiño, A.S.; Gutiérrez, J.M. Downscaling multi-model climate projection ensembles with deep learning (DeepESD): Contribution to CORDEX EUR-44. Geosci. Model Dev. 2022, 15, 6747–6758.
Baño-Medina, J.; Manzanas, R.; Gutiérrez, J.M. Configuration and intercomparison of deep learning neural models for statistical downscaling. Geosci. Model Dev. 2020, 13, 2109–2124.
Pall, P.; Allen, M.; Stone, D.A. Testing the Clausius–Clapeyron constraint on changes in extreme precipitation under CO₂ warming. Clim. Dyn. 2007, 28, 351–363.
Vandal, T.; Kodra, E.; Gosh, S.; Ganguly, A.R. DeepSD: Generating High-Resolution Climate Change Projections Through Single Image Super-Resolution. arXiv 2017, arXiv:1703.03126.
Maraun, D.; Shepherd, T.G.; Widmann, M.; Zappa, G.; Walton, D.; Gutiérrez, J.M.; Hagemann, S.; Richter, I.; Soares, P.M.; Hall, A.; et al. Towards process-informed bias correction of climate change simulations. Nat. Clim. Change 2017, 7, 764–773.
Curran, D.; Saleem, H.; Hobeichi, S.; Salim, F.D. Resolution-Agnostic Transformer-based Climate Downscaling. arXiv 2024, arXiv:2411.14774.
Schmude, J.; Roy, S.; Trojak, W.; Jakubik, J.; Civitarese, D.S.; Singh, S.; Kuehnert, J.; Ankur, K.; Gupta, A.; Phillips, C.E.; et al. Prithvi wxc: Foundation model for weather and climate. arXiv 2024, arXiv:2409.13598.
Prasad, A.; Harder, P.; Yang, Q.; Sattegeri, P.; Szwarcman, D.; Watson, C.; Rolnick, D. Evaluating the transferability potential of deep learning models for climate downscaling. arXiv 2024, arXiv:2407.12517.
Legasa, M.N.; Manzanas, R.; Gutiérrez, J.M. Assessing Three Perfect Prognosis Methods for Statistical Downscaling of Climate Change Precipitation Scenarios. Geophys. Res. Lett. 2023, 50, e2022GL102525.
Rampal, N.; Gibson, P.B.; Sherwood, S.; Abramowitz, G. On the extrapolation of generative adversarial networks for downscaling precipitation extremes in warmer climates. Geophys. Res. Lett. 2024, 51, e2024GL112492.
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
Wang, F.; Tian, D.; Lowe, L.; Kalin, L.; Lehrter, J. Deep Learning for Daily Precipitation and Temperature Downscaling. Water Resour. Res. 2021, 57, e2020WR029308.
Quesada-Chacón, D.; Stöger, J.; Güntner, A.; Bernhofer, C. Downscaling CORDEX Through Deep Learning to Daily 1 km Multivariate Ensemble in Complex Terrain. Earth’s Future 2023, 11, e2023EF003531.
Pathak, J.; Subramanian, S.; Harrington, P.; Raja, S.; Chattopadhyay, A.; Mardani, M.; Kurth, T.; Hall, D.; Li, Z.; Azizzadenesheli, K.; et al. FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators. arXiv 2022, arXiv:2202.11214.
Kumar, R.; Sharma, T.; Vaghela, V.; Jha, S.K.; Agarwal, A. PrecipFormer: Efficient Transformer for Precipitation Downscaling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Tucson, AZ, USA, 26 February–6 March 2025; pp. 489–497.
Beucler, T.; Rasp, S.; Pritchard, M.; Gentine, P. Achieving conservation of energy in neural network emulators for climate modeling. arXiv 2019, arXiv:1906.06622.
Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
Harder, P.; Jha, S.; Rolnick, D. Hard-Constrained Deep Learning for Climate Downscaling. J. Mach. Learn. Res. 2022, 23, 1–38.
Leinonen, J.; Nerini, D.; Berne, A. Stochastic Super-Resolution for Downscaling Time-Evolving Atmospheric Fields with a GAN. In Proceedings of the ECML/PKDD Workshop on ClimAI, Virtual, 14–18 September 2020.
Price, I.; Rasp, S. Increasing the Accuracy and Resolution of Precipitation Forecasts Using Deep Generative Models. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, Valencia, Spain, 28–30 March 2022.
Rampal, N.; Gibson, P.B.; Sood, A.; Stuart, S.; Fauchereau, N.C.; Brandolino, C.; Noll, B.; Meyers, T. High-resolution downscaling with interpretable deep learning: Rainfall extremes over New Zealand. Weather Clim. Extrem. 2022, 38, 100525.
Tomasi, E.; Franch, G.; Cristoforetti, M. Can AI be enabled to perform dynamical downscaling? A latent diffusion model to mimic kilometer-scale COSMO5.0_CLM9 simulations. Geosci. Model Dev. 2025, 18, 2051–2078.
Srivastava, P.; El Helou, A.; Vilalta, R.; Li, H.W.; Kumar, V.; Mandt, S. Precipitation Downscaling with Spatiotemporal Video Diffusion. In Advances in Neural Information Processing Systems 37, Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024; Curran Associates, Inc.: San Jose, CA, USA, 2024; pp. 19327–19340.
Liu, Y.; Doss-Gollin, J.; Balakrishnan, G.; Veeraraghavan, A. Generative Precipitation Downscaling using Score-based Diffusion with Wasserstein Regularization. arXiv 2024, arXiv:2410.00381.
Tripathi, S.; Srinivas, V.V.; Nanjundiah, R.S. Downscaling of precipitation for climate change scenarios: A support vector machine approach. J. Hydrol. 2006, 330, 621–640.
He, X.; Chaney, N.W.; Schleiss, M.; Sheffield, J. Spatial downscaling of precipitation using adaptable random forests. Water Resour. Res. 2016, 52, 8217–8237.
Cannon, A.J. Quantile regression neural networks: Implementation in R and application to precipitation downscaling. Comput. Geosci. 2011, 37, 1277–1284.
Vaughan, A.; Adamson, H.; Tak-Chu, L.; Turner, R.E. Convolutional conditional neural processes for local climate downscaling. arXiv 2021, arXiv:2101.07857.
Baño-Medina, J.; Manzanas, R.; Gutiérrez, J.M. On the suitability of deep convolutional neural networks for continental-wide downscaling of climate change projections. Clim. Dyn. 2021, 57, 2941–2951.
Soares, P.M.M.; Johannsen, F.; Lima, D.C.A.; Lemos, G.; Bento, V.A.; Bushenkova, A. High-resolution downscaling of CMIP6 Earth system and global climate models using deep learning for Iberia. Geosci. Model Dev. 2024, 17, 229–257.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Proceedings of the International Workshop, Granada, Spain, 20 September 2018; Springer: Cham, Switzerland, 2018; Volume 11045, pp. 3–11.
Liu, J.; Shi, C.; Ge, L.; Tie, R.; Chen, X.; Zhou, T.; Gu, X.; Shen, Z. Enhanced Wind Field Spatial Downscaling Method Using UNET Architecture and Dual Cross-Attention Mechanism. Remote Sens. 2024, 16, 1867. [Google Scholar] [CrossRef]
Pasula, A.; Subramani, D.N. Global Climate Model Bias Correction Using Deep Learning. arXiv 2025, arXiv:2504.19145. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar] [CrossRef]
Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140. [Google Scholar] [CrossRef]
Iotti, M.; Davini, P.; von Hardenberg, J.; Zappa, G. RainScaleGAN: A Conditional Generative Adversarial Network for Rainfall Downscaling. Artif. Intell. Earth Syst. 2025, 4, e240074. [Google Scholar] [CrossRef]
Accarino, G.; De Rubeis, T.D.; Falcucci, G.; Ubaldi, E.; Aloisio, G. MSG-GAN-SD: A Multi-Scale Gradients GAN for Statistical Downscaling of 2-Meter Temperature over the EURO-CORDEX Domain. AI 2021, 2, 600–620. [Google Scholar] [CrossRef]
Glawion, L.; Polz, J.; Kunstmann, H.; Fersch, B.; Chwala, C. Global spatio-temporal ERA5 precipitation downscaling to km and sub-hourly scale using generative AI. npj Clim. Atmos. Sci. 2025, 8, 219. [Google Scholar] [CrossRef]
Kang, M.; Shin, J.; Park, J. StudioGAN: A taxonomy and benchmark of GANs for image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15725–15742. [Google Scholar] [CrossRef]
Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119. [Google Scholar]
Stengel, K.A.; Glaws, A.; Hettinger, D.; King, R.N. Adversarial super-resolution of climatological wind and solar data. Proc. Natl. Acad. Sci. USA 2020, 117, 16805–16815. [Google Scholar] [CrossRef]
National Renewable Energy Laboratory. Sup3rCC: Super-Resolution for Renewable Energy Resource Data with Climate Change Impacts. Available online: https://www.nrel.gov/analysis/sup3rcc (accessed on 27 May 2025).
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
Berry, L.; Brando, A.; Meger, D. Shedding light on large generative networks: Estimating epistemic uncertainty in diffusion models. In Proceedings of the 40th Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, 15–18 July 2024. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: San Jose, CA, USA, 2017; pp. 5998–6008. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
Zhong, X.; Du, F.; Chen, L.; Wang, Z.; Li, H. Investigating transformer-based models for spatial downscaling and correcting biases of near-surface temperature and wind-speed forecasts. Q. J. R. Meteorol. Soc. 2024, 150, 275–289. [Google Scholar] [CrossRef]
Sinha, S.; Benton, B.; Emami, P. On the effectiveness of neural operators at zero-shot weather downscaling. Environ. Data Sci. 2025, 4, e21. [Google Scholar] [CrossRef]
Wang, X.; Choi, J.Y.; Kurihaya, T.; Lyngaas, I.; Yoon, H.J.; Fan, M.; Nafi, N.M.; Tsaris, A.; Aji, A.M.; Hossain, M.; et al. ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling. arXiv 2025, arXiv:2505.04802. [Google Scholar] [CrossRef]
Shi, J.; Shirali, A.; Jin, B.; Zhou, S.; Hu, W.; Rangaraj, R.; Wang, S.; Han, J.; Wang, Z.; Lall, U.; et al. Deep Learning and Foundation Models for Weather Prediction: A Survey. arXiv 2025, arXiv:2501.06907. [Google Scholar] [CrossRef]
Coordinated Regional Climate Downscaling Experiment (CORDEX). Task Force on Machine Learning. 2024. Describes Ongoing Task Force Activities. Last Website Update Noted as 2025. Available online: https://cordex.org/strategic-activities/taskforces/task-force-on-machine-learning/ (accessed on 26 May 2025).
Hobeichi, S.; Nishant, N.; Shao, Y.; Abramowitz, G.; Pitman, A.; Sherwood, S.; Bishop, C.; Green, S. Using machine learning to cut the cost of dynamical downscaling. Earth’s Future 2023, 11, e2022EF003291. [Google Scholar] [CrossRef]
Ghosh, S. SVM-PGSL coupled approach for statistical downscaling to predict rainfall from GCM output. J. Geophys. Res. Atmos. 2010, 115. [Google Scholar] [CrossRef]
González-Abad, J.; Baño-Medina, J.; Gutiérrez, J.M. Using Explainability to Inform Statistical Downscaling Based on Deep Learning Beyond Standard Validation Approaches. J. Adv. Model. Earth Syst. 2023, 15, e2023MS003641. [Google Scholar] [CrossRef]
Daw, A.; Karpatne, A.; Watkins, W.D.; Read, J.S.; Kumar, V. Physics-guided neural networks (PGNN): An application in lake temperature modeling. In Knowledge Guided Machine Learning; Chapman and Hall/CRC: Boca Raton, FL, USA, 2022; pp. 353–372. [Google Scholar]
Beucler, T.; Pritchard, M.; Rasp, S.; Ott, J.; Baldi, P.; Gentine, P. Enforcing analytic constraints in neural networks emulating physical systems. Phys. Rev. Lett. 2021, 126, 098302. [Google Scholar] [CrossRef]
Schuster, G.T.; Chen, Y.; Feng, S. Review of physics-informed machine-learning inversion of geophysical data. Geophysics 2024, 89, T337–T356. [Google Scholar] [CrossRef]
González-Abad, J.; Baño-Medina, J. Deep Ensembles to Improve Uncertainty Quantification of Statistical Downscaling Models under Climate Change Conditions. arXiv 2023, arXiv:2305.00975. [Google Scholar] [CrossRef]
Xiang, L.; Hu, P.; Wang, F.; Yu, J.; Zhang, L. A Novel Reference-Based and Gradient-Guided Deep Learning Model for Daily Precipitation Downscaling. Atmosphere 2022, 13, 511. [Google Scholar] [CrossRef]
Boateng, D.; Mutz, S.G. pyESDv1.0.1: An open-source Python framework for empirical-statistical downscaling of climate information. Geosci. Model Dev. Discuss. 2023, 16, 6479–6514. [Google Scholar] [CrossRef]
Wang, Z.; Bugliaro, L.; Gierens, K.; Hegglin, M.I.; Rohs, S.; Petzold, A.; Kaufmann, S.; Voigt, C. Machine learning for improvement of upper tropospheric relative humidity in ERA5 weather model data. Atmos. Chem. Phys. 2025, 25, 2845–2861. [Google Scholar] [CrossRef]
Daly, C.; Halbleib, M.; Smith, J.I.; Gibson, W.P.; Doggett, M.K.; Taylor, G.H.; Curtis, J.; Pasteris, P.P. Physiographically sensitive mapping of climatological temperature and precipitation across the conterminous United States. Int. J. Climatol. 2008, 28, 2031–2064. [Google Scholar] [CrossRef]
Herrera, S.; Cardoso, R.M.; Soares, P.M.; Espírito-Santo, F.; Viterbo, P.; Gutiérrez, J.M. Iberia01: A new gridded dataset of daily precipitation and temperatures over Iberia. Earth Syst. Sci. Data 2019, 11, 1947–1971. [Google Scholar] [CrossRef]
Cornes, R.C.; van der Schrier, G.; van den Besselaar, E.J.M.; Jones, P.D. An Ensemble Version of the E-OBS Temperature and Precipitation Data Sets. J. Geophys. Res. Atmos. 2018, 123, 9391–9409. [Google Scholar] [CrossRef]
Technische Universität Dresden. Regionales Klimainformationssystem Sachsen (ReKIS). General Project Portal. A Summary Document “Climate_datasets_Zusammenfassung.pdf” Is Available from the Portal. 2023. Available online: https://rekis.hydro.tu-dresden.de/ (accessed on 26 May 2025).
Huffman, G.J.; Bolvin, D.T.; Braithwaite, D.; Hsu, K.; Joyce, R.J.; Kidd, C.; Nelkin, E.J.; Sorooshian, S.; Tan, J.; Xie, P. Algorithm Theoretical Basis Document (ATBD), Version 06.3; Integrated Multi-satellitE Retrievals for GPM (IMERG) Algorithm Theoretical Basis Document (ATBD); Technical Report; NASA Goddard Space Flight Center: Washington, DC, USA, 2020. Available online: https://gpm.nasa.gov/resources/documents/algorithm-information/IMERG-V06-ATBD (accessed on 26 May 2025).
Entekhabi, D.; Njoku, E.G.; O’Neill, P.E.; Kellogg, K.H.; Crow, W.T.; Edelstein, W.N.; Entin, J.K.; Goodman, S.D.; Jackson, T.J.; Johnson, J.T.; et al. The Soil Moisture Active Passive (SMAP) Mission. Proc. IEEE 2010, 98, 704–716. [Google Scholar] [CrossRef]
Pastorello, G.; Trotta, C.; Canfora, E.; Chu, H.; Christianson, D.; Cheah, Y.W.; Poindexter, C.; Chen, J.; Elbashandy, A.; Humphrey, M.; et al. The FLUXNET2015 dataset and the ONEFlux processing pipeline for eddy covariance data. Sci. Data 2020, 7, 225. [Google Scholar] [CrossRef]
Sishah, S.; Abrahem, T.; Azene, G.; Dessalew, A.; Hundera, H. Downscaling and validating SMAP soil moisture using a machine learning algorithm over the Awash River basin, Ethiopia. PLoS ONE 2023, 18, e0279895. [Google Scholar] [CrossRef]
Sha, Y.; Stull, R.; Ghafarian, P.; Ou, T.; Gultepe, I. Deep-Learning-Based Gridded Downscaling of Surface Meteorological Variables in Complex Terrain. Part I: Daily Maximum and Minimum 2-m Temperature. J. Appl. Meteorol. Climatol. 2020, 59, 2057–2073. [Google Scholar] [CrossRef]
Sarafanov, M.; Kazakov, E.; Nikitin, N.O.; Kalyuzhnaya, A.V. A Machine Learning Approach for Remote Sensing Data Gap-Filling with Open-Source Implementation: An Example Regarding Land Surface Temperature, Surface Albedo and NDVI. Remote Sens. 2020, 12, 3865. [Google Scholar] [CrossRef]
Huang, X. Evaluating Loss Functions and Learning Data Pre-Processing for Climate Downscaling Deep Learning Models. arXiv 2023, arXiv:2306.11144. [Google Scholar] [CrossRef]
Choi, H.; Kim, Y.; Kim, D. Enhancing Extreme Rainfall Nowcasting with Weighted Loss Functions in Deep Learning Models. EGU General Assembly 2025, EGU25-19416. Available online: https://meetingorganizer.copernicus.org/EGU25/EGU25-19416.html (accessed on 26 May 2025).
Fallah, B.; Rakhshandehroo, G.R.; Berg, P.; Wulfmeyer, V.; Hattermann, F.F. Climate model downscaling in central Asia: A dynamical and a neural network approach. Geosci. Model Dev. 2025, 18, 161–180. [Google Scholar] [CrossRef]
Roberts, N.M.; Lean, H.W. Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Mon. Weather Rev. 2008, 136, 78–97. [Google Scholar] [CrossRef]
Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 2007, 102, 359–378. [Google Scholar] [CrossRef]
Annau, N.J.; Cannon, A.J.; Monahan, A.H. Algorithmic Hallucinations of Near-Surface Winds: Statistical Downscaling with GANs to Convection-Permitting Scales. Artif. Intell. Earth Syst. 2023, 2, e230015. [Google Scholar] [CrossRef]
Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: San Jose, CA, USA, 2017. [Google Scholar]
Harris, L.; McRae, A.T.T.; Chantry, M.; Dueben, P.D.; Palmer, T.N. A Generative Deep Learning Approach to Stochastic Downscaling of Precipitation Forecasts. J. Adv. Model. Earth Syst. 2022, 14, e2022MS003120. [Google Scholar] [CrossRef]
Marzban, C.; Sandgathe, S. Verification with variograms. Weather Forecast. 2009, 24, 1102–1120. [Google Scholar] [CrossRef]
Davis, C.; Brown, B.; Bullock, R. Object-based verification of precipitation forecasts. Part I: Methodology and application to mesoscale rain areas. Mon. Weather Rev. 2006, 134, 1772–1784. [Google Scholar] [CrossRef]
Davis, C.; Brown, B.; Bullock, R. Object-based verification of precipitation forecasts. Part II: Application to convective rain systems. Mon. Weather Rev. 2006, 134, 1785–1795. [Google Scholar] [CrossRef]
Huth, R.; Kyselý, J.; Pokorná, L. A GCM simulation of heat waves, dry spells, and their relationships to circulation. Clim. Change 2000, 46, 29–60. [Google Scholar] [CrossRef]
Mendes, D.; Marengo, J.A. Temporal downscaling: A comparison between artificial neural network and autocorrelation techniques over the Amazon Basin in present and future climate change scenarios. Theor. Appl. Climatol. 2010, 100, 413–421. [Google Scholar] [CrossRef]
Zolina, O.; Simmer, C.; Belyaev, K.; Gulev, S.K.; Koltermann, P. Changes in the duration of European wet and dry spells during the last 60 years. J. Clim. 2013, 26, 2022–2047. [Google Scholar] [CrossRef]
Fall, C.M.N.; Lavaysse, C.; Drame, M.S.; Panthou, G.; Gaye, A.T. Wet and dry spells in Senegal: Comparison of detection based on satellite products, reanalysis, and in situ estimates. Nat. Hazards Earth Syst. Sci. 2021, 21, 1051–1069. [Google Scholar] [CrossRef]
Coles, S.G. An Introduction to Statistical Modeling of Extreme Values; Springer Series in Statistics; Springer: London, UK, 2001. [Google Scholar] [CrossRef]
Vissio, G.; Lembo, V.; Lucarini, V.; Ghil, M. Evaluating the performance of climate models based on Wasserstein distance. Geophys. Res. Lett. 2020, 47, e2020GL089385. [Google Scholar] [CrossRef]
Perkins, S.; Pitman, A.; Holbrook, N.J.; McAneney, J. Evaluation of the AR4 climate models’ simulated daily maximum temperature, minimum temperature, and precipitation over Australia using probability density functions. J. Clim. 2007, 20, 4356–4376. [Google Scholar] [CrossRef]
Sha, Y.; Stull, R.; Ghafarian, P.; Ou, T.; Gultepe, I. Deep-Learning-Based Gridded Downscaling of Surface Meteorological Variables in Complex Terrain. Part II: Daily Precipitation. J. Appl. Meteorol. Climatol. 2020, 59, 2075–2092. [Google Scholar] [CrossRef]
Wood, A.W.; Leung, L.R.; Sridhar, V.; Lettenmaier, D.P. Hydrologic implications of dynamical and statistical approaches to downscaling climate model outputs. Clim. Change 2004, 62, 189–216. [Google Scholar] [CrossRef]
Pierce, D.W.; Cayan, D.R.; Thrasher, B.L. Statistical downscaling using localized constructed analogs (LOCA). J. Hydrometeorol. 2014, 15, 2558–2585. [Google Scholar] [CrossRef]
Doblas-Reyes, F.J.; Sörensson, A.A.; Almazroui, M.; Dosio, A.; Gutowski, W.J.; Haarsma, R.; Hamdi, R.; Hewitson, B.; Kwon, W.-T.; Lamptey, B.L.; et al. Linking Global to Regional Climate Change. In Climate Change 2021: The Physical Science Basis; Contribution of Working Group I to the Sixth Assessment Report of the IPCC; Cambridge University Press: Cambridge, UK, 2021; pp. 1363–1512. [Google Scholar] [CrossRef]
Basile, S.; Crimmins, A.R.; Avery, C.W.; Hamlington, B.D.; Kunkel, K.E. Appendix 3. Scenarios and Datasets. In Fifth National Climate Assessment; USGCRP: Washington, DC, USA, 2023. [Google Scholar] [CrossRef]
Harilal, N.; Singh, M.; Bhatia, U. Augmented Convolutional LSTMs for Generation of High-Resolution Climate Change Projections. IEEE Access 2021, 9, 25208–25218. [Google Scholar] [CrossRef]
Maraun, D.; Widmann, M.; Gutierrez, J.M.; Kotlarski, S.; Chandler, R.E.; Hertig, E.; Huth, R.; Wibig, J.; Wilcke, R.A.I.; Themeßl, M.J.; et al. VALUE: A framework to validate downscaling approaches for climate change studies. Earth’s Future 2015, 3, 1–14. [Google Scholar] [CrossRef]
Pérez, A.; Santa Cruz, M.; San Martín, D.; Gutiérrez, J.M. Transformer-based super-resolution downscaling for regional reanalysis: Full domain vs tiling approaches. arXiv 2024, arXiv:2410.12728. [Google Scholar]
Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. CSUR 2014, 46, 44. [Google Scholar] [CrossRef]
Geirhos, R.; Jacobsen, J.H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut learning in deep neural networks. Nat. Mach. Intell. 2020, 2, 665–673. [Google Scholar] [CrossRef]
Cavaiola, M.; Tuju, P.E.; Mazzino, A. Accurate and efficient AI-assisted paradigm for adding granularity to ERA5 precipitation reanalysis. Sci. Rep. 2024, 14, 26158. [Google Scholar] [CrossRef]
Legasa, M.; Manzanas, R.; Calviño, A.; Gutiérrez, J.M. A posteriori random forests for stochastic downscaling of precipitation by predicting probability distributions. Water Resour. Res. 2022, 58, e2021WR030272. [Google Scholar] [CrossRef]
Baño-Medina, J. Understanding deep learning decisions in statistical downscaling models. In Proceedings of the 10th International Conference on Climate Informatics, Virtual, 22–25 September 2020; pp. 79–85. [Google Scholar]
Boulaguiem, Y.; Zscheischler, J.; Vignotto, E.; van der Wiel, K.; Engelke, S. Modeling and simulating spatial extremes by combining extreme value theory with generative adversarial networks. Environ. Data Sci. 2022, 1, e5. [Google Scholar] [CrossRef]
Lee, J.; Park, S.Y. WGAN-GP-Based Conditional GAN with Extreme Critic for Precipitation Downscaling in a Key Agricultural Region of the Northeastern U.S. IEEE Access 2025, 13, 46030–46041. [Google Scholar] [CrossRef]
Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.k.; Woo, W.c. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Advances in Neural Information Processing Systems 28, Proceedings of the 29th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Curran Associates, Inc.: San Jose, CA, USA, 2015; pp. 802–810. [Google Scholar]
Miao, Q.; Liu, Y.; Liu, T.; Sorooshian, S. Improving Monsoon Precipitation Prediction Using Combined Convolutional and Long Short Term Memory Neural Network. Water 2019, 11, 977. [Google Scholar] [CrossRef]
Anh, D.T.; Bae, D.J.; Jung, K. Downscaling rainfall using deep learning LSTM and feedforward neural networks. Int. J. Climatol. 2019, 39, 2502–2518. [Google Scholar] [CrossRef]
Yang, F.; Ye, Q.; Wang, K.; Sun, L. Successful Precipitation Downscaling Through an Innovative Transformer-Based Model. Remote Sens. 2024, 16, 4292. [Google Scholar] [CrossRef]
Hernanz, A.; Rodriguez-Camino, E.; Navascués, B.; Gutiérrez, J.M. On the limitations of deep learning for statistical downscaling of climate change projections: The transferability and the extrapolation issues. Atmos. Sci. Lett. 2024, 25, e1195. [Google Scholar] [CrossRef]
Vandal, T.; Kodra, E.; Gosh, S.; Gunter, L.; Gonzalez, J.; Ganguly, A.R. Statistical downscaling of global climate models with image super-resolution and uncertainty quantification. arXiv 2018, arXiv:1811.03605. [Google Scholar] [CrossRef]
Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
Székely, G.J.; Rizzo, M.L. Energy statistics: A class of statistics based on distances. J. Stat. Plan. Inference 2013, 143, 1249–1272. [Google Scholar] [CrossRef]
Dutta, S.; Innan, N.; Yahia, S.B.; Shafique, M. AQ-PINNs: Attention-Enhanced Quantum Physics-Informed Neural Networks for Carbon-Efficient Climate Modeling. arXiv 2024, arXiv:2409.01522. [Google Scholar]
Radke, T.; Fuchs, S.; Wilms, C.; Polkova, I.; Rautenhaus, M. Explaining neural networks for detection of tropical cyclones and atmospheric rivers in gridded atmospheric simulation data. Geosci. Model Dev. 2025, 18, 1017–1039. [Google Scholar] [CrossRef]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: San Jose, CA, USA, 2017; pp. 4765–4774. [Google Scholar]
van Zyl, C.; Ye, X.; Naidoo, R. Harnessing eXplainable artificial intelligence for feature selection in time series energy forecasting: A comparative analysis of Grad-CAM and SHAP. Appl. Energy 2024, 353, 122079. [Google Scholar] [CrossRef]
O’Loughlin, R.J.; Li, D.; Neale, R.; O’Brien, T.A. Moving beyond post hoc explainable artificial intelligence: A perspective paper on lessons learned from dynamical climate modeling. Geosci. Model Dev. 2025, 18, 787–802. [Google Scholar] [CrossRef]
Mamalakis, A.; Barnes, E.A.; Ebert-Uphoff, I. Investigating the fidelity of explainable artificial intelligence methods for applications of convolutional neural networks in geoscience. Artif. Intell. Earth Syst. 2022, 1, e220012. [Google Scholar] [CrossRef]
Gupta, H.V.; Kling, H.; Yilmaz, K.K.; Martinez, G.F. Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling. J. Hydrol. 2009, 377, 80–91. [Google Scholar] [CrossRef]
Zscheischler, J.; Westra, S.; Van Den Hurk, B.J.; Seneviratne, S.I.; Ward, P.J.; Pitman, A.; AghaKouchak, A.; Bresch, D.N.; Leonard, M.; Wahl, T.; et al. Future climate risk from compound events. Nat. Clim. Change 2018, 8, 469–477. [Google Scholar] [CrossRef]
Zscheischler, J.; Martius, O.; Westra, S.; Bevacqua, E.; Raymond, C.; Horton, R.M.; van den Hurk, B.; AghaKouchak, A.; Jézéquel, A.; Mahecha, M.D.; et al. A typology of compound weather and climate events. Nat. Rev. Earth Environ. 2020, 1, 333–347. [Google Scholar] [CrossRef]
Mazdiyasni, O.; AghaKouchak, A. Substantial increase in concurrent droughts and heatwaves in the United States. Proc. Natl. Acad. Sci. USA 2015, 112, 11484–11489. [Google Scholar] [CrossRef]
Addison, H.; Kendon, E.; Ravuri, S.; Aitchison, L.; Watson, P.A. Machine learning emulation of a local-scale UK climate model. arXiv 2022, arXiv:2211.16116. [Google Scholar] [CrossRef]
Gerges, F.; Boufadel, M.C.; Bou-Zeid, E.; Nassif, H.; Wang, J.T.L. A Novel Bayesian Deep Learning Approach to the Downscaling of Wind Speed with Uncertainty Quantification. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 26th Pacific-Asia Conference, PAKDD, Chengdu, China, 16–19 May 2022; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13281, pp. 55–66. [Google Scholar] [CrossRef]
Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; PMLR: Cambridge, MA, USA, 2016; Volume 48, pp. 1050–1059. [Google Scholar]
Gerges, F.; Boufadel, M.C.; Bou-Zeid, E.; Nassif, H.; Wang, J.T.L. Bayesian Multi-Head Convolutional Neural Networks with Bahdanau Attention for Forecasting Daily Precipitation in Climate Change Monitoring. In Machine Learning and Knowledge Discovery in Databases, Proceedings of the European Conference, ECML PKDD 2022, Grenoble, France, 19–23 September 2022; Cerquitelli, T., Monreale, A., Mikut, R., Moccia, S., Raedt, L.D., Eds.; Proceedings, Part V; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13717, pp. 416–431. [Google Scholar] [CrossRef]
Merkel, D. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J. 2014, 2014, 2. [Google Scholar]
Dormann, C.F.; Elith, J.; Bacher, S.; Buchmann, C.; Carl, G.; Carré, G.; Marquéz, J.R.G.; Gruber, B.; Lafourcade, B.; Leitão, P.J.; et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography 2013, 36, 27–46. [Google Scholar] [CrossRef]
O’Brien, R.M. A caution regarding rules of thumb for variance inflation factors. Qual. Quant. 2007, 41, 673–690. [Google Scholar] [CrossRef]
Reimers, N.; Gurevych, I. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. arXiv 2017, arXiv:1707.09861. [Google Scholar] [CrossRef]
Dodge, J.; Gururangan, S.; Card, D.; Schwartz, R.; Smith, N.A. Show your work: Improved reporting of experimental results. arXiv 2019, arXiv:1909.03004. [Google Scholar] [CrossRef]
Cohen, J.; Cohen, P.; West, S.G.; Aiken, L.S. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences; Routledge: Boca Raton, FL, USA, 2013. [Google Scholar]
Düsterhus, A.; Hense, A. Advanced information criterion for environmental data quality assurance. Adv. Sci. Res. 2012, 8, 99–104. [Google Scholar] [CrossRef]
Kling, H.; Fuchs, M.; Paulin, M. Runoff conditions in the upper Danube basin under an ensemble of climate change scenarios. J. Hydrol. 2012, 424–425, 264–277. [Google Scholar] [CrossRef]
Roberts, D.R.; Bahn, V.; Ciuti, S.; Boyce, M.S.; Elith, J.; Guillera-Arroita, G.; Hauenstein, S.; Lahoz-Monfort, J.J.; Schröder, B.; Thuiller, W.; et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 2017, 40, 913–929. [Google Scholar] [CrossRef]
Valavi, R.; Elith, J.; Lahoz-Monfort, J.J.; Guillera-Arroita, G. blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. bioRxiv 2018, 357798. [Google Scholar] [CrossRef]
Mahoney, M.J.; Johnson, L.K.; Silge, J.; Frick, H.; Kuhn, M.; Beier, C.M. Assessing the performance of spatial cross-validation approaches for models of spatially structured data. arXiv 2023, arXiv:2303.07334. [Google Scholar] [CrossRef]
Brogli, R.; Heim, C.; Mensch, J.; Sørland, S.L.; Schär, C. The pseudo-global-warming (PGW) approach: Methodology, software package PGW4ERA5 v1.1, validation, and sensitivity analyses. Geosci. Model Dev. 2023, 16, 907–926. [Google Scholar] [CrossRef]
Climate Change AI. Data Gaps (Beta). Available online: https://www.climatechange.ai/dev/datagaps (accessed on 27 May 2025).
World Climate Research Programme. WCRP Grand Challenges (Ended in 2022). Official Community Theme Summary Page. 2022. Available online: https://www.wcrp-climate.org/component/content/category/26-grand-challenges (accessed on 13 August 2025).
Hersbach, H. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather Forecast. 2000, 15, 559–570. [Google Scholar] [CrossRef]
Brier, G.W. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 1950, 78, 1–3. [Google Scholar]
European High-Level Expert Group on Artificial Intelligence. Ethics Guidelines for Trustworthy AI; European Commission, Digital Strategy: Brussels, Belgium, 2019. [Google Scholar]
National Institute of Standards and Technology (NIST). Artificial Intelligence Risk Management Framework (AI RMF 1.0). 2023. Available online: https://nvlpubs.nist.gov/nistpubs/ai/nist.ai (accessed on 27 May 2025).
Adams, P.; Hewitson, B.; Vaughan, C.; Wilby, R.; Zebiak, S.; Eitland, E.; Shumake-Guillemot, J. Call for an ethical framework for climate services. WMO Bull. 2015, 64, 51–54. [Google Scholar]
Mastrandrea, M.D.; Field, C.B.; Stocker, T.F.; Edenhofer, O.; Ebi, K.L.; Frame, D.J.; Held, H.; Kriegler, E.; Mach, K.J.; Matschoss, P.R.; et al. Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties; IPCC: Geneva, Switzerland, 2010. [Google Scholar]
CORDEX. CORDEX Experiment Design for Dynamical Downscaling of CMIP6. Available online: https://cordex.org/wp-content/uploads/2021/02/CORDEX-CMIP6_exp_design_draft_SOD_ln.pdf (accessed on 1 June 2025).
Sørland, S.L.; Schär, C.; Lüthi, D.; Kjellström, E. Bias patterns and climate change signals in GCM-RCM model chains. Environ. Res. Lett. 2018, 13, 074017. [Google Scholar] [CrossRef]
Diez-Sierra, J.; Iturbide, M.; Gutiérrez, J.M.; Fernández, J.; Milovac, J.; Cofiño, A.S.; Cimadevilla, E.; Nikulin, G.; Levavasseur, G.; Kjellström, E.; et al. The worldwide C3S CORDEX grand ensemble: A major contribution to assess regional climate change in the IPCC AR6 Atlas. Bull. Am. Meteorol. Soc. 2022, 103, E2804–E2826. [Google Scholar] [CrossRef]
Hawkins, E.; Sutton, R. The potential to narrow uncertainty in regional climate predictions. Bull. Am. Meteorol. Soc. 2009, 90, 1095–1108. [Google Scholar] [CrossRef]
Hawkins, E.; Sutton, R. The potential to narrow uncertainty in projections of regional precipitation change. Clim. Dyn. 2011, 37, 407–418. [Google Scholar] [CrossRef]
Bhardwaj, T. Climate Justice Hangs in the Balance: Will AI Divide or Unite the Planet? Available online: https://www.downtoearth.org.in/climate-change/climate-justice-hangs-in-the-balance-will-ai-divide-or-unite-the-planet (accessed on 11 January 2026).
Jacob, D.; St. Clair, A.L.; Mahon, R.; Marsland, S.; Murisa, M.N.; Buontempo, C.; Pulwarty, R.S.; Siddiqui, M.R.; Grossi, A.; Steynor, A.; et al. Co-production of climate services: Challenges and enablers. Front. Clim. 2025, 7, 1507759. [Google Scholar] [CrossRef]
Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. CSUR 2021, 54, 115. [Google Scholar] [CrossRef]
Savannah Software Solutions. The Role of AI in Climate Modeling: Exploring How Artificial Intelligence Is Improving Predictions and Responses to Climate Change. Available online: https://savannahsoftwaresolutions.co.ke/the-role-of-ai-in-climate-modeling-exploring-how-artificial-intelligence-is-improving-predictions-and-responses-to-climate-change/ (accessed on 27 May 2025).
Sustainability-Directory. AI Bias in Equitable Climate Solutions. Available online: https://sustainability-directory.com/question/ai-bias-equitable-climate-solutions/ (accessed on 27 May 2025).
Amnuaylojaroen, T. Advancements and challenges of artificial intelligence in climate modeling for sustainable urban planning. Front. Artif. Intell. 2025, 8, 1517986. [Google Scholar] [CrossRef]
Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019; pp. 220–229. [Google Scholar]
Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé, H., III; Crawford, K. Datasheets for datasets. Commun. ACM 2021, 64, 86–92. [Google Scholar] [CrossRef]
American Bar Association. Climate Change and Responsible AI Affect Cybersecurity and Digital Privacy Conflicts. SciTech Lawyer 2025, 21. [Google Scholar]
World Meteorological Organization. 2024 State of Climate Services. 2024. Assesses Global Climate Services Capacity and Gaps. Available online: https://wmo.int/publication-series/2024-state-of-climate-services (accessed on 27 May 2025).
Mastrandrea, M.D.; Mach, K.J.; Plattner, G.K.; Edenhofer, O.; Stocker, T.F.; Field, C.B.; Ebi, K.L.; Matschoss, P.R. The IPCC AR5 guidance note on consistent treatment of uncertainties: A common approach across the working groups. Clim. Change 2011, 108, 675. [Google Scholar] [CrossRef]
UNDP Climate Promise. What Are Climate Misinformation and Disinformation and How Can We Tackle Them? Available online: https://climatepromise.undp.org/news-and-stories/what-are-climate-misinformation-and-disinformation-and-how-can-we-tackle-them (accessed on 27 May 2025).
ISO/IEC 42001:2023; Artificial Intelligence—Management System. ISO: Geneva, Switzerland; IEC: Geneva, Switzerland, 2023.
ISO/IEC 23894:2023; Information Technology—Artificial Intelligence—Guidance on Risk Management. ISO: Geneva, Switzerland; IEC: Geneva, Switzerland, 2023.
Golding, N.; Lambkin, K.; Wilson, L.; De Troch, R.; Fischer, A.M.; Hygen, H.O.; Hama, A.M.; Dyrrdal, A.V.; Jamsin, E.; Termonia, P.; et al. Developing national frameworks for climate services: Experiences, challenges and learnings from across Europe. Clim. Serv. 2025, 37, 100530.
World Meteorological Organization (WMO). National Framework for Climate Services (NFCS) Factsheet; World Meteorological Organization (WMO): Geneva, Switzerland, 2018.
World Meteorological Organization (WMO). WMO–HMEI Code of Ethics Guiding Public–Private Engagement; World Meteorological Organization (WMO): Geneva, Switzerland, 2024.
EY. AI and Sustainability: Opportunities, Challenges and Impact. Available online: https://www.ey.com/en_nl/insights/climate-change-sustainability-services/ai-and-sustainability-opportunities-challenges-and-impact (accessed on 27 May 2025).
Giorgi, F.; Jones, C.; Asrar, G.R. Addressing climate information needs at the regional level: The CORDEX framework. WMO Bull. 2009, 58, 175–183.
Eyring, V.; Bony, S.; Meehl, G.A.; Senior, C.A.; Stevens, B.; Stouffer, R.J.; Taylor, K.E. Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev. 2016, 9, 1937–1958.
WeAdapt. Justice and Equity in Climate Change Adaptation: Overview of an Emerging Agenda. Available online: https://weadapt.org/knowledge-base/gender-and-social-equality/justice-and-equity-in-climate-change-adaptation-overview-of-an-emerging-agenda/ (accessed on 27 May 2025).

Figure 1. Layered roadmap connecting model families to downscaling challenges and the “performance paradox/trust deficit.” The inner ring shows the shortcomings shared by all current models (left, top, and bottom arms). The inner square shows each family’s core strength; the outer square lists trade-offs, with green boxes marking the strengths and pink boxes the weaknesses of each family. Dashed arcs mark hybrids (PIML/constraints; two-stage CNN → Transformer). Classical/MOS forms the foundation; Physical Consistency is a cross-cutting goal. The layout does not imply that a model family is skillful in only one direction. Key references by category: CNN/U-Net: Vandal et al. [13], Baño-Medina et al. [11], Quesada-Chacón et al. [8], Wang et al. [21], Quesada-Chacón et al. [22]; Transformers: Curran et al. [15], Pathak et al. [23], Schmude et al. [16], Kumar et al. [24]; Ongoing Efforts (Physical Consistency): Beucler et al. [25], Raissi et al. [26], Harder et al. [27], Lopez-Gomez et al. [9]; GANs/Diffusion Models: Leinonen et al. [28], Price and Rasp [29], Rampal et al. [30], Tomasi et al. [31], Srivastava et al. [32], Liu et al. [33].

Figure 2. A comparative overview of model families for climate downscaling tasks.

Figure 5. Mapping robust validation techniques to the challenges they address, particularly for non-stationary climate data.

Figure 6. Distributional consistency and transferability assessment adapted from Sha et al. [101]. This set of two-dimensional parity plots (density histograms) compares the downscaled output (x-axis) against ground truth (y-axis). (a) Performance in the training domain shows tight alignment with the 1:1 identity line. (b) Performance in the transferring domain (unseen geographic region) demonstrates the model’s ability to generalize, a critical test for validating deep learning models against covariate shift. The density color scale reveals where the model captures the bulk of the distribution versus where it diverges for extreme values. © American Meteorological Society. Used with permission.

Figure 7. Spatial fidelity and structural error analysis adapted from Sha et al. [101]. This multi-panel comparison highlights the “texture gap” between methods. (b,c) The BCSD baseline produces smoother fields with larger spatial errors (blue/red regions in difference maps) in complex terrain. (e–h) The Deep Learning approaches (UNet and Nest-UNet) recover sharper, physically plausible precipitation gradients that better match the high-resolution PRISM ground truth (a), demonstrating superior topological realism over traditional interpolation methods. © American Meteorological Society. Used with permission.

Figure 8. Performance disparity between Support Vector Regression (PCASVR) and Neural Networks (AE) adapted from Vandal et al. [1]. (a) Boxplots of daily RMSE showing that PCASVR yields significantly higher error margins (∼11 mm/day) compared to Autoencoders (AE) and Elastic-Net (∼7 mm/day). (b) Spatial bias maps revealing that PCASVR produces incoherent, noisy pixel-wise error patterns (bottom right of figure), whereas AE and BCSD preserve smoother, physically consistent spatial structures.

Table 2. Comparative analysis of dominant machine learning architectures for climate downscaling. Note: Typical resolutions and UQ descriptions are illustrative; see key references for details.

Architecture	Key Mechanisms/Characteristics	Strengths in Downscaling	Limitations/Weaknesses	UQ Capabilities/Robustness to Non-Stat. and Extremes	Typical Climate Variables	Typical Input Res.	Typical Output Res.	Key Refs
SVM (Support Vector Machines)	Kernel-based supervised learning; finds optimal hyperplane in transformed feature space; can use nonlinear kernels for complex relationships.	Performs well with limited data; robust to high-dimensional predictor spaces; strong baseline for PP downscaling.	Choice of kernel and hyperparameters critical; may underperform on highly non-stationary or extreme events; less scalable to massive training datasets.	UQ typically via bootstrapping or ensembles; deterministic by default; robustness to non-stationarity depends on training sample diversity.	Precip, Temp	GCM scale (e.g., 50–250 km)	Station/grid scale	[34,64]
Random Forests (RF, AP-RF, Prec-DWARF)	Ensemble of decision trees trained on bootstrap samples; output is mean/majority vote; AP-RF extends with predictive distribution outputs.	Handles nonlinear predictor–predictand relationships; naturally ranks predictor importance; AP-RF produces stochastic samples.	May smooth fine-scale details; bias in extremes without specialized treatment; interpretability less direct than single trees.	Yes for AP-RF (predictive distribution via gamma parameters); deterministic for standard RF; moderate robustness to non-stationarity if trained on diverse climates.	Precip	0.25–1°	0.125°/site-level	[35,112]
CNN (SRCNN, U-Net, ResNet)	Convolutional layers, pooling, shared weights. U-Net: encoder–decoder w/skip connections. ResNet: residual blocks.	Spatial feature extraction, pattern recognition; U-Nets preserve fine details; ResNets enable deeper learning.	Overfitting; extrapolation issues; can be overly smooth under MSE loss; plain CNNs struggle with depth.	UQ via ensembles; robustness to non-stationarity often limited without targeted strategies (e.g., PGW training). Standard CNNs may smooth extremes unless using specialized losses or architectures.	Temp, Precip, Wind, Solar Rad.	25–250 km	1–25 km	[10,13,37,38,40,113]
GAN (CGAN, MSG-GAN, evtGAN, Sup3rCC)	Generator and Discriminator trained adversarially. Conditional GANs (CGANs) use input conditions. Sup3rCC uses GANs to learn and inject spatio-temporal features from historical high-res data into coarse GCM outputs for renewable energy resource variables.	Perceptually realistic outputs, sharp details, better extreme event statistics, spatial variability. Sup3rCC provides high-resolution (4 km hourly) realistic data for wind, solar, temp, humidity, pressure, tailored for energy system analysis and computationally efficient compared to dynamical downscaling.	Training instability (mode collapse), difficult evaluation, potential artifacts, may not capture the full statistical distribution. Sup3rCC does not represent specific historical weather events, but historical/future climate conditions, and does not reduce GCM uncertainty.	UQ via ensembles, but it can be challenging to calibrate. Potential for better extreme event generation. Robustness to Non-Stationarity is an active research area; can learn spurious correlations if not carefully designed/trained. Sup3rCC aims for physically realistic outputs by learning from historical data.	Temp, Precip, Wind, Solar Rad. Sup3rCC specialized for renewable energy variables (wind, solar, temp, humidity, pressure).	GCM scale (e.g., 25–100 km)	1–12 km. Sup3rCC: 4 km hourly.	[28,47,48,53,88,114,115]
LSTM/ConvLSTM	Recurrent memory cells (LSTM); ConvLSTM embeds convolutions into gates.	Captures long-range temporal dependencies; suitable for sequence modeling; CNN–LSTM hybrids.	High complexity; ConvLSTM outperforms pure LSTM on spatial data; very long-range spatial dependencies can be limited.	UQ via ensembles or Bayesian RNNs; can model temporal non-stationarity if reflected in training data but may struggle with unseen future shifts and rare extremes without augmentation.	Precip, Runoff, other time-evolving vars.	Gridded time series	Gridded time series	[106,116,117,118]
Transformer (ViT, PrecipFormer, etc.)	Self-attention for global context; captures long-range spatio-temporal interactions.	Excellent at modeling long-range dependencies; strong transfer potential, especially in hybrid architectures.	Quadratic attention cost (being mitigated by sparse/linearized variants); relatively new in downscaling; large data requirements.	UQ via attention-weighted ensembles; promising for non-stationarity when pre-trained on diverse climates; attention can focus on localized antecedent signatures of extremes, aiding detection though not guaranteeing tail magnitude accuracy.	Temp, Precip, Wind, multiple vars.	Various (e.g., 50 km, 250 km)	Various (e.g., 0.9 km, 7 km, 25 km)	[15,16,24,57,108,119]
Diffusion Model (LDM, STVD)	Iterative denoising process; LDMs operate in latent space.	High-quality, diverse samples; stable training; explicit probabilistic outputs; good spatial detail.	Computationally intensive (though LDMs mitigate cost); relatively nascent for downscaling; slow sampling.	Excellent UQ via learned distributions and ensemble generation; promising for capturing tail behavior and fine-grained spatial detail of extremes; robustness to non-stationarity is an active research area, but shows potential when trained on diverse climate data.	Temp, Precip, Wind	100–250 km	2–10 km	[20,31,32,33,54]
Multi-task Foundation Models (e.g., Prithvi-WxC, FourCastNet, ORBIT-2)	Large pre-trained (often Transformer-based) models fine-tuned for downscaling.	Zero/few-shot potential; multi-variable support; leverage extensive pre-training.	Very high pre-training cost; uncertain generalization to new locales/tasks without adaptation; bias propagation risks.	UQ via large-ensemble sampling; pre-training on diverse climates can enhance robustness to non-stationarity and extremes, but careful domain adaptation is essential.	Multiple vars	Coarse GCM/Reanalysis	Fine (task-dependent)	[23,60]

Table 3. Recommended diagnostics for data-handling issues in ML-based climate downscaling.

Issue	How to Diagnose	What to Report	Common Mitigations
Collinearity	Pairwise $|r|$ matrix; VIF; condition indices [140,141]	List flagged predictor groups; VIF summary; grouped-ablation deltas	Group-wise ablation; regularization; PCA/PLS; domain-driven pruning
Suppressors/confounding	Marginal vs. partial association (“sign-flip”); hierarchical ablations [144]	Predictors with sign flips; whether gains persist on held-out/OOD splits	Group-wise modeling; constrain feature sets; regime-specific evaluation
Seed variance/instability	Repeat training for S seeds; examine score distributions [142]	Mean ± std (or CI) across seeds; rank stability	Deterministic settings; longer training; ensembles; robust selection criteria
Ablation interpretability	Drop-one and drop-group; permutation importance stability	$\Delta$ skill per feature/group across regimes	Feature grouping; consistent reporting across seasons/regions
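
The collinearity diagnostics listed in Table 3 can be sketched with NumPy alone. The snippet below is a minimal illustration, not the procedure of any cited study: the predictor names (`t850`, `z500`, `q700`) and the synthetic data are assumptions made for the example, and commonly quoted flagging levels such as VIF > 10 are rules of thumb rather than values taken from this review.

```python
import numpy as np

def abs_correlation(X):
    """Pairwise |r| between the predictor columns of X (n_samples, n_features)."""
    return np.abs(np.corrcoef(X, rowvar=False))

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = np.sum((y - A @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / max(1.0 - r2, 1e-12)        # guard against division by zero
    return vifs

# Synthetic predictors (hypothetical names); z500 is built to track t850 closely.
rng = np.random.default_rng(0)
n = 500
t850 = rng.normal(size=n)
z500 = 0.95 * t850 + 0.05 * rng.normal(size=n)
q700 = rng.normal(size=n)
X = np.column_stack([t850, z500, q700])

abs_r = abs_correlation(X)
vif = variance_inflation_factors(X)
print(np.round(abs_r, 3))
print(np.round(vif, 1))
```

Reporting both quantities is useful because pairwise |r| only detects two-variable redundancy, whereas VIF also flags a predictor that is a near-linear combination of several others.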

Table 5. Conceptual boundaries among closely related grand challenges.

Challenge	What It Uniquely Targets	Primary Evaluation Axis/Stress Test
Non-stationarity	Temporal drift and regime changes that break historical mappings	Out-of-time validation (train on earlier decades, test on later decades or “warm” periods); explicit drift diagnostics
Transferability	Cross-domain generalization beyond the training domain (GCM, scenario, region, data source)	Leave-one-domain-out tests (e.g., leave-one-GCM-out, leave-one-region-out), and combined spatial+temporal OOD tests
Causal/mechanism-aware ML	Learning stable, physically meaningful relations (invariants) rather than spurious correlations	Mechanism-based sanity checks, robustness under interventions, and improved performance under non-stationarity and cross-domain transfer

Table 6. Illustrative measurable objectives for evaluating progress on the grand challenges in ML-based climate downscaling. Thresholds are shown as examples and should be adapted to region/variable.

Grand Challenge	Measurable Objective (Examples)	Recommended Reporting
Non-stationarity/OOD robustness	Limit performance degradation under out-of-time tests (“warm” periods) and explicit domain shift	Report IID vs. OOD skill side-by-side (e.g., ratio ${RMSE}_{OOD} / {RMSE}_{IID}$ ), plus drift diagnostics and failure cases
Transferability (cross-domain)	Demonstrate stability across domains (e.g., leave-one-GCM-out, leave-one-region-out)	Report cross-domain mean ± std; identify worst-case domain; include at least one strict domain-holdout stress test
Physical consistency	Quantify physical violations and enforce verifiable constraints	Report conservation/constraint diagnostics (e.g., water-budget error, non-negativity, physically plausible ranges); include “physics scorecards” alongside accuracy
Uncertainty quantification (UQ)	Provide calibrated predictive distributions, not only point estimates	Report empirical coverage of prediction intervals (e.g., 90%/95%); reliability diagrams; proper scores such as CRPS and Brier score [87,153,154]
Extremes	Reduce bias in tail behavior and rare-event risk metrics	Report tail metrics (e.g., P95/P99 quantile error), exceedance skill (e.g., CSI/FSS for threshold events), and return-level bias from EVT fits
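
Several of the reporting items in Table 6 reduce to short computations. The following is a hedged sketch under stated assumptions: the function names are illustrative, the CRPS uses the standard ensemble (empirical) estimator E|X − y| − ½E|X − X′|, and the synthetic numbers at the end are invented purely to show the calls.

```python
import numpy as np

def rmse(pred, obs):
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def ood_ratio(pred_iid, obs_iid, pred_ood, obs_ood):
    """RMSE_OOD / RMSE_IID; values well above 1 indicate degradation under shift."""
    return rmse(pred_ood, obs_ood) / rmse(pred_iid, obs_iid)

def interval_coverage(lower, upper, obs):
    """Fraction of observations falling inside [lower, upper]."""
    lower, upper, obs = (np.asarray(a, float) for a in (lower, upper, obs))
    return float(np.mean((obs >= lower) & (obs <= upper)))

def crps_ensemble(members, obs):
    """Mean CRPS of an ensemble. members: (n_members, n_points); obs: (n_points,)."""
    members, obs = np.asarray(members, float), np.asarray(obs, float)
    term1 = np.mean(np.abs(members - obs), axis=0)
    pair_diff = np.abs(members[:, None, :] - members[None, :, :])
    term2 = 0.5 * np.mean(pair_diff, axis=(0, 1))
    return float(np.mean(term1 - term2))

def quantile_error(pred, obs, q):
    """Difference between predicted and observed q-quantile (e.g., q=0.99 for P99)."""
    return float(np.quantile(pred, q) - np.quantile(obs, q))

# Illustrative call on synthetic data (values are not from any study).
rng = np.random.default_rng(1)
obs = rng.gamma(shape=0.8, scale=5.0, size=1000)        # precipitation-like target
ens = obs + rng.normal(scale=2.0, size=(20, obs.size))  # 20-member toy ensemble
lo, hi = np.quantile(ens, [0.05, 0.95], axis=0)         # nominal 90% interval
print(interval_coverage(lo, hi, obs), crps_ensemble(ens, obs))
print(quantile_error(ens.mean(axis=0), obs, 0.99))
```

Comparing the empirical coverage with the nominal level (here 90%) gives a quick calibration check: coverage far below nominal signals overconfident intervals.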

Table 7. Ethical architecture for ML-based downscaling: mapping technical failure modes to societal risks, normative obligations, and actionable controls.

Technical Failure Mode	Societal Risk Pathway	Normative Obligation	Actionable Controls (Examples)
Non-stationarity/domain shift (Section 9.1)	Overconfident use of invalid projections; maladaptation when future regimes differ from training	Reliability under shift; precaution in deployment	OOD stress tests (cross-GCM/PGW/time splits); shift-aware reporting; monitoring for drift
Extreme-event bias (Section 9.3)	Under-/over-estimation of high-impact hazards; inequitable disaster preparedness	Safety for high-consequence tails	Tail-focused evaluation (event metrics + tail magnitude); physics checks during extremes
Uncalibrated or missing UQ (Section 9.4)	False confidence; inability to manage risk or compare options	Honest uncertainty communication; calibrated decision support	Probabilistic verification; calibration diagnostics; communicate limits using calibrated language
Sparse/biased observations and capacity gaps (Section 6.2)	Systematic degradation in data-poor regions; inequitable climate services	Equity and inclusion	Dataset documentation; stratified evaluation by region/vulnerability; co-production with users
Weak reproducibility/benchmarking (Section 9.5)	Unverifiable claims; erosion of trust and accountability	Auditability and accountability	Model/data cards; versioned pipelines; shared benchmarks and minimal validation standards

Table 8. Auditable minimum standards for ethical ML downscaling in climate services.

Domain	Minimum Auditable Checks (Report as Numbers/Artifacts)
Fairness/equity	Stratified MAE/RMSE and bias; worst-group performance; disparity gaps; tail metrics stratified by regime (e.g., CSI/FSS at P95/P99).
Transparency	Model card + dataset datasheet; intended/non-intended use; training/eval split policy; versioning and changelog [170,171].
Uncertainty	Calibration/coverage at multiple levels; proper scoring (e.g., CRPS); tail reliability; uncertainty communication protocol [87,174].
Governance	Named accountable roles; independent evaluation gate; drift monitoring triggers; incident-response procedure; documentation aligned to AI RMF/management-system standards [156,176,177].
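
The fairness/equity row of Table 8 asks for stratified errors, worst-group performance, and disparity gaps. A minimal sketch of those three numbers is shown below; the group labels and toy values are hypothetical, and defining the disparity gap as worst-group minus best-group MAE is an assumption made for illustration, not a definition given in this review.

```python
import numpy as np

def stratified_mae(pred, obs, groups):
    """MAE computed separately for each group label (e.g., region or station class)."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    groups = np.asarray(groups)
    return {
        str(g): float(np.mean(np.abs(pred[groups == g] - obs[groups == g])))
        for g in np.unique(groups)
    }

def worst_group_and_gap(scores):
    """Return the label with the largest error and the worst-minus-best gap."""
    worst = max(scores, key=scores.get)
    gap = scores[worst] - min(scores.values())
    return worst, gap

# Toy example with two hypothetical strata.
pred = [1.0, 2.0, 3.0, 4.0]
obs = [1.0, 1.0, 1.0, 1.0]
groups = ["data_rich", "data_rich", "data_poor", "data_poor"]
scores = stratified_mae(pred, obs, groups)
worst, gap = worst_group_and_gap(scores)
print(scores, worst, gap)
```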

Table 9. Minimum diagnostic and validation standards for operational ML downscaling (roadmap-level checklist).

Component	Minimum Requirements (Must Be Reported)
Generalization	In-domain test and at least one explicit OOD test (cross-GCM, cross-region, or scenario/PGW warm-test when applicable); report performance deltas (in vs. OOD).
Initialization sensitivity	Multi-seed evaluation with mean ± std (or confidence intervals) for core metrics; report whether conclusions hold across seeds.
Extremes	Threshold-based scores (e.g., CSI/FSS at P95/P99) and tail magnitude errors (e.g., conditional MAE above P95) stratified by key regimes.
Physical consistency	At least one physics/budget diagnostic relevant to the variable (e.g., conservation or closure proxy), reported alongside skill metrics.
Uncertainty	If probabilistic, calibration/coverage and proper scoring (e.g., CRPS) and a short uncertainty-communication protocol [87,174].
Transparency and governance	Release documentation (model card + datasheet), versioning/changelog, and named accountable roles aligned with risk/governance frameworks [156,170,171].
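
The extremes row of Table 9 names two quantities that are straightforward to compute: a threshold-based score (CSI) and a tail magnitude error (conditional MAE above P95). The sketch below assumes that the event threshold is the 95th percentile of the observations and that events are defined as values at or above it; both are choices made for illustration, and the synthetic data are invented.

```python
import numpy as np

def csi(pred, obs, threshold):
    """Critical Success Index: hits / (hits + misses + false alarms)."""
    pred_event = np.asarray(pred, float) >= threshold
    obs_event = np.asarray(obs, float) >= threshold
    hits = np.sum(pred_event & obs_event)
    misses = np.sum(~pred_event & obs_event)
    false_alarms = np.sum(pred_event & ~obs_event)
    denom = hits + misses + false_alarms
    return float(hits / denom) if denom > 0 else float("nan")

def conditional_mae(pred, obs, threshold):
    """MAE restricted to points where the observation is at or above the threshold."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    mask = obs >= threshold
    if not mask.any():
        return float("nan")
    return float(np.mean(np.abs(pred[mask] - obs[mask])))

# Illustrative use on synthetic precipitation-like data.
rng = np.random.default_rng(2)
obs = rng.gamma(shape=0.8, scale=5.0, size=5000)
pred = 0.8 * obs + rng.normal(scale=1.0, size=obs.size)  # toy model that damps extremes
p95 = np.quantile(obs, 0.95)
print(csi(pred, obs, p95), conditional_mae(pred, obs, p95))
```

Reporting the two together matters because a model can detect that an extreme occurs (reasonable CSI) while still underestimating its magnitude (large conditional MAE).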

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

