Deep Climate Model Distillation for Localized Flood Forecasting in Low-Resource Areas

Olaniyan, Julius; Olaniyan, Deborah; Obagbuwa, Ibidun C.; Ngafeeson, Madison N.

doi:10.3390/meteorology5020016

¹ Center for Applied Data Science (CADS), Faculty of Natural and Applied Sciences, Sol Plaatje University, Kimberley 8301, South Africa

² Department of Computer Science and Informatics, Faculty of Natural and Agricultural Sciences, University of the Free State, Qwaqwa Campus, Bloemfontein 9301, South Africa

³ Department of Mathematical Science and Computing, Walter Sisulu University, Mthatha 5117, South Africa

⁴ Rinker School of Business, Palm Beach Atlantic University, 901 S Flagler Drive, West Palm Beach, FL 33416, USA

* Author to whom correspondence should be addressed.

Meteorology 2026, 5(2), 16; https://doi.org/10.3390/meteorology5020016 (registering DOI)

Submission received: 18 May 2026 / Revised: 12 June 2026 / Accepted: 16 June 2026 / Published: 19 June 2026

(This article belongs to the Special Issue Early Career Scientists’ (ECS) Contributions to Meteorology (2026))

Abstract

Floods remain among the most devastating natural disasters globally, disproportionately impacting low-resource regions where real-time flood forecasting is constrained by limited computational infrastructure and the scarcity of fine-resolution predictive models. Although state-of-the-art global climate models achieve high predictive accuracy, their scale and computational complexity restrict their applicability in localized and resource-constrained settings. This study proposes a deep climate model distillation framework that transfers knowledge from a high-capacity Fourier Neural Operator (FNO)-based global climate model inspired by FourCastNet into lightweight, regionally adaptive student networks suitable for edge deployment. The framework combines climate variables, satellite observations, and hydrological measurements to improve localized flood prediction. Knowledge transfer is achieved through a multi-objective distillation strategy that combines supervised learning, soft-target alignment, and intermediate feature matching. Experimental evaluation across multiple flood-prone regions in Sub-Saharan Africa and South Asia shows that the distilled student model achieves an average classification accuracy of 0.89, an AUC of 0.91, and an F1-score of 0.88, retaining approximately 96.7% of the teacher model’s predictive performance. In continuous discharge estimation, the model attains a mean absolute error of 0.17, RMSE of 0.24, and an R² score of 0.85. The proposed distillation approach yields an 8× reduction in inference latency and over a 20× reduction in model size, enabling real-time execution on low-power edge devices such as the Raspberry Pi 4 and NVIDIA Jetson Nano. The student model further demonstrates robust regional and temporal generalization, with limited performance degradation in unseen geographic areas and during extreme flood years.

Keywords: flood forecasting; knowledge distillation; low-resource regions; climate modeling; spatiotemporal attention

1. Introduction

Floods are among the most devastating natural hazards globally, displacing millions, damaging infrastructure, and threatening food and water security [1]. Their impact is especially severe in low-resource regions, where limited access to early warning systems, resilient infrastructure, and emergency response services exacerbates vulnerability [2]. With climate change driving more frequent and extreme weather events, the demand for reliable, localized flood forecasting has become increasingly urgent [3].

While large-scale climate models, such as those from the Coupled Model Intercomparison Project (CMIP), have advanced our understanding of global hydrological trends, translating these forecasts into actionable local predictions remains challenging [4]. High-resolution climate simulations are computationally intensive, requiring supercomputing resources and dense observational datasets that are often unavailable in low-income or remote regions [5]. Moreover, these models typically operate at coarse spatial and temporal resolutions, limiting their ability to capture localized rainfall-runoff processes essential for timely flood warnings [6].

Deep learning has offered promising avenues to bridge this gap. Conventional flood prediction models leverage Convolutional Long Short-Term Memory (ConvLSTM) networks, graph-based hydrological modeling, or attention mechanisms to improve local forecasts [7], but they remain constrained by computational cost, sparse data, and limited generalizability across regions. Separately, knowledge distillation has emerged as a method to compress complex teacher models into smaller, efficient student models while preserving predictive performance [8]. Recent work in climate model distillation has focused on reducing model size or inference time [9,10], yet these approaches rarely integrate multiple objectives such as feature, output, and attention alignment, nor do they explicitly handle low-resource, data-sparse environments.

Despite advances in deep flood forecasting and climate model compression, no existing framework simultaneously addresses the following challenges: (i) localized flood prediction under data scarcity, (ii) multi-modal integration of satellite and limited hydrological observations, (iii) efficient student models that retain interpretability, and (iv) attention mechanisms that capture spatiotemporal dependencies and highlight the most critical regions and time periods.

To address these gaps, we propose a Deep Climate Model Distillation framework for Localized Flood Forecasting in Low-Resource Areas. Our technical contributions are as follows:

Multi-objective knowledge distillation: We design a distillation strategy that aligns outputs, intermediate features, and attention maps between teacher and student models to maximize predictive fidelity.
ConvLSTM–Transformer hybrid student architecture: The student model combines ConvLSTM layers for spatiotemporal feature extraction with a Transformer module to capture long-range dependencies in flood-relevant signals.
Multi-modal data fusion: Satellite imagery and limited ground-based hydrological measurements are integrated to enhance model accuracy in data-sparse regions.
Attention-guided interpretability: Spatial and temporal attention mechanisms highlight critical areas and periods influencing flood risk, improving model explainability for end-users and decision-makers.

The remainder of this paper is organized as follows. Section 2 reviews prior work in climate model distillation and deep flood forecasting. Section 3 details the proposed methodology, including architecture design, multi-objective loss formulation, and training strategies. Section 4 presents experimental results across diverse low-resource case studies. Section 5 discusses implications, limitations, and future research directions.

2. Literature Review

2.1. Deep Learning for Climate and Flood-Related Applications

Deep learning has increasingly become central to addressing climate-related challenges, including flood forecasting, resource management, and disaster risk reduction. Sharma and Kaur [11] provide a comprehensive overview of how deep learning techniques contribute to sustainable development across interconnected domains such as climate systems, agriculture, energy, and urban infrastructure. Their work highlights the potential of data-driven models to capture nonlinear climate dynamics but also underscores persistent barriers related to computational cost, interpretability, and deployment in resource-constrained regions.

Several studies have focused specifically on flood prediction using deep learning architectures. Karapetyan et al. [12] propose a vision-based deep framework for coastal flood prediction under climate change, leveraging high-resolution imagery to model shoreline adaptations and inundation risk. While effective for coastal environments, such vision-heavy approaches are typically computationally expensive and rely on dense observational data, limiting applicability in inland and low-resource regions.

More recently, Kow et al. [13] explored hybrid Transformer–LSTM architectures for flood forecasting and water level prediction. These models demonstrated improved long-range dependency modeling and enhanced interpretability through attention mechanisms. However, such approaches assume access to reliable hydrological sensor networks and centralized compute infrastructure, conditions rarely met in developing regions.

2.2. Lightweight and Edge-Oriented Flood Modeling

To address deployment constraints, research has increasingly shifted toward lightweight and edge-compatible models. FloodNet-Lite, proposed by Thirugnanasammandamoorthi et al. [14], introduces an optimized U-Net variant for flood mapping from remote sensing imagery, explicitly targeting edge deployment in next-generation (6G-enabled) environments. Their results demonstrate that architectural optimization can significantly reduce model size without major accuracy loss.

Similarly, Li et al. [15] propose an AI-driven flood detection and prediction system with visual energy optimization for consumer electronics. Their work emphasizes power efficiency and real-time inference but primarily focuses on visual detection rather than integrated hydrological forecasting.

While these studies advance lightweight modeling, they typically operate independently of large-scale climate models, missing the opportunity to transfer knowledge from high-capacity global systems to local predictors.

2.3. Deep Climate Modeling and Knowledge Distillation

Large-scale climate models and deep neural networks trained on reanalysis data offer strong predictive performance but are prohibitively expensive for localized deployment. Xiang and Fujii [16] address this challenge with DARE (domain-adapted BERT distillation and reinforcement ensemble), a distill-and-reinforce framework that compresses ensemble neural networks for climate-domain processing. Their work demonstrates that distillation can preserve predictive skill while improving efficiency, particularly for climate-related time-series tasks.

More broadly, knowledge distillation has evolved from simple logit matching to more sophisticated feature-level and representation transfer mechanisms. Yang et al. [17] show that aligning intermediate representations significantly improves knowledge transfer in large language models, suggesting that distillation effectiveness depends heavily on how internal representations, not just outputs, are transferred. These findings are highly relevant to climate modeling, where spatial and temporal feature hierarchies encode critical physical information.

2.4. Toward Interpretable, Resource-Efficient Climate Intelligence

Interpretability and geospatial awareness are increasingly recognized as essential for operational climate intelligence. Chithra [18] emphasizes the integration of deep learning with geospatial analytics for climate-smart resource management, advocating for models that align predictions with physical geography and policy needs. Likewise, Li et al. [19] introduce interpretable Transformer–LSTM hybrids that provide attention-based explanations for flood dynamics, reinforcing the importance of transparency in decision-support systems.

At the methodological level, Zhou et al. [20] propose scalable Transformer architectures for high-dimensional multivariate time-series forecasting, addressing efficiency challenges through architectural scaling strategies. While not climate-specific, their framework offers insights into managing complexity in spatiotemporal prediction tasks.

Recent advances in hydrological forecasting increasingly emphasize interdisciplinary integration beyond traditional hydrodynamic modeling [21]. Modern flood prediction systems are no longer viewed solely as numerical forecasting tools but as operational decision-support infrastructures that must incorporate sustainability, deployment feasibility, infrastructure resilience, emergency planning, and resource optimization [22]. Studies in hydrological AI have shown that the practical value of flood forecasting depends not only on predictive accuracy but also on interpretability, computational efficiency, scalability, and usability within real-world disaster-management ecosystems. Consequently, interdisciplinary contributions from systems analytics, organizational decision sciences, and technology management have become increasingly relevant in climate-risk intelligence research [23]. This perspective aligns with emerging hydrological literature advocating integrated socio-technical approaches for translating advanced predictive models into deployable early-warning systems suitable for resource-constrained and climate-vulnerable regions. In this context, the proposed framework extends beyond hydrological prediction by incorporating lightweight deployment considerations, operational interpretability, and resource-aware climate intelligence for localized flood-risk management.

2.5. Comparative Analysis of Existing Approaches

To clarify the research gap, Table 1 contrasts representative approaches across key dimensions relevant to localized flood forecasting.

Collectively, prior work demonstrates progress in deep flood forecasting, lightweight modeling, interpretability, and climate model distillation. However, these research streams largely evolve in isolation. No existing framework simultaneously distills knowledge from high-capacity global climate models into lightweight, interpretable student models explicitly designed for localized flood forecasting under data and resource constraints.

This work addresses this gap by proposing a deep climate model distillation framework that transfers spatiotemporal knowledge from global climate models into regionally adapted, attention-driven student architectures. Unlike prior studies, the approach jointly emphasizes (i) global-to-local knowledge transfer, (ii) multi-modal data fusion, (iii) resource-efficient deployment, and (iv) interpretable flood prediction in data-scarce, high-risk regions.

3. Materials and Methods

This study developed a framework for distilling large, high-capacity climate models into smaller, regionally tailored models for flood forecasting in low-resource areas. The methodology has four main parts: building multimodal inputs, developing and fine-tuning a teacher model, designing a student model with geospatial conditioning, and implementing a knowledge distillation process guided by a multi-component loss function. The training and evaluation pipeline was built to support inference on edge platforms while maintaining high accuracy.

3.1. Dataset and Preprocessing

The primary atmospheric and hydrological inputs used in this study were derived from the ERA5 reanalysis dataset and selected outputs from CMIP6 historical climate simulations. ERA5, produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), Reading, UK, provides observation-constrained hourly atmospheric reanalysis fields and served as the principal source of dynamically consistent meteorological variables, including precipitation, temperature, soil moisture, surface runoff, and wind-related variables [24].

In contrast, CMIP6 data were not used as reanalysis products but rather as auxiliary climate-model simulations employed during the teacher-model pretraining stage to expose the network to broader climatological variability and long-term hydroclimatic patterns [25]. Specifically, historical CMIP6 experiments were utilized only for large-scale representation learning prior to regional fine-tuning on observation-aligned datasets. This distinction is important because CMIP6 simulations represent model-generated climate trajectories under prescribed forcings rather than observation-constrained atmospheric reconstructions. To avoid ambiguity, we explicitly distinguish between ERA5 (observationally constrained reanalysis) and CMIP6 (climate model simulations) throughout the study.

To maintain consistency across datasets, only variables common to both ERA5 and selected CMIP6 historical simulations were retained during pretraining. The final forecasting stages and downstream flood prediction tasks relied primarily on ERA5, satellite observations, and in situ hydrological measurements.

To capture high-frequency, short-duration rainfall events that often cause flash floods, the framework included precipitation estimates from the Global Precipitation Measurement (GPM) mission, providing data at 0.1° spatial and 30 min temporal resolution [26]. Ground-based measurements, including river discharge and rainfall from national hydrological services, were used when available to improve accuracy. Global repositories such as the Global Runoff Data Centre (GRDC) and the Global Precipitation Climatology Centre (GPCC) were used not only for validation but also to ensure traceability of hydrological event signals [27,28]. Each flood event incorporated into the dataset is explicitly linked to its source metadata, including event identifiers, geographic extents, basin associations, and time intervals derived from these observational records.

For geospatial modeling, static inputs such as elevation and land cover were included. Elevation data were obtained from the Shuttle Radar Topography Mission (SRTM) and the hydrologically conditioned HydroSHEDS dataset, both enabling accurate delineation of catchment structures and runoff pathways [29,30]. Land cover and surface characteristics were defined using the MODIS MCD12Q1 dataset, which provides global annual land cover classification at 500 m resolution according to the IGBP scheme [31].

A thorough data harmonization and preprocessing pipeline was established to align heterogeneous data sources from climate reanalysis, satellite observations, and in situ hydrological measurements. A key clarification is that spatial alignment to a common grid was performed at a nominal resolution of 0.1° to enable multimodal fusion; however, this does not imply that all datasets possess intrinsic or physically meaningful resolution at this scale.

Each dataset was first processed at its native spatial resolution before transformation to the unified grid. ERA5 reanalysis fields (native resolution approximately 0.25°) and CMIP6 climate model outputs (typically 1–2° resolution) were upsampled to 0.1° using bilinear interpolation for consistency. In contrast, high-resolution static geospatial layers such as MODIS land cover (500 m), SRTM elevation (30–90 m), and HydroSHEDS products (3–15 arc-seconds) were aggregated to 0.1° using mean aggregation for continuous variables and majority voting for categorical variables. These transformations are strictly for computational alignment and do not introduce additional physical information beyond the native resolution of each dataset.
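
The two aggregation rules described above can be illustrated with a minimal NumPy sketch. It assumes the fine grid nests exactly inside the coarse grid by an integer factor, which is a simplification (a 500 m grid does not nest exactly into 0.1° cells); the function names and the toy arrays are hypothetical and are not taken from the study's pipeline.

```python
import numpy as np

def aggregate_mean(fine, factor):
    """Average each factor x factor block of a continuous field (e.g. elevation)."""
    h, w = fine.shape
    blocks = fine.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def aggregate_majority(fine, factor):
    """Give each coarse cell the most frequent class in its block (e.g. land cover)."""
    h, w = fine.shape
    blocks = fine.reshape(h // factor, factor, w // factor, factor)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(h // factor, w // factor, -1)
    coarse = np.empty(blocks.shape[:2], dtype=fine.dtype)
    for i in range(coarse.shape[0]):
        for j in range(coarse.shape[1]):
            classes, counts = np.unique(blocks[i, j], return_counts=True)
            # Ties resolve to the lowest class id because np.unique sorts its output.
            coarse[i, j] = classes[np.argmax(counts)]
    return coarse

# Toy 4 x 4 inputs aggregated to 2 x 2 (illustrative values only).
elevation = np.arange(16, dtype=float).reshape(4, 4)
land_cover = np.array([[1, 1, 2, 2],
                       [1, 3, 2, 2],
                       [4, 4, 5, 6],
                       [4, 5, 5, 5]])
coarse_elevation = aggregate_mean(elevation, 2)
coarse_land_cover = aggregate_majority(land_cover, 2)
```

Majority voting keeps the output a valid class label, whereas averaging class identifiers would produce meaningless intermediate values.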

Continuous variables such as precipitation and soil moisture were interpolated using bilinear resampling, while categorical data such as land cover classes were processed using nearest-neighbor interpolation to preserve discrete class structure. All time-series variables were standardized to a uniform 3-hourly temporal resolution, and missing values resulting from data gaps or sensor inconsistencies were filled using linear interpolation with controlled gap handling to avoid physically unrealistic long-range interpolation. Continuous variables were normalized using z-score standardization to stabilize training.
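
A minimal sketch of gap-limited linear interpolation followed by z-score standardization is given below. The `max_gap` threshold and the choice to leave gaps at the series edges unfilled are assumptions made for illustration; the text does not state the gap limit that was actually used.

```python
import numpy as np

def fill_short_gaps(series, max_gap):
    """Linearly interpolate NaN runs of at most max_gap steps; keep longer runs as NaN."""
    x = np.asarray(series, dtype=float)
    missing = np.isnan(x)
    if missing.all() or not missing.any():
        return x.copy()
    idx = np.arange(x.size)
    filled = x.copy()
    filled[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    start = None
    for i in range(x.size + 1):
        if i < x.size and missing[i]:
            if start is None:
                start = i
        elif start is not None:
            too_long = (i - start) > max_gap
            at_edge = start == 0 or i == x.size  # no valid neighbour on one side
            if too_long or at_edge:
                filled[start:i] = np.nan
            start = None
    return filled

def zscore(x):
    """Standardize to zero mean and unit variance, ignoring remaining NaNs."""
    return (x - np.nanmean(x)) / np.nanstd(x)

# Toy 3-hourly series with one short gap and one long gap (illustrative values).
raw = [1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0]
filled = fill_short_gaps(raw, max_gap=2)
standardized = zscore(filled)
```

Here the single missing step is interpolated to 2.0, while the three-step gap exceeds `max_gap` and remains missing, so no value is invented across a long outage.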

The multimodal inputs were then combined into a unified tensor representation capturing both static and dynamic environmental variables across space and time. This representation enables the model to learn coupled interactions between precipitation forcing, soil moisture dynamics, topographic controls, and river discharge evolution, which are essential for early flood onset prediction.

For supervised learning, output targets were constructed using a harmonized spatiotemporal labeling pipeline derived from event-based flood records and hydrological observations. Flood occurrence labels were primarily sourced from the event archive of the Dartmouth Flood Observatory (DFO; University of Colorado Boulder, Boulder, CO, USA) and from complementary regional disaster databases. To ensure transparency and reproducibility, each flood event is explicitly linked to its metadata, including event identifiers, affected regions, basin boundaries, and timestamps derived from observational records. All discharge targets were standardized using basin-wise z-score normalization prior to training, and inverse transformation was not applied during evaluation to ensure consistency of comparative error analysis across basins.

Since these datasets are event-based rather than grid-based, a structured conversion pipeline was implemented to transform flood events into gridded spatiotemporal training labels. Each DFO event was represented as a spatial polygon with associated start and end timestamps, which were rasterized onto the model grid at 0.1° resolution to generate binary flood-affected masks. To account for uncertainty in reporting and hydrological delay, each event was assigned a temporal propagation window spanning its reported duration with additional buffer steps. In cases of spatial or temporal overlap between events, a union-based aggregation strategy was applied to prevent duplication and ensure consistent event representation.
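
The event-to-grid conversion can be sketched as follows. For brevity, this example replaces each event polygon with a rectangular block of grid cells and applies the buffer symmetrically before and after the reported duration; both choices, along with the function name and the toy events, are assumptions rather than details taken from the study.

```python
import numpy as np

def add_event(labels, rows, cols, t_start, t_end, buffer_steps):
    """Mark an event footprint as flooded over its buffered time window (in place)."""
    n_steps = labels.shape[0]
    t0 = max(0, t_start - buffer_steps)
    t1 = min(n_steps, t_end + buffer_steps + 1)
    # Writing 1 is idempotent, so overlapping events combine as a union.
    labels[t0:t1, rows, cols] = 1
    return labels

# Label tensor with shape (time steps, grid rows, grid columns).
labels = np.zeros((10, 4, 4), dtype=np.uint8)

# Two hypothetical events that overlap in both space and time.
add_event(labels, slice(0, 2), slice(0, 2), t_start=2, t_end=4, buffer_steps=1)
add_event(labels, slice(1, 3), slice(1, 3), t_start=4, t_end=6, buffer_steps=1)
```

At time step 4 both events are active; the shared cell is counted once, giving seven flooded cells rather than eight.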

The resulting grid-time labels therefore represent harmonized event footprints rather than direct pixel-level measurements. We explicitly acknowledge that this introduces uncertainty due to coarse event reporting resolution and ambiguity in mapping large-scale flood documentation to fine-resolution grid cells.

For flood severity estimation, depth categories (minor, moderate, severe) were constructed using regional hydrological reports, disaster impact assessments, and model-derived inundation proxies where direct measurements were unavailable. The severity bins were defined as: minor (0–0.3 m), moderate (0.3–1.0 m), and severe (>1.0 m). We acknowledge that in several regions, particularly in Sub-Saharan Africa, direct depth observations are sparse, and thus these labels should be interpreted as best-effort hydrological approximations rather than direct measurements.
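
The severity bins can be expressed as a small lookup function. The text gives the bin ranges but not how boundary values are assigned; this sketch assumes upper-inclusive bins (0.3 m is minor, 1.0 m is moderate), which is consistent with severe being defined as strictly greater than 1.0 m. The function name is hypothetical.

```python
def severity_class(depth_m):
    """Map an inundation depth in metres to a severity category for a flooded cell."""
    if depth_m < 0:
        raise ValueError("depth must be non-negative")
    if depth_m <= 0.3:   # assumption: upper bound included in the lower bin
        return "minor"
    if depth_m <= 1.0:
        return "moderate"
    return "severe"

categories = [severity_class(d) for d in (0.1, 0.3, 0.5, 1.0, 1.5)]
```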

Geographically, the study focused on flood-prone low-resource countries where forecasting capacity is limited and risk exposure is high. Case studies included Nigeria, Kenya, and Mozambique in Sub-Saharan Africa, and Bangladesh, India, and Nepal in South Asia, covering diverse hydrological regimes including monsoonal river flooding, flash floods, and seasonal basin overflow events. Each country was treated as a separate forecasting zone using geospatial masks derived from global hydrological datasets, enabling both within-region training and cross-region transferability analysis.

3.2. Input Representation and Data Fusion

Attention-related representations extracted from the teacher and student networks are explicitly distinguished according to their underlying architectural mechanisms. Since the Fourier Neural Operator (FNO) teacher does not natively compute transformer-style Query, Key, and Value (QKV) self-attention, the teacher-side interaction representations are constructed from spectral channel interaction responses derived from Fourier mixing operations.

For the student model, transformer attention matrices are defined as:

$$A_{S} = \{A_{S}^{(l)}\}, \quad A_{S}^{(l)} \in \mathbb{R}^{N \times N}$$

where $N = T \times H_{l} \times W_{l}$ is the total number of spatiotemporal tokens formed by flattening the temporal and spatial dimensions at layer $l$. Thus, the student attention mechanism operates over joint spatiotemporal representations rather than purely spatial tokens.

Specifically, each token corresponds to a feature vector associated with a particular temporal step and spatial grid location:

$$z_{t}(i, j) \in \mathbb{R}^{d}$$

where $t \in \{1, \dots, T\}$ denotes the temporal position, $(i, j)$ denotes the spatial coordinates, and $d$ is the latent embedding dimension.

The resulting transformer attention matrices therefore capture dependencies across both temporal evolution and spatial hydrological interactions.

For the teacher network, spectral interaction maps are extracted from Fourier-domain feature responses and represented as:

$$A_{T} = \{A_{T}^{(l)}\}, \quad A_{T}^{(l)} \in \mathbb{R}^{N_{s} \times N_{s}}$$

where $N_{s} = H_{l} \times W_{l}$ corresponds to the flattened spatial spectral representation at layer $l$. These interaction maps encode spatial correlation structures learned through spectral convolution and Fourier channel mixing.

Because the teacher and student representations operate in different domains (spectral spatial interactions vs. spatiotemporal transformer attention), a learned projection adapter is applied before alignment:

$$\tilde{A}_{T}^{(l)} = Q_{l}(A_{T}^{(l)})$$

where $Q_{l} : \mathbb{R}^{N_{s} \times N_{s}} \to \mathbb{R}^{N \times N}$ is a learned projection operator that maps teacher spectral interaction representations into the student spatiotemporal attention space.

The final attention alignment objective is therefore defined as:

$$L_{attn} = \sum_{l = 1}^{M} \lVert Q_{l}(A_{T}^{(l)}) - A_{S}^{(l)} \rVert_{2}^{2}$$

where $M$ denotes the number of aligned layer pairs, $Q_{l}(\cdot)$ is the learned projection adapter, and $\lVert \cdot \rVert_{2}^{2}$ denotes the squared Frobenius norm.

This formulation enables stable cross-architecture distillation between the spectral FNO teacher and the lightweight transformer-based student model while preserving both spatial hydrological interactions and temporal flood evolution patterns.
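
To make the alignment objective concrete, the sketch below evaluates the attention loss for a single layer pair in NumPy. The text does not specify the functional form of the projection adapter; here it is assumed to be a bilinear map $P A_{T} P^{\top}$ with a matrix $P \in \mathbb{R}^{N \times N_{s}}$, and the fixed tiling matrix stands in for weights that would be learned in practice. All names and sizes are illustrative.

```python
import numpy as np

def project_teacher_map(a_teacher, adapter):
    """Assumed adapter form: lift an (N_s, N_s) map to (N, N) via P A P^T."""
    return adapter @ a_teacher @ adapter.T

def attention_alignment_loss(teacher_maps, student_maps, adapters):
    """Sum over layer pairs of the squared Frobenius distance after projection."""
    loss = 0.0
    for a_t, a_s, p in zip(teacher_maps, student_maps, adapters):
        diff = project_teacher_map(a_t, p) - a_s
        loss += float(np.sum(diff ** 2))
    return loss

T, H, W = 3, 2, 2          # toy window length and spatial size
n_s = H * W                # teacher tokens: spatial only
n = T * n_s                # student tokens: time-major flattening of T x H x W

rng = np.random.default_rng(0)
a_teacher = rng.random((n_s, n_s))

# Stand-in adapter that copies the spatial map into every pair of time blocks.
adapter = np.tile(np.eye(n_s), (T, 1))   # shape (N, N_s)

a_student = project_teacher_map(a_teacher, adapter)
loss_matched = attention_alignment_loss([a_teacher], [a_student], [adapter])

a_shifted = a_student.copy()
a_shifted[0, 0] += 1.0
loss_shifted = attention_alignment_loss([a_teacher], [a_shifted], [adapter])
```

A student whose attention equals the projected teacher map incurs zero loss, and perturbing one entry by 1 raises the loss by 1, as expected for a squared Frobenius distance.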

3.3. Teacher Model Design with High-Resolution Deep Climate Architecture

The teacher model was instantiated using a Fourier Neural Operator (FNO)–based architecture inspired by FourCastNet, a Fourier-based deep learning framework developed by NVIDIA Research and collaborators, designed to capture long-range spatiotemporal dependencies in climate fields. The model consists of 12 spectral convolution layers, each operating in the frequency domain using a truncated Fourier representation. For each layer, 32 spectral modes were retained along both spatial dimensions, balancing expressive capacity with computational efficiency.
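
The core operation of one spectral convolution layer with truncated Fourier modes can be sketched in NumPy as below. This follows the general Fourier Neural Operator pattern (transform, mix channels on the retained low-frequency modes, transform back) and is not the authors' implementation; the toy grid, channel counts, and two retained modes are illustrative, whereas the teacher described above retains 32 modes with 256 channels.

```python
import numpy as np

def spectral_conv2d(x, weights_pos, weights_neg, modes):
    """Channel mixing on the lowest `modes` frequencies of a (C_in, H, W) field."""
    _, h, w = x.shape
    c_out = weights_pos.shape[1]
    x_hat = np.fft.rfft2(x)                               # (C_in, H, W // 2 + 1)
    out_hat = np.zeros((c_out, h, w // 2 + 1), dtype=complex)
    # Low positive and low negative frequencies along the first spatial axis.
    out_hat[:, :modes, :modes] = np.einsum(
        "ixy,ioxy->oxy", x_hat[:, :modes, :modes], weights_pos)
    out_hat[:, -modes:, :modes] = np.einsum(
        "ixy,ioxy->oxy", x_hat[:, -modes:, :modes], weights_neg)
    return np.fft.irfft2(out_hat, s=(h, w))               # back to (C_out, H, W)

modes = 2
unit = np.ones((1, 1, modes, modes), dtype=complex)

# A constant field lives entirely in the zero-frequency mode and passes through.
y_const = spectral_conv2d(np.full((1, 8, 8), 3.0), unit, unit, modes)

# A checkerboard lives at the highest frequency and is removed by truncation.
i, j = np.indices((8, 8))
checker = ((-1.0) ** (i + j))[None]
y_checker = spectral_conv2d(checker, unit, unit, modes)

# Random complex weights mapping 2 input channels to 3 output channels.
rng = np.random.default_rng(0)
w_pos = rng.normal(size=(2, 3, modes, modes)) + 1j * rng.normal(size=(2, 3, modes, modes))
w_neg = rng.normal(size=(2, 3, modes, modes)) + 1j * rng.normal(size=(2, 3, modes, modes))
y_multi = spectral_conv2d(rng.normal(size=(2, 8, 8)), w_pos, w_neg, modes)
```

Truncation is what bounds the parameter count per layer: the weights scale with the number of retained modes rather than with the grid size.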

The teacher operates at a spatial resolution of 0.1°, consistent with the harmonized input data, and processes multivariate climate fields over temporal windows of length $T$. Each spectral convolution layer is followed by pointwise nonlinear transformations and residual connections to stabilize training. The hidden channel width was fixed at 256 feature maps across layers.

The teacher model was first pretrained using ERA5 reanalysis fields together with selected CMIP6 historical climate simulations to learn large-scale atmospheric and hydrological representations across diverse climatic regimes. During this stage, CMIP6 outputs were used solely as auxiliary climate-model simulations to enrich climatological variability during representation learning, while ERA5 provided the primary observation-constrained atmospheric states used for downstream regional adaptation. Subsequently, it was partially fine-tuned on region-specific data by unfreezing the final four spectral layers while keeping earlier layers fixed. This strategy preserved global climatological knowledge while allowing adaptation to localized flood-generating processes such as regional rainfall–runoff relationships and terrain-driven flow accumulation.

The final teacher outputs, denoted $\hat{y}_{T}$, were produced per grid cell and per forecast horizon and served as soft targets for knowledge distillation into the student model. A complete schematic of the teacher–student knowledge distillation framework is presented in Figure 1.

Although the Fourier Neural Operator (FNO)-based teacher architecture was originally developed for large-scale atmospheric forecasting, in this work it is employed as a general spatiotemporal climate representation learner rather than a standalone hydrological simulator. Specifically, the teacher model first learns multivariate atmospheric and land-surface dynamics from ERA5 reanalysis fields and selected CMIP6 historical climate simulations, including precipitation evolution, soil moisture transport, runoff accumulation, temperature variability, and large-scale circulation patterns. These latent climate representations are subsequently adapted to flood forecasting through supervised regional fine-tuning using flood occurrence labels and discharge observations.

To enable hydrological prediction, the final latent representations extracted from the spectral backbone are passed through two task-specific prediction heads. The first head consists of a fully connected classification module with sigmoid activation for binary flood occurrence and flood severity estimation. The second head is a regression module composed of linear projection layers that estimate continuous discharge values for each forecast horizon. This design allows the atmospheric representations learned by the teacher model to be transformed into flood-relevant hydrological outputs while preserving large-scale spatiotemporal dependencies.
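
A minimal sketch of the two prediction heads is shown below, assuming one latent vector per grid cell and one output per forecast horizon. It covers only the binary occurrence output of the classification head; the severity output, the layer sizes, and all variable names are omitted or hypothetical because the text does not specify them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prediction_heads(latent, w_cls, b_cls, w_reg, b_reg):
    """Map latent features (cells, channels) to flood probability and discharge."""
    flood_prob = sigmoid(latent @ w_cls + b_cls)   # classification head
    discharge = latent @ w_reg + b_reg             # regression head (linear projection)
    return flood_prob, discharge

n_cells, n_channels, n_horizons = 6, 8, 3          # toy sizes
rng = np.random.default_rng(0)
latent = rng.normal(size=(n_cells, n_channels))

w_cls = rng.normal(size=(n_channels, n_horizons))
w_reg = rng.normal(size=(n_channels, n_horizons))
flood_prob, discharge = prediction_heads(
    latent, w_cls, np.zeros(n_horizons), w_reg, np.zeros(n_horizons))

# With zero weights the sigmoid head returns 0.5 everywhere.
neutral_prob, _ = prediction_heads(
    latent, np.zeros_like(w_cls), np.zeros(n_horizons), w_reg, np.zeros(n_horizons))
```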

Because the teacher and student architectures operate in different representational spaces, namely spectral representations in the FNO teacher and recurrent-attention representations in the ConvLSTM–Transformer student, a dimensional alignment mechanism was introduced during distillation. Specifically, intermediate teacher feature maps were projected through learned linear adapter layers before feature matching was applied. Let the projection operator be defined as:

$$P_{l} : \mathbb{R}^{C_{T} \times H_{l} \times W_{l}} \to \mathbb{R}^{C_{S} \times H_{l} \times W_{l}}$$

where $C_{T}$ denotes the number of teacher feature channels, $C_{S}$ denotes the number of student feature channels, and $H_{l}$ and $W_{l}$ represent the spatial dimensions at layer $l$.

The projection operator $P_{l}$ aligns teacher feature dimensions with the corresponding student representations at matched layers.

The feature distillation objective is therefore formulated as:

$$L_{feature} = \sum_{l = 1}^{M} \lVert P_{l}(f_{T}^{(l)}) - f_{S}^{(l)} \rVert_{2}^{2}$$

where $M$ denotes the number of matched teacher–student layer pairs used for distillation, $f_{T}^{(l)}$ represents the feature map extracted from the teacher model at layer $l$, $f_{S}^{(l)}$ represents the corresponding student feature representation, and $P_{l}(\cdot)$ is the learned projection adapter used for dimensional alignment.

Similarly, interaction representations extracted from the teacher network were reshaped and aligned with the student attention matrices prior to optimization. This projection-based alignment ensured stable knowledge transfer despite the architectural heterogeneity between the spectral FNO teacher and the lightweight recurrent-attention student model.
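
The feature-matching step can be sketched as follows, assuming the linear adapter acts as a per-pixel channel projection (equivalent to a 1 × 1 convolution) from $C_{T}$ to $C_{S}$ channels. The fixed selection matrix stands in for learned weights, and the toy sizes are illustrative only.

```python
import numpy as np

def project_channels(f_teacher, adapter):
    """Apply a (C_S, C_T) linear map at every pixel of a (C_T, H, W) feature map."""
    return np.einsum("sc,chw->shw", adapter, f_teacher)

def feature_distillation_loss(teacher_feats, student_feats, adapters):
    """Sum over matched layers of the squared L2 distance after projection."""
    loss = 0.0
    for f_t, f_s, p in zip(teacher_feats, student_feats, adapters):
        loss += float(np.sum((project_channels(f_t, p) - f_s) ** 2))
    return loss

c_t, c_s, h, w = 8, 4, 5, 5                      # toy channel counts and grid size
rng = np.random.default_rng(0)
f_teacher = rng.normal(size=(c_t, h, w))

# Stand-in adapter that keeps the first C_S teacher channels.
adapter = np.eye(c_s, c_t)

f_student = f_teacher[:c_s].copy()
loss_matched = feature_distillation_loss([f_teacher], [f_student], [adapter])
loss_zero_student = feature_distillation_loss(
    [f_teacher], [np.zeros((c_s, h, w))], [adapter])
```

Because the adapter changes only the channel dimension, the spatial dimensions $H_{l} \times W_{l}$ of teacher and student features must already agree at each matched layer.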

3.4. Student Model Design with Geospatial Embedding

The student model was designed as a lightweight hybrid architecture, presented in Figure 2, combining convolutional recurrent modeling with attention-based temporal reasoning. Specifically, the model consists of two stacked ConvLSTM layers with hidden dimensions of 64 and 128 channels, respectively. These layers capture localized spatiotemporal dependencies in precipitation and hydrological signals while maintaining a low parameter footprint.

The ConvLSTM outputs are passed to a temporal transformer module composed of 2 transformer encoder layers, each with 4 attention heads and a model dimension of 128. This module enables selective attention over temporal states, allowing the student to focus on critical rainfall accumulation and soil saturation patterns preceding flood events.

Geospatial priors derived from elevation, slope, and land-use attributes are embedded using a two-layer Multilayer Perceptron (MLP) with Rectified Linear Unit (ReLU) activations, producing a geospatial embedding of dimension d_{g} = 32. These embeddings are injected into the ConvLSTM hidden states via spatial concatenation, providing explicit terrain-aware conditioning.

In total, the student model contains approximately 1.8 million parameters, compared to over 60 million parameters in the teacher model. This reduction enables efficient deployment on edge devices with only a small loss in predictive performance (Section 4).

3.5. Knowledge Distillation Loss Formulation

To effectively transfer hydrometeorological knowledge from the high-capacity teacher model to the lightweight student model, a multi-objective distillation framework is employed. The total training objective combines direct supervision from ground-truth observations with structured knowledge transfer from the teacher’s outputs, intermediate representations, and attention patterns.

The overall loss function is defined as:

L_{total} = α \times L_{hard} + β \times L_{soft} + γ \times L_{feature} + λ \times L_{attn}

where L_{total} is the total loss function; L_{hard} denotes the supervised learning loss; L_{soft} represents the soft-target distillation loss; L_{feature} is the feature alignment loss; L_{attn} is the attention alignment loss; and α, β, γ, λ are coefficients controlling the relative contribution of each term.

In this study, the weights are set as:

α = 1.0, β = 0.5, γ = 0.3, λ = 0.2

These values are selected through validation-based tuning to balance prediction accuracy and stable knowledge transfer.
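As an illustration of how the four terms combine, the following minimal sketch evaluates the weighted sum with the coefficients stated above. The names `DistillationWeights` and `total_loss` are hypothetical and are not taken from the authors' implementation; the individual loss values are assumed to be precomputed scalars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillationWeights:
    """Coefficients reported in this study (hypothetical container name)."""
    alpha: float = 1.0   # weight on L_hard (ground-truth supervision)
    beta: float = 0.5    # weight on L_soft (soft-target distillation)
    gamma: float = 0.3   # weight on L_feature (feature alignment)
    lam: float = 0.2     # weight on L_attn (attention alignment)


def total_loss(l_hard, l_soft, l_feature, l_attn, w=DistillationWeights()):
    """Weighted sum L_total = alpha*L_hard + beta*L_soft + gamma*L_feature + lambda*L_attn."""
    return (w.alpha * l_hard + w.beta * l_soft
            + w.gamma * l_feature + w.lam * l_attn)
```

Because α is the largest coefficient, the supervised term contributes most of the total whenever the four terms are of similar magnitude; the raw scale of each term also matters and is not reported here.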

3.5.1. Hard Supervision Loss

The hard loss enforces learning from ground-truth labels and is defined as follows:

For flood classification:

L_{hard} = Binary Cross-Entropy (y, {\hat{y}}_{S})

For discharge regression:

L_{hard} = Mean Squared Error (y, {\hat{y}}_{S})

where {\hat{y}}_{S} represents the student model prediction and y represents the ground-truth observation.

3.5.2. Soft-Target Distillation Loss

The soft loss aligns student predictions with the teacher’s probability distribution using Kullback–Leibler divergence and temperature scaling:

L_{soft} = KL (σ ({\hat{y}}_{T} / τ) ∥ σ ({\hat{y}}_{S} / τ))

where {\hat{y}}_{T} is the teacher prediction; {\hat{y}}_{S} is the student prediction; τ is the temperature parameter (τ = 4 in this study); and σ (\cdot) denotes the softmax function.

The temperature parameter τ is used to smooth probability distributions, enabling better transfer of inter-class relationships.
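The effect of temperature scaling can be illustrated with a short sketch in plain Python. It assumes that the teacher and student each emit a vector of class logits (for example, non-flood and flood) and that the divergence is taken from the teacher distribution to the student distribution. The function names are hypothetical. The τ² rescaling used in some distillation implementations is not mentioned in the text and is therefore omitted.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau, computed in a numerically stable way."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_logits, student_logits, tau=4.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax_with_temperature(teacher_logits, tau)
    q = softmax_with_temperature(student_logits, tau)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

With τ = 4 the softened distributions are flatter than at τ = 1, so the relative probability that the teacher assigns to the less likely class carries more weight in the loss.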

3.5.3. Feature Distillation Loss

To transfer intermediate representations, a feature alignment loss is introduced. Since the teacher and student operate in different representation spaces (spectral representations in the teacher vs. recurrent-attention representations in the student), a learned projection operator is used.

The feature loss is defined as:

L_{feature} = \sum_{l = 1}^{M} {∥ P_{l} (f_{T}^{(l)}) - f_{S}^{(l)} ∥}_{2}^{2}

where M is the number of matched layer pairs; l is the layer index; f_{T}^{(l)} is the teacher feature map at layer l; f_{S}^{(l)} is the corresponding student feature representation at layer l; P_{l} (\cdot) is a learned linear projection mapping teacher features into the student feature space; and {∥ \cdot ∥}_{2}^{2} denotes the squared L2 norm.

This projection ensures dimensional compatibility between heterogeneous architectures.
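The feature loss can be sketched as follows, under the assumption that each feature map is flattened to a channels-by-positions array and that P_{l} acts as a C_{S} × C_{T} matrix applied independently at every spatial position (equivalent to a 1 × 1 convolution). In training the projection matrices would be learned jointly with the student; here they are fixed inputs, and all function names are hypothetical.

```python
def project(P, f_teacher):
    """Apply a C_S x C_T projection to a C_T x N teacher feature map."""
    n_positions = len(f_teacher[0])
    return [
        [sum(P[s][t] * f_teacher[t][i] for t in range(len(f_teacher)))
         for i in range(n_positions)]
        for s in range(len(P))
    ]


def feature_loss(projections, teacher_feats, student_feats):
    """Sum over matched layers of ||P_l(f_T) - f_S||_2^2."""
    total = 0.0
    for P, f_t, f_s in zip(projections, teacher_feats, student_feats):
        projected = project(P, f_t)
        total += sum(
            (a - b) ** 2
            for row_p, row_s in zip(projected, f_s)
            for a, b in zip(row_p, row_s)
        )
    return total
```

For example, projecting a three-channel teacher map onto two student channels with `P = [[1, 0, 0], [0, 1, 0]]` simply drops the third teacher channel before the squared difference is taken.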

3.5.4. Attention Alignment Loss

To preserve the teacher’s spatiotemporal reasoning structure, attention knowledge is distilled through a learned projection-based alignment mechanism. Since the teacher (FNO-based spectral model) and student (transformer-based model) operate in fundamentally different representational spaces, a learned adapter function is required to ensure meaningful alignment. The attention alignment loss is defined as:

L_{attn} = \sum_{l = 1}^{M} {∥ Q_{l} (A_{T}^{(l)}) - A_{S}^{(l)} ∥}_{2}^{2}

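where M denotes the number of aligned layer pairs, A_{T}^{(l)} is the reshaped interaction representation extracted from the teacher at layer l, A_{S}^{(l)} is the corresponding student attention matrix, Q_{l} (\cdot) is the learned projection adapter, and ∥ \cdot ∥_{2}^{2} denotes the squared Frobenius norm.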

This projection-based formulation is needed because of the structural mismatch between the spectral convolutional interactions in the Fourier-based teacher and the self-attention mechanisms in the transformer-based student, and it keeps the cross-architecture alignment consistent.

3.6. Training Pipeline and Edge Profiling

The model development process followed three stages. First, the teacher model was initialized and adapted regionally. Second, knowledge distillation was performed to train the lightweight student model. Finally, edge-device profiling was conducted. The overall workflow begins with pretraining the teacher model on large-scale global climate datasets, specifically ERA5 reanalysis fields and CMIP6 historical climate simulations. These datasets provide long-term, multivariate climate signals that enable the model to learn generalized hydrological patterns of atmospheric and land-surface dynamics. To complement global-scale learning, region-specific climate–hydrological patches were extracted using geospatial masking and combined with satellite and in situ hydrological observations, thereby enriching the representation of localized flood-prone dynamics.

To ensure rigorous evaluation and prevent temporal or spatial leakage, the dataset was partitioned using a spatiotemporal generalization strategy. Data from 2010 to 2018 were used for training, 2019 to 2020 for validation, and 2021 to 2023 for final testing. This temporal separation ensures that the model is evaluated strictly on unseen future periods, reflecting realistic forecasting conditions rather than interpolation within known time windows. In addition, a leave-one-region-out evaluation protocol was adopted to assess spatial generalization. Under this setting, each target region (Nigeria, Kenya, Mozambique, Bangladesh, India, and Nepal) was systematically excluded during training and used exclusively for testing. This design enables a strict evaluation of cross-basin transferability across heterogeneous hydrological regimes.
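The two partitioning rules described above can be expressed as a short sketch. The year boundaries and region names are those stated in the text; the function names are hypothetical, and the way the temporal and spatial splits are combined within a single fold is not specified here, so the two rules are shown as independent building blocks.

```python
REGIONS = ("Nigeria", "Kenya", "Mozambique", "Bangladesh", "India", "Nepal")


def temporal_subset(year):
    """Map a calendar year to its subset under the temporal split."""
    if 2010 <= year <= 2018:
        return "train"
    if 2019 <= year <= 2020:
        return "val"
    if 2021 <= year <= 2023:
        return "test"
    return None  # outside the study period


def leave_one_region_out(regions=REGIONS):
    """Yield (training_regions, held_out_region) for each fold."""
    for held_out in regions:
        training = tuple(r for r in regions if r != held_out)
        yield training, held_out
```

With six regions this produces six folds, each training on five regions and testing on the remaining one.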

Following global pretraining, the teacher model was fine-tuned on regional datasets to adapt its representations to local hydrological processes such as rainfall–runoff response, soil moisture evolution, and topographically driven flow accumulation. This fine-tuning stage was supervised using historical flood records and discharge observations, ensuring that the learned representations are explicitly aligned with flood-relevant predictive tasks. Once adapted, the fine-tuned teacher model served as the supervisory network in the knowledge distillation stage, where its outputs, intermediate feature maps, and attention patterns were transferred to the lightweight student model. The student architecture, built from ConvLSTM layers and a temporal transformer module conditioned on geospatial embeddings (Section 3.4), was trained to replicate the teacher’s predictive behavior while significantly reducing computational complexity and memory requirements.

To ensure robustness and fair evaluation, all experiments were conducted under a unified implementation framework using PyTorch-2.1.0 Lightning with identical preprocessing pipelines across models. Training was performed on an NVIDIA A100 GPU (NVIDIA Corporation, Santa Clara, CA, USA) using mixed-precision (FP16) computation to enhance efficiency while maintaining numerical stability. Input sequences consisted of eight temporal steps at three-hour intervals, representing a 24 h historical context window, and all variables were harmonized to a nominal spatial resolution of 0.1 degrees following preprocessing alignment. A batch size of 16 was used consistently across all experiments to balance computational efficiency and convergence stability.

Model optimization was carried out using the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻⁵, configured with momentum parameters β₁ = 0.9 and β₂ = 0.999. A cosine annealing learning rate schedule with a warm-up phase was applied, in which the learning rate was gradually increased over the first five epochs before decaying to a minimum value of 1 × 10⁻⁶. Training was conducted for up to 50 epochs, with early stopping based on validation loss using a patience of 10 epochs to mitigate overfitting.
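The learning rate schedule can be illustrated with a small function. The peak rate, floor, warm-up length, and epoch budget are those stated above; the linear shape of the warm-up and the zero-based epoch indexing are assumptions, since the text only states that the rate was increased gradually. The function name is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=1e-4, min_lr=1e-6,
                  warmup_epochs=5, total_epochs=50):
    """Learning rate for a zero-based epoch index.

    Linear warm-up (assumed) to base_lr over the first warmup_epochs,
    then cosine annealing down to min_lr at the final epoch.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    decay_span = max(1, total_epochs - warmup_epochs - 1)
    progress = min(1.0, (epoch - warmup_epochs) / decay_span)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 2 × 10⁻⁵, reaches 1 × 10⁻⁴ at the end of warm-up, and falls to 1 × 10⁻⁶ by the last of the 50 epochs; early stopping may end training before that point.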

For deployment benchmarking, inference experiments were conducted using ONNX Runtime with batch size fixed at 1 to simulate real-time operational deployment conditions. The benchmark input tensor dimensions were configured as C × T × H × W = 14 × 8 × 64 × 64, corresponding to 14 multimodal environmental variables, 8 temporal steps, and a 64 × 64 spatial grid. To improve inference efficiency on edge hardware, INT8 post-training quantization was applied during deployment profiling, while structural pruning was not used in order to preserve predictive stability. Latency measurements were averaged over multiple inference runs on a Raspberry Pi 4 Model B (Raspberry Pi Foundation, Cambridge, UK) and an NVIDIA Jetson Nano Developer Kit (NVIDIA Corporation, Santa Clara, CA, USA) under identical runtime settings.
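To make the benchmark configuration concrete, the sketch below computes the size of a single input tensor and shows a generic way to average latency over repeated calls. It uses only the Python standard library. The helper names, the warm-up step, and the default number of runs are assumptions, as the text states only that multiple runs were averaged. In practice `run_once` would wrap the actual inference call.

```python
import time
from math import prod

BENCHMARK_SHAPE = (14, 8, 64, 64)  # C x T x H x W, batch size 1


def tensor_bytes(shape, bytes_per_element):
    """Raw storage for one dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


def mean_latency_ms(run_once, runs=100, warmup=10):
    """Average wall-clock time per call in milliseconds (assumed protocol)."""
    for _ in range(warmup):
        run_once()  # discard warm-up calls (cache and allocator effects)
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / runs
```

One benchmark input holds 458,752 values, which corresponds to 1.75 MiB in FP32. This figure covers the input tensor only and excludes model weights and activations.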

To ensure reproducibility and consistency in deployment reporting, model size is defined as the serialized ONNX model footprint in FP32 precision, including computational graph metadata and runtime buffers. This definition follows standard deployment benchmarking practices in edge AI literature.

The reported parameter counts reflect learnable weights only, while model size (MB) reflects full serialized inference artifacts. Also, INT8 quantization was applied exclusively for deployment benchmarking on edge devices (Raspberry Pi 4 and NVIDIA Jetson Nano), and does not alter the reported FP32 model size values used for compression ratio analysis.

Memory profiling included both static model parameters and dynamic activation memory generated during sequential ConvLSTM processing. Consequently, reported Random Access Memory (RAM) consumption reflects not only the compact model size but also the temporary storage required for hidden-state propagation and transformer attention computations during inference.

The forecasting configuration is defined as a sequence-to-one prediction task. The model receives an input historical window of 24 h, represented as 8 time steps at 3 h intervals, and produces a forecast for flood occurrence and discharge conditions at a 24 h lead time (t + 24 h). This setup ensures short-range predictive capability suitable for early flood warning applications in low-resource environments.
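The sequence-to-one configuration can be sketched as a windowing routine over a 3-hourly series. It assumes that the 24 h lead time corresponds to eight steps after the last input step and that features and labels share the same time index; the function name is hypothetical.

```python
def make_samples(series, labels, context_steps=8, lead_steps=8):
    """Build (input_window, target) pairs for sequence-to-one forecasting.

    Each input window holds `context_steps` consecutive 3-hourly steps;
    the target is the label `lead_steps` steps (24 h) after the last input.
    """
    samples = []
    last_start = len(series) - context_steps - lead_steps
    for start in range(last_start + 1):
        end = start + context_steps           # exclusive end of the window
        target_index = end - 1 + lead_steps   # t + 24 h
        samples.append((series[start:end], labels[target_index]))
    return samples
```

A series must therefore contain at least 16 steps (48 h) to yield a single training sample.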

For benchmarking against the Global Flood Awareness System (GloFAS v4.0), developed jointly by the European Commission Joint Research Centre (JRC), Ispra, Italy, and the European Centre for Medium-Range Weather Forecasts (ECMWF), Reading, United Kingdom, we align evaluation by comparing model outputs against GloFAS forecasts at overlapping lead-time windows. Since GloFAS provides ensemble forecasts at multi-day horizons (3–5 days), we aggregate and temporally align these outputs to the nearest comparable evaluation window for consistent cross-system comparison. This ensures that performance differences reflect modeling approach rather than temporal mismatch.

The knowledge distillation process employed a multi-objective loss function combining supervised learning, soft-target alignment, feature matching, and attention alignment. To address severe imbalance between flood and non-flood samples, class-weighted supervision was incorporated into the hard-loss component, with weights computed from inverse class frequency statistics derived from the training set. The weighting coefficients were fixed at α = 1.0, β = 0.5, γ = 0.3, and λ = 0.2, while the distillation temperature was set to τ = 4. These hyperparameters were selected through validation-based tuning to ensure stable convergence and balanced information transfer between teacher and student models. The teacher model was initialized through pretraining on ERA5 and CMIP6 datasets, whereas the student model was initialized using Xavier initialization for convolutional components and uniform initialization for transformer-based layers. To ensure result robustness, all experiments were repeated over five independent runs with different random seeds, and final results are reported as mean values with corresponding standard deviations.
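The class-weighted hard loss can be illustrated as follows. The text states only that weights were computed from inverse class frequencies in the training set; the specific normalization used here (total samples divided by the number of classes times the class count) is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count) (assumed form)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_bce(y_true, y_prob, weights, eps=1e-7):
    """Mean binary cross-entropy with a per-class weight on each sample."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[y] * loss
    return total / len(y_true)
```

With nine non-flood samples for every flood sample, the flood class receives a weight of 5.0 and the non-flood class about 0.56, so a missed flood costs roughly nine times as much as an equally confident false alarm.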

Finally, the distilled student model underwent detailed edge-deployment profiling to evaluate its suitability for real-time forecasting in low-resource environments. The evaluation included parameter count, inference latency, floating-point operations (FLOPs), model storage size, and peak RAM utilization during sequential inference. Profiling was performed using ONNX Runtime with INT8 quantization enabled and batch size fixed at 1 to emulate real-time deployment conditions on Raspberry Pi 4 and NVIDIA Jetson Nano devices. Reported RAM usage includes both static parameter allocation and dynamic activation memory associated with ConvLSTM temporal state retention and transformer attention operations during inference.

4. Results and Discussion

The performance of the framework was assessed using classification, regression, calibration, and efficiency metrics against the selected baseline models.

The evaluation covered both classification tasks (flood occurrence and severity prediction) and regression tasks (continuous discharge estimation). Because flood prediction at fine spatiotemporal resolution is inherently characterized by severe class imbalance, evaluation was extended beyond conventional accuracy and ROC-AUC metrics to include precision–recall (PR) curves and average precision (AP) scores, which provide more reliable assessment under rare-event conditions. Classification performance was therefore evaluated using accuracy, F1-score, recall, false alarm rate, ROC-AUC, and AP, while regression performance was assessed using mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²).

Classification thresholds were optimized on the validation set to balance recall and precision for operational flood warning scenarios. During training, a class-weighted supervision strategy was employed to reduce bias toward majority non-flood samples and improve sensitivity to rare flood events.

All deep learning models were evaluated over five independent runs with different random seeds, and results are reported as mean ± standard deviation. Standard deviations remained below 1.5%, indicating stable convergence. The random forest baseline was evaluated once using a fixed configuration.

The training dynamics of the student model, including the convergence behavior of the loss and accuracy curves, are shown in Figure 3. The model showed consistent convergence with minimal overfitting, as validation metrics closely followed the training metrics across epochs.

To comprehensively benchmark the proposed distilled framework, comparisons were conducted against multiple deep learning, operational forecasting, and classical machine learning baselines using identical input modalities and evaluation protocols. The evaluated baselines included a non-distilled ConvLSTM model, a random forest regressor using handcrafted hydrological predictors, transformer-based sequence forecasting architectures including Informer and PatchTST, and an LSTM-based hydrological forecasting model inspired by CAMELS-style regional rainfall–runoff learning frameworks proposed by Kratzert et al. [32].

Large-scale operational flood forecasting outputs from the Global Flood Awareness System (GloFAS v4.0) were incorporated as an external benchmark where compatible discharge observations and forecast horizons were available. GloFAS ensemble reforecast products were evaluated at a 3–5-day lead time and spatially regridded to the unified 0.1° evaluation framework for consistent comparison across models.

Because direct reproduction of physics-based hydrological routing systems such as the Variable Infiltration Capacity (VIC) model, the Soil and Water Assessment Tool (SWAT), and the LISFLOOD hydrological and flood forecasting model requires extensive basin-specific calibration and hydrodynamic parameterization beyond the scope of this study, GloFAS was adopted as a representative operational physics-based baseline. This enables a consistent comparison between the proposed distilled framework and established large-scale hydrological forecasting systems under harmonized spatiotemporal evaluation conditions.

An empirical tradeoff between predictive performance and inference latency is presented, illustrating the benefits of model distillation. The student model retained 96.7% of the teacher model’s classification accuracy while achieving an 8× reduction in latency. These results affirm the feasibility of deploying the distilled model in real-time, low-power environments.

4.1. Quantitative Performance

After establishing the evaluation protocol, the student model was compared with several baseline approaches. The comparison included deep learning models (ConvLSTM, Informer, and PatchTST), an LSTM-based hydrological forecasting model inspired by CAMELS-style rainfall–runoff learning, a Random Forest baseline, and the full-scale FourCastNet-inspired teacher model.

As summarized in Table 2, the proposed distilled student model achieved an average classification accuracy of 0.89, F1-score of 0.88, recall of 0.90, and AUC of 0.91, while also attaining an average precision (AP) score of 0.87 under severe class imbalance conditions.

In discharge estimation, performance metrics were computed on z-score normalized discharge values applied per basin to ensure numerical stability and fair comparison across heterogeneous hydrological regimes. Accordingly, MAE and RMSE are reported in normalized discharge units rather than raw physical discharge (m³/s). This ensures that evaluation reflects model predictive error relative to basin-specific variability rather than absolute magnitude differences across catchments.
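The per-basin normalization can be sketched as below. It assumes that each basin's mean and standard deviation are estimated from that basin's training discharge values, which the text does not state explicitly; the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_basin_stats(train_discharge_by_basin):
    """Per-basin (mean, standard deviation) from training discharge values."""
    return {b: (mean(v), pstdev(v)) for b, v in train_discharge_by_basin.items()}


def normalize(value, basin, stats):
    """z-score of a discharge value relative to its own basin."""
    mu, sigma = stats[basin]
    return (value - mu) / sigma if sigma > 0 else 0.0


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))
```

Under this scheme, an error of one basin standard deviation counts equally for a small catchment and a large river. A normalized error can be converted back to m³/s by multiplying it by the standard deviation of the corresponding basin.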

Although the teacher model achieved the highest overall predictive performance, the distilled student retained approximately 96.7% of the teacher’s classification capability while reducing inference latency by nearly 8× and model size by more than 20×. Compared with transformer-based baselines such as Informer and PatchTST, the proposed student model achieved comparable forecasting skill with significantly lower computational overhead and memory consumption, making it more suitable for deployment on low-power edge devices.

The CAMELS-style LSTM hydrology baseline demonstrated competitive discharge prediction capability but exhibited reduced generalization across geographically heterogeneous flood regimes. Similarly, GloFAS reference forecasts provided strong large-scale riverine forecasting performance but lacked localized adaptation for high-resolution regional flood prediction. Classical machine learning approaches such as Random Forest showed lower predictive accuracy and weaker temporal consistency, particularly for extreme flood events and long forecasting horizons.

Precision–recall analysis showed that the model maintained reliable rare-event detection performance despite the substantial imbalance between flood and non-flood samples. This indicates that the reported performance is not solely driven by majority-class dominance but reflects meaningful predictive skill for operational flood early-warning applications.

4.2. Generalization Gap Analysis (Train/Validation vs. Test Performance)

A consistent performance gap is observed between training/validation results and final test performance, as reported in Table 2 and Figure 3. This gap is not indicative of model instability, but rather reflects the presence of combined temporal and spatial distribution shifts inherent in the evaluation design.

The temporal split (training: 2010–2018, validation: 2019–2020, test: 2021–2023) introduces a measurable non-stationarity effect in hydroclimatic forcing variables, including changes in precipitation intensity, seasonal variability, and extreme event frequency. These shifts are particularly pronounced in monsoon-influenced regions, where interannual variability is high.

In addition to temporal shift, the evaluation incorporates a leave-one-region-out spatial generalization setting, which further increases distributional divergence between training and test domains. Under this setting, each test region represents a structurally different hydrological regime not seen during training, including Sahelian flash flood systems, South Asian monsoon basins, and Himalayan catchments.

A detailed per-region breakdown shows that performance degradation is most significant in highly non-stationary hydrological environments, particularly monsoon-dominated basins such as Bangladesh and parts of India. In contrast, regions with more stable hydrological regimes exhibit smaller performance drops, indicating that model generalization is sensitive to climatic variability rather than uniformly degraded.

Importantly, despite this expected distribution shift, the model maintains strong performance across all test regions, confirming that the observed gap reflects realistic deployment conditions rather than overfitting. This behavior is consistent with prior findings in spatiotemporal climate modeling, where cross-domain transfer typically incurs measurable but bounded degradation under strict spatial–temporal separation protocols.

Extreme Flood Event Performance Analysis

To explicitly evaluate model robustness under extreme hydrological conditions, we conducted a stratified analysis of performance during historically high-severity flood years. Extreme flood years were defined using a combined criterion based on the upper 10% quantile of discharge anomalies and confirmed extreme event records from the Dartmouth Flood Observatory (DFO) catalog.

Under this definition, test samples were partitioned into normal and extreme-event subsets, and performance degradation was computed relative to overall test performance. Results indicate that the distilled student model exhibits a performance degradation of less than 6% in accuracy and F1-score during extreme flood years compared to normal hydrological conditions.

This limited degradation indicates that the model maintains predictive capability even under rare and high-intensity flood conditions, where distributional shift and nonlinear hydrological dynamics are most pronounced.

4.3. Calibration and Reliability Analysis

Because flood forecasting is a high-stakes decision-support task, probability calibration was further evaluated to ensure that predicted flood probabilities correspond closely to observed event frequencies. In addition to ROC-AUC evaluation, calibration quality was assessed using reliability diagrams, Expected Calibration Error (ECE), and Brier scores before and after post hoc temperature scaling.

Temperature scaling was applied on the validation set using a single scalar temperature parameter to recalibrate the student model’s probabilistic outputs without modifying classification boundaries. Reliability analysis demonstrated that calibration substantially improved the agreement between predicted probabilities and observed flood occurrence frequencies, particularly for medium- and high-risk events.
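Post hoc temperature scaling can be illustrated with a minimal sketch for a binary output, assuming the student emits a single flood logit that is passed through a sigmoid. The grid search over candidate temperatures is a simplification chosen for clarity and is not necessarily the optimizer used in this study; all names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits, labels, temperature, eps=1e-12):
    """Mean negative log-likelihood of labels under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def fit_temperature(val_logits, val_labels, candidates=None):
    """Pick the scalar T that minimizes validation NLL (grid search)."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0
    return min(candidates, key=lambda t: nll(val_logits, val_labels, t))
```

Dividing a logit by a positive temperature never changes its sign, so predictions thresholded at a probability of 0.5 are unaffected, which is consistent with the statement that classification boundaries are not modified.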

The distilled student model achieved a post-calibration ECE of 0.031 compared to 0.084 before calibration, while the Brier score improved from 0.118 to 0.091. Similar improvements were observed for the teacher model, although the distilled student maintained lower inference cost and memory consumption. Reliability diagrams presented in Figure 4 further demonstrate that post-calibration predictions more closely follow the ideal diagonal calibration line, indicating improved probabilistic consistency across forecast thresholds.

The reliability diagrams were generated using a fixed binning strategy for probability calibration analysis. Initial visual artifacts in the reliability curve (Figure 4) were attributed to the use of a coarse bin resolution (10 bins) combined with prediction concentration in high-confidence regions, a known effect in imbalanced classification problems such as flood prediction.

To improve interpretability, the visualization was recomputed using an increased bin resolution (20 bins) without altering the underlying probability estimates used for Expected Calibration Error (ECE) computation. This ensures that reported calibration metrics remain unchanged while improving the visual smoothness and interpretability of reliability plots.
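For reference, the two calibration metrics can be computed as in the sketch below, which uses equal-width bins over the predicted flood probability. This is one common binary formulation of ECE and is assumed rather than taken from the authors' code; the function names are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between observed frequency and mean predicted probability."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    n_samples = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += len(members) / n_samples * abs(observed - predicted)
    return ece
```

The ECE value generally depends on the number of bins, so a reported ECE remains unchanged only if its own bin count is held fixed while the diagram is redrawn at a finer resolution. The Brier score involves no binning and is unaffected either way.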

These results indicate that the framework not only achieves reliable discrimination performance but also produces well-calibrated probabilistic forecasts suitable for operational flood early-warning systems, where reliable uncertainty estimation is essential for balancing false alarms and missed flood events.

4.4. Model Compression Effectiveness

To assess the efficiency of the proposed distillation framework, the student model’s computational profile was benchmarked against baselines. As shown in Figure 5, the student model achieved an 8× reduction in inference latency compared to the teacher model and a 20× reduction in model size, with only a 3% drop in accuracy. Specifically, the distilled model required 85 million FLOPs and occupied 18 MB of storage, enabling real-time prediction on devices such as the Raspberry Pi 4 and Jetson Nano.

This performance–efficiency tradeoff underscores the practical viability of the approach for low-resource settings where bandwidth, compute, and power consumption are major constraints.

In addition, the performance metrics of all models are tabulated in Table 3, offering a side-by-side comparison of predictive and computational characteristics.

4.5. Regional Generalization

To evaluate spatial generalization, the model was assessed using a structured leave-one-country-out validation protocol across the study regions, namely Nigeria, Kenya, Mozambique, Bangladesh, India, and Nepal. In this evaluation framework, each country was systematically excluded from the training set and used exclusively as a held-out test region, thereby enabling a strict assessment of cross-regional transferability under heterogeneous hydrological and climatic regimes. This design avoids reliance on geographically adjacent testing and instead provides a more robust measure of true out-of-distribution generalization across distinct flood systems.

Performance is reported at the country level to ensure interpretability across diverse flood-generating mechanisms, including Sahelian flash floods in West Africa, monsoon-driven riverine flooding in South Asia, and complex catchment responses in Himalayan basins. As summarized in Figure 6, the model demonstrates consistently strong performance across both seen and held-out regions, with variability reflecting differences in hydrological complexity and climatic non-stationarity.

Across all evaluated regions, the average degradation in accuracy relative to in-domain performance remains below 4.5%, confirming strong but non-uniform transferability across climatic zones. Importantly, discharge predictions are reported in physical units (m³/s), enabling physically meaningful comparison across basins with differing hydrological scales, while normalized metrics are additionally provided where inter-basin magnitude differences may bias interpretation.

These findings suggest that the proposed student model learns transferable hydrometeorological representations supported by geospatial embeddings and attention-based temporal modeling; however, the degree of transferability is influenced by regional hydrological heterogeneity, particularly in extreme or highly non-stationary flood regimes.

4.6. Ablation Studies

To understand the contribution of individual components in the distilled student architecture, a series of ablation experiments was conducted. Each experiment involved systematically removing a key module or loss term and retraining the student model under identical conditions. The results provide empirical evidence for the role of each design decision in enhancing predictive accuracy and generalization.

In the first variant, attention distillation was removed from the loss function. This modification resulted in a noticeable degradation in model performance, with the AUC score dropping from 0.91 to 0.854 and the F1-score falling by 5%. This performance decay highlights the significance of preserving inter-variable temporal dependencies and spatial correlation patterns learned from the teacher model.

The second variant excluded the feature alignment loss, which is responsible for bridging intermediate representations between the teacher and student networks. Although the overall accuracy remained moderately high (86%), the predictions became less calibrated, and the variance in the flood risk maps increased. This resulted in lower interpretability and less robust decision thresholds, indicating the importance of latent feature supervision.

A third ablation involved removing geospatial encodings from the student model’s architecture. This variant showed the steepest decline in cross-region generalization, with the accuracy in unseen regions dropping by over 8%. The AUC also fell to 0.837, reinforcing the need for integrating geographical priors in spatially aware environmental modeling.

Collectively, these results confirm that the full model configuration offers the best tradeoff between accuracy and generalization, with each component contributing uniquely to the overall architecture. The results of the ablation study are summarized in Table 4.

In addition, Figure 7 presents a comparative visualization of the key performance metrics (accuracy, F1-score, and AUC) across different architectural configurations of the student model, highlighting the effect of specific components through ablation.

The results of this study demonstrate the viability and effectiveness of a distilled deep learning framework for localized flood forecasting, particularly in low-resource environments. The proposed student model achieved strong performance across both classification and regression tasks, maintaining high fidelity to its teacher counterpart while operating at a fraction of the computational cost. These gains are especially meaningful in contexts where real-time decision-making is constrained by limited access to infrastructure, connectivity, or power.

To explicitly quantify the contribution of CMIP6 historical simulations during teacher model pretraining, we conducted a controlled ablation study comparing three configurations: (i) training from scratch without large-scale pretraining, (ii) ERA5-only pretraining, and (iii) combined ERA5 + CMIP6 pretraining (proposed approach). The results are summarized in Table 5.

As shown in Table 5, training from scratch yields the weakest performance across all metrics, with an accuracy of 0.81 and an AUC of 0.83, highlighting the importance of large-scale climate pretraining for learning meaningful climate dynamics.

Introducing ERA5-only pretraining significantly improves performance, increasing accuracy to 0.87 and AUC to 0.90. This indicates that observation-constrained reanalysis data provides strong supervision for learning hydrometeorological dynamics relevant to flood prediction.

The inclusion of CMIP6 historical simulations further improves performance, albeit more modestly, raising accuracy to 0.89 and AUC to 0.91, while also reducing cross-region generalization error from 5.2% to 4.3%. These gains indicate that CMIP6 primarily contributes to improving long-term climatological diversity and distributional robustness, rather than significantly altering event-scale predictive accuracy.

Importantly, the improvements introduced by CMIP6 are most pronounced in out-of-distribution regional testing, where exposure to broader climate variability helps stabilize predictions in unseen hydrological regimes. In contrast, ERA5 remains the dominant source of fine-scale predictive signal due to its observational constraint and higher temporal fidelity.

These results demonstrate that CMIP6 serves as a complementary pretraining signal that enhances generalization under domain shift, while ERA5 remains the primary driver of short-term flood forecasting accuracy.

A key outcome of this research is the model’s ability to generalize across diverse geographies with minimal degradation. The incorporation of geospatial encodings and region-aware attention mechanisms enabled the student model to transfer hydrometeorological knowledge across climatologically similar but data-scarce areas. This is of particular importance in regions such as Sub-Saharan Africa and South Asia, where flood risk is high but local modeling capabilities are limited. Moreover, the distillation framework promotes not only computational efficiency but also interpretability.

Despite these advantages, certain limitations must be acknowledged. The model’s predictive performance in regions with extreme or atypical flood dynamics remains less stable, and its reliance on available satellite and ground-truth data introduces sensitivity to data quality and resolution. Furthermore, ethical concerns related to under- or over-prediction of flood events necessitate careful calibration before deployment in real-world early warning systems.

4.7. Rare-Event Evaluation and Class Imbalance Analysis

Given the highly imbalanced nature of flood forecasting datasets, where flood occurrences constitute between 5% and 10% of total samples depending on region (Table 6), evaluation was extended beyond ROC-AUC to include precision–recall (PR) curves and average precision (AP), which provide a more reliable measure of predictive performance under rare-event conditions.

As shown in Table 6, all study regions exhibit strong class imbalance, with non-flood events dominating the dataset. This imbalance significantly impacts evaluation metrics, as ROC-AUC may remain high even when models perform poorly on rare positive events. Therefore, PR-AUC is treated as the primary metric for assessing operational flood detection capability.
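
The gap between ROC-AUC and precision–recall metrics under imbalance can be reproduced on a small synthetic example. The sketch below is illustrative only: the scores are invented, the 6% positive rate merely resembles the prevalences in Table 6, and `average_precision` is a plain step-wise implementation that assumes no tied scores.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive sample."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos

# Synthetic data: 94 non-flood and 6 flood samples (6% prevalence).
neg_scores = [i / 100 for i in range(94)]                  # 0.00 ... 0.93
pos_scores = [0.845, 0.855, 0.865, 0.875, 0.885, 0.895]    # high, but not top-ranked
y_true = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
prevalence = sum(y_true) / len(y_true)
```

In this toy case every flood sample outranks roughly 90% of the non-flood samples, so `auc` is high, yet a handful of higher-scoring negatives is enough to pull `ap` down substantially. A useful reference point when reading AP values is that a random ranking has an expected AP close to the positive-class prevalence.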

To ensure fair and operationally meaningful evaluation, classification thresholds were not fixed at 0.5 but instead selected using validation-set optimization. Specifically, the operating threshold was chosen to maximize the F1-score on the validation split while maintaining recall above a predefined minimum acceptable level for flood warning applications.
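
The threshold-selection rule described above can be sketched as a constrained search over candidate thresholds. This is a minimal illustration under stated assumptions: the function name `select_threshold`, the recall floor of 0.9, and the validation data are hypothetical, since the text does not report the minimum recall that was used.

```python
def select_threshold(y_true, probs, min_recall=0.9):
    """Pick the threshold that maximizes F1 subject to recall >= min_recall.

    Candidate thresholds are the distinct predicted probabilities.
    Returns None if no threshold satisfies the recall constraint.
    """
    best = None
    for t in sorted(set(probs)):
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= t)
        fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < t)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if recall < min_recall:
            continue  # violates the minimum acceptable recall
        f1 = 2 * precision * recall / (precision + recall)
        if best is None or f1 > best["f1"]:
            best = {"threshold": t, "f1": f1,
                    "precision": precision, "recall": recall}
    return best

# Hypothetical validation split: 4 flood samples among 11.
val_labels = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
val_probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.50, 0.70, 0.80, 0.90]

constrained = select_threshold(val_labels, val_probs, min_recall=0.9)
unconstrained = select_threshold(val_labels, val_probs, min_recall=0.0)
```

On this toy split the unconstrained F1 optimum sits at a higher threshold that misses one flood event, whereas the recall floor forces a lower threshold that catches all four at the cost of more false alarms. That is the tradeoff a flood-warning operator would tune.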

In addition, full precision–recall curves are reported for all models to provide a complete characterization of performance under varying decision thresholds. These curves show that the model maintains strong precision across a wide range of recall values, confirming robustness under extreme class imbalance.

While ROC-AUC remains included for comparability with prior literature, it is interpreted with caution due to its insensitivity to class prevalence. Consequently, PR-AUC and F1-score are emphasized as the primary indicators of predictive performance in this study.

4.8. Risk Management, Ethics, and Deployment Considerations

Flood forecasting systems carry significant societal responsibility, as both false positives and false negatives can lead to substantial economic and humanitarian consequences. To mitigate these risks, the model incorporates probabilistic outputs that allow for threshold calibration based on local risk tolerance. Calibration techniques, including temperature scaling, were applied to align predicted probabilities with observed frequencies.
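
Temperature scaling rescales the model's logits by a single scalar T fitted on held-out data, leaving the ranking of predictions unchanged. The sketch below illustrates the idea for a binary flood/no-flood output. It is not the authors' procedure: the sigmoid formulation, the grid search over T in [0.5, 5.0], and the validation logits are assumptions chosen to keep the example dependency-free.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def fit_temperature(logits, labels, grid=None):
    """Choose the T on a fixed grid that minimizes validation NLL."""
    if grid is None:
        grid = [0.5 + 0.05 * k for k in range(91)]  # 0.50, 0.55, ..., 5.00
    return min(grid, key=lambda t: nll(logits, labels, t))

# Hypothetical validation logits from an overconfident model:
# two of the eight confident predictions are wrong.
val_logits = [4.0, 3.5, -4.0, -3.0, 3.0, -3.5, 4.5, -4.5]
val_labels = [1, 0, 0, 1, 1, 0, 1, 0]

t_hat = fit_temperature(val_logits, val_labels)
nll_before = nll(val_logits, val_labels, 1.0)
nll_after = nll(val_logits, val_labels, t_hat)
calibrated = [sigmoid(z / t_hat) for z in val_logits]
```

A fitted T above 1 softens overconfident probabilities. Because dividing by a positive constant preserves the ordering of the logits, ranking-based metrics such as AUC are unaffected, while any fixed decision threshold should be re-selected on the calibrated probabilities.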

The framework is designed to support human-in-the-loop deployment, in which model outputs augment, rather than replace, expert decision-making by hydrologists and disaster management authorities. This approach enables contextual interpretation of predictions and reduces the risk of automated misjudgment in high-stakes scenarios.

From an ethical and governance perspective, particular care must be taken when deploying predictive models in vulnerable communities. Transparency, uncertainty communication, and continuous post-deployment monitoring are essential to ensure responsible use. The lightweight nature of the student model further supports equitable access by enabling deployment in regions with limited computational infrastructure.

5. Conclusions

This study presented a knowledge distillation framework for transforming high-capacity global climate models into lightweight, regionally adaptable flood forecasting systems suitable for deployment in low-resource environments. Through a teacher–student learning paradigm that integrates spatiotemporal attention, multimodal data fusion, and feature-level supervision, the proposed student model achieved strong predictive performance while significantly reducing computational demands.

Experimental evaluation across multiple regions demonstrated that the distilled student model retained approximately 96.7% of the teacher model’s classification accuracy, achieving an average accuracy of 0.89, an AUC of 0.91, and an F1-score of 0.88, while reducing inference latency by 8× and model size by over 20×. In discharge estimation tasks, the student model achieved a mean absolute error of 0.17, RMSE of 0.24, and an R² score of 0.85, confirming close alignment with observed hydrological measurements. These results indicate that the distillation process successfully preserved both predictive fidelity and hydrometeorological reasoning from the teacher model.

The framework also demonstrated robust regional and temporal generalization, with accuracy degradation remaining below 4.5% when applied to unseen geographic regions and below 6% during historically extreme flood years. Ablation studies further confirmed the importance of attention distillation, feature alignment, and geospatial encodings, each contributing measurably to predictive accuracy, calibration stability, and cross-region transferability. Importantly, the distilled model achieved real-time inference on edge platforms such as the Raspberry Pi 4 and NVIDIA Jetson Nano, validating its practicality for deployment in bandwidth- and power-constrained settings.

Looking ahead, several promising research directions emerge. First, incorporating active learning mechanisms could improve adaptability by enabling community-level or expert feedback during deployment. Second, extending the multimodal input space with drone imagery and mobile sensor networks may enhance situational awareness in data-scarce regions. Third, embedding the forecasting framework within multilingual and culturally contextualized early warning systems could increase accessibility and trust among end users. Finally, future work may explore cross-region meta-distillation and continual learning strategies to further strengthen generalization across heterogeneous climate regimes.

Together, the framework demonstrates that knowledge distillation can bridge the gap between high-capacity climate intelligence systems and practical flood forecasting applications, enabling scalable and computationally efficient early-warning solutions for climate-vulnerable regions.

Author Contributions

Conceptualization, J.O. and D.O.; methodology, J.O.; software, J.O.; validation, J.O. and D.O.; formal analysis, J.O.; investigation, D.O. and I.C.O.; resources, J.O., I.C.O. and M.N.N.; data curation, J.O.; writing—original draft preparation, J.O. and D.O.; writing—review and editing, D.O., M.N.N. and I.C.O.; visualization, J.O.; supervision, D.O. and I.C.O.; project administration, D.O. and M.N.N.; funding acquisition, M.N.N. M.N.N. additionally contributed to interdisciplinary research coordination, analytical interpretation of the decision-support implications of the framework, and critical review of the manuscript from a systems and sustainability perspective. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The article processing charge (APC) was supported by Madison N. Ngafeeson.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. ERA5 reanalysis data are available through the Copernicus Climate Data Store, while CMIP6 datasets are accessible through the Earth System Grid Federation (ESGF) data portals. The specific datasets, preprocessing procedures, and experimental configurations used in this study are described in the manuscript and can be obtained from the corresponding sources.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ECMWF	European Centre for Medium-Range Weather Forecasts
GRDC	Global Runoff Data Centre
DFO	Dartmouth Flood Observatory
ECE	Expected Calibration Error
ERA5	Fifth Generation ECMWF Reanalysis
FLOPs	Floating Point Operations
GloFAS	Global Flood Awareness System
GPM	Global Precipitation Measurement
LSTM	Long Short-Term Memory
MAE	Mean Absolute Error
MODIS	Moderate Resolution Imaging Spectroradiometer
MLP	Multilayer Perceptron
ONNX	Open Neural Network Exchange
PR	Precision–Recall
QKV	Query-Key-Value
RMSE	Root Mean Square Error
ROC	Receiver Operating Characteristic
SRTM	Shuttle Radar Topography Mission
T	Temperature/temporal dimension (depending on notation)
FNO	Fourier Neural Operator
AUC	Area Under the Receiver Operating Characteristic Curve
AP	Average Precision
CMIP6	Coupled Model Intercomparison Project Phase 6
ConvLSTM	Convolutional Long Short-Term Memory
INT8	8-bit Integer Quantization
FP16	16-bit Floating Point Precision
FP32	32-bit Floating Point Precision
RAM	Random Access Memory
GPU	Graphics Processing Unit

References

1. Amaratunga, D.; Anzellini, V.; Guadagno, L.; Hagen, J.S.; Komac, B.; Krausmann, E.; Linsen, M.; Pescaroli, G.; Rossi, J.L.; Samuel, K.; et al. United Nations Office for Disaster Risk Reduction Regional Assessment Report on Disaster Risk Reduction 2023: Europe and Central Asia; UNDRR: Geneva, Switzerland, 2023; Available online: https://discovery.ucl.ac.uk/id/eprint/10182237/ (accessed on 18 January 2026).
2. Winsemius, H.C.; Aerts, J.C.; Van Beek, L.P.; Bierkens, M.F.; Bouwman, A.; Jongman, B.; Kwadijk, J.C.J.; Ligtvoet, W.; Lucas, P.L.; Van Vuuren, D.P. Global drivers of future river flood risk. Nat. Clim. Change 2016, 6, 381–385.
3. McLeman, R.; Hevesi, C.; Cadham, E. Evolution of Climate-Related Migration and Displacement in IPCC Reporting; Working Paper No. 2025/14; Toronto Metropolitan Centre for Immigration and Settlement (TMCIS): Toronto, ON, Canada, 2025.
4. Najafi, H.; Lagerwall, G.L.; Obeysekera, J.; Liu, J. Machine Learning in Climate Downscaling: A Critical Review of Methodologies, Physical Consistency, and Operational Applications. Water 2026, 18, 271.
5. Vandal, T.; Kodra, E.; Ganguly, S.; Michaelis, A.; Nemani, R.; Ganguly, A.R. Deepsd: Generating high resolution climate change projections through single image super-resolution. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1663–1672.
6. Beck, H.E.; Vergopolan, N.; Pan, M.; Levizzani, V.; Van Dijk, A.I.; Weedon, G.P.; Brocca, L.; Pappenberger, F.; Huffman, G.J.; Wood, E.F. Global-scale evaluation of 22 precipitation datasets using gauge observations and hydrological modeling. Hydrol. Earth Syst. Sci. 2017, 21, 6201–6217.
7. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, F. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
8. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
9. Rolnick, D.; Donti, P.L.; Kaack, L.H.; Kochanski, K.; Lacoste, A.; Sankaran, K.; Ross, A.S.; Milojevic-Dupont, N.; Jaques, N.; Waldman-Brown, A.; et al. Tackling climate change with machine learning. ACM Comput. Surv. 2022, 55, 1–96.
10. Hickmon, N.L.; Varadharajan, C.; Hoffman, F.M.; Wainwright, H.M.; Collis, S. Artificial Intelligence for Earth System Predictability (AI4ESP); 2021 Workshop Report, No. ANL-22/54; Argonne National Lab. (ANL): Argonne, IL, USA, 2022.
11. Sharma, H.; Kaur, S. Deep learning for sustainable development across climate, energy, agriculture and urban systems. Discov. Sustain. 2025, 6, 1408.
12. Karapetyan, A.; Chow, A.C.H.; Madanat, S. Deep Vision-Based Framework for Coastal Flood Prediction Under Climate Change Impacts and Shoreline Adaptations. arXiv 2024, arXiv:2406.15451.
13. Kow, P.Y.; Liou, J.Y.; Yang, M.T.; Lee, M.H.; Chang, L.C.; Chang, F.J. Advancing climate-resilient flood mitigation: Utilizing transformer-LSTM for water level forecasting at pumping stations. Sci. Total Environ. 2024, 927, 172246.
14. Thirugnanasammandamoorthi, P.; Ghosh, D.; Dewangan, R.K.; Hasan, M.K.; Ariffin, K.A.Z.; Abbas, H.S.; Elshafie, H.; Saeed, R.A.; Awouda, A.E. FloodNet-Lite: A Lightweight Deep Learning for Flood Mapping Using Remote Sensing Data with Optimized UNet and Edge Deployment Approach in 6G. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 20294–20314.
15. Li, S.; Song, Q.; Zeng, S. AI-Driven Consumer Safety Monitoring System with Visual Energy Optimization for Flood Detection and Prediction. IEEE Trans. Consum. Electron. 2025, 71, 8533–8544.
16. Xiang, K.; Fujii, A. Dare: Distill and reinforce ensemble neural networks for climate-domain processing. Entropy 2023, 25, 643.
17. Yang, J.; Song, J.; Han, X.; Bi, Z.; Wang, T.; Liang, C.X.; Song, X.; Zhang, Y.; Niu, Q.; Peng, B.; et al. Feature alignment and representation transfer in knowledge distillation for large language models. arXiv 2025, arXiv:2504.13825.
18. Chithra, E.S. Deep Learning and Geospatial Analytics for Climate-Smart Resource Management. R. Int. Glob. J. Adv. Appl. Res. 2025, 2, 113–117.
19. Li, W.; Liu, C.; Xu, Y.; Niu, C.; Li, R.; Li, M.; Hu, C.; Tian, L. An interpretable hybrid deep learning model for flood forecasting based on Transformer and LSTM. J. Hydrol. Reg. Stud. 2024, 54, 101873.
20. Zhou, X.; Wang, W.; Buntine, W.; Qu, S.; Sriramulu, A.; Tan, W.; Bergmeir, C. Scalable transformer for high dimensional multivariate time series forecasting. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 3515–3526.
21. Nadir, M.; Aïssa, R.; Ali, S.M.; Sohail, I. A Systematic Review of Flood Management Evolution, with Emphasis on How Generative AI Reshapes Prediction-to-Decision Pathways. Water 2026, 18, 582.
22. Haris, M.; Keshtkar, S.; Anjum, A.; Syed, M.H.; Kojima, H.; Ahmad, R. Disaster management technology and approaches: A focus on heterogeneous systems. Environ. Syst. Decis. 2026, 46, 1–28.
23. Jindal, N.; Chauhan, P.; Chanda, K.; Ravi, C.; Garg, A. Integrating Ai, Data Science, And Decision Analytics for Climate-Resilient Business Strategy: A Multi-Criteria Engineering Approach. Lex Localis J. Local Self-Gov. 2025, 23, 380–390.
24. Hersbach, H.; Bell, B.; Berrisford, P.; Hirahara, S.; Horányi, A.; Muñoz-Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.; Schepers, D.; et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 2020, 146, 1999–2049.
25. Eyring, V.; Bony, S.; Meehl, G.A.; Senior, C.A.; Stevens, B.; Stouffer, R.J.; Taylor, K.E. Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev. 2016, 9, 1937–1958.
26. Huffman, G.J.; Bolvin, D.T.; Braithwaite, D.; Hsu, K.L.; Joyce, R.J.; Kidd, C.; Nelkin, E.J.; Sorooshian, S.; Stocker, E.F.; Tan, J.; et al. Integrated multi-satellite retrievals for the global precipitation measurement (GPM) mission (IMERG). In Satellite Precipitation Measurement; Springer International Publishing: Cham, Switzerland, 2020; Volume 1, pp. 343–353.
27. Global Runoff Data Centre (GRDC), 56068 Koblenz, Germany. Available online: https://www.bafg.de/GRDC/ (accessed on 18 January 2026).
28. Schneider, U.; Becker, A.; Finger, P.; Meyer-Christoffer, A.; Rudolf, B.; Ziese, M. GPCC Full Data Reanalysis Version 7.0 at 1.0°: Monthly Land-Surface Precipitation from Rain-Gauges; Version 7; Global Precipitation Climatology Centre (GPCC): Offenbach am Main, Germany, 2014.
29. Farr, T.G.; Rosen, P.A.; Caro, E.; Crippen, R.; Duren, R.; Hensley, S.; Kobrick, M.; Paller, M.; Rodriguez, E.; Roth, L.; et al. The Shuttle Radar Topography Mission. Rev. Geophys. 2007, 45, RG2004.
30. Lehner, B.; Verdin, K.; Jarvis, A. HydroSHEDS Technical Documentation; World Wildlife Fund US: Washington, DC, USA, 2008; Available online: https://www.hydrosheds.org (accessed on 18 January 2026).
31. Friedl, M.A.; Sulla-Menashe, D.; Skole, B.; Hook, S.; Dwyer, J.L.; Morisette, A.; Schaaf, C.B. MODIS Collection 5 global land cover: Algorithm refinements and characterization of new datasets. Remote Sens. Environ. 2010, 114, 168–182.
32. Kratzert, F.; Klotz, D.; Shalev, G.; Klambauer, G.; Hochreiter, S.; Nearing, G. Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrol. Earth Syst. Sci. 2019, 23, 5089–5110.

Figure 1. Teacher–Student Knowledge Distillation Framework. This schematic illustrates the multi-objective process in which a high-capacity Fourier Neural Operator (FNO) teacher model transfers spatiotemporal knowledge, including soft targets, intermediate features, and attention maps, to a lightweight student model through a specialized distillation interface.

Figure 2. Student Model Architectural Design. This diagram details the lightweight hybrid architecture of the student model, which integrates two stacked ConvLSTM layers for spatiotemporal feature extraction and a 2-layer temporal transformer for capturing long-range dependencies.

Figure 3. Training Dynamics of the Model. Loss and accuracy curves for the student model over training epochs, showing the convergence behavior of the training and validation metrics.

Figure 4. Reliability and Calibration Analysis of the Model. Reliability diagrams showing predicted flood probabilities versus observed frequencies before and after temperature scaling. Plots are generated using increased bin resolution for improved visualization clarity, while ECE is computed from raw probability outputs.

Figure 5. Accuracy vs. Inference Latency of the Models. An empirical tradeoff chart benchmarking the distilled student model against the teacher and baselines, highlighting an 8× reduction in latency with only a minor (3%) drop in predictive accuracy.

Figure 6. Cross-Region Generalization Performance. A bar chart comparing the accuracy of the model across diverse training and unseen (test) geographic regions in Sub-Saharan Africa and South Asia to assess spatial transferability.

Figure 7. Model Performance by Variant. A comparative visualization of key metrics (Accuracy, F1-Score, and AUC) across different architectural configurations of the student model, demonstrating the impact of attention distillation, feature alignment, and geospatial encodings through ablation.

Table 1. Comparison of Related Work in Flood and Climate Modeling.

Approach	Data Requirements	Model Size	Distillation Used	Local vs. Global Focus	Edge Deployable
Vision-based flood models [12]	High (dense imagery)	Large	No	Local (coastal)	No
Transformer–LSTM flood models [13]	Moderate–High (sensors)	Large	No	Local	No
Lightweight flood models [14,15]	Moderate (imagery)	Small	No	Local	Yes
Climate distillation models [16]	High (global simulations)	Medium	Yes	Global	No
Proposed framework	Low–Moderate (multi-modal)	Small	Yes	Global → Local	Yes

Table 2. Comparative Performance of the Proposed Framework and Baseline Models.

Model	Accuracy	F1-Score	Recall	AUC	AP	MAE (m³/s)	RMSE (m³/s)	R²	Latency (ms)	Model Size (MB)	FLOPs (M)	RAM Usage (MB)
Teacher (FourCastNet-FNO)	0.92	0.90	0.91	0.95	0.92	0.15	0.21	0.89	600	360	2200	1024
Student (Distilled)	0.89	0.88	0.90	0.91	0.87	0.17	0.24	0.85	80	18	85	250
Informer	0.86	0.84	0.85	0.89	0.83	0.20	0.27	0.81	190	52	410	410
PatchTST	0.85	0.83	0.84	0.88	0.82	0.21	0.28	0.80	175	48	370	395
LSTM Hydrology Baseline	0.83	0.81	0.82	0.86	0.80	0.19	0.26	0.82	130	38	180	290
ConvLSTM Baseline	0.78	0.76	0.77	0.81	0.74	0.25	0.32	0.74	140	45	240	320
GloFAS Reference	0.80	0.78	0.79	0.84	0.77	0.23	0.30	0.76	N/A	N/A	N/A	N/A
Random Forest	0.74	0.71	0.72	0.77	0.69	0.28	0.37	0.69	120	3.1	10	110

Table 3. Performance Comparison of all Models.

Model	Accuracy	AUC	F1-Score	Latency (ms)	FLOPs (M)	Model Size (MB)
Teacher (FourCastNet)	0.92	0.95	0.90	600	2200	360
Student (Distilled)	0.89	0.91	0.88	80	85	18
ConvLSTM	0.78	0.81	0.76	140	240	45
Random Forest	0.74	0.77	0.71	120	10	3.1
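
The headline efficiency figures can be re-derived from the teacher and student rows of Table 3. The short sketch below only performs that arithmetic; the dictionary layout and variable names are illustrative, and the values are copied from the table.

```python
# Teacher and student rows copied from Table 3.
teacher = {"accuracy": 0.92, "latency_ms": 600, "flops_m": 2200, "size_mb": 360}
student = {"accuracy": 0.89, "latency_ms": 80, "flops_m": 85, "size_mb": 18}

# Fraction of the teacher's accuracy that the student retains.
accuracy_retention = student["accuracy"] / teacher["accuracy"]

# Reduction factors: how many times smaller or faster the student is.
latency_speedup = teacher["latency_ms"] / student["latency_ms"]
size_reduction = teacher["size_mb"] / student["size_mb"]
flops_reduction = teacher["flops_m"] / student["flops_m"]

print(f"accuracy retained: {accuracy_retention:.1%}")
print(f"latency speed-up:  {latency_speedup:.1f}x")
print(f"size reduction:    {size_reduction:.1f}x")
print(f"FLOPs reduction:   {flops_reduction:.1f}x")
```

From the tabulated values, the accuracy ratio is about 96.7%, the latency ratio is 7.5 (quoted in the text as 8×), the size ratio is 20, and the FLOPs ratio is roughly 26.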

Table 4. Ablation Results of Student Model. Reported “unseen-region drop” represents relative performance degradation with respect to the full student model baseline under architectural ablations.

Model Variant	Accuracy	F1-Score	AUC	Accuracy Drop in Unseen Regions
Full Student Model	0.89	0.88	0.91	0.000
No Attention Distillation	0.84	0.83	0.854	0.035
No Feature Alignment Loss	0.86	0.85	0.86	0.025
No Geospatial Encoding	0.82	0.80	0.837	0.080

Table 5. Pretraining Ablation Study Evaluating Contribution of ERA5 and CMIP6. Reported cross-region degradation reflects absolute performance drop under different pretraining configurations, evaluated using leave-one-region-out testing.

Pretraining Setup	Accuracy	F1-Score	AUC	Cross-Region Accuracy Drop (%)
Training from scratch	0.81	0.79	0.83	7.8
ERA5 only	0.87	0.86	0.90	5.2
ERA5 + CMIP6 (proposed)	0.89	0.88	0.91	4.3
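
Leave-one-region-out testing, as named in the Table 5 caption, holds out every sample from one region at a time and trains on the rest. The sketch below shows one way to generate such folds. It is an illustration under stated assumptions: the sample records are placeholders, the region names are taken from Table 6, and `accuracy_drop_points` reads the reported drop as an absolute difference in percentage points, which is an interpretation of the caption rather than a confirmed definition.

```python
REGIONS = ["Nigeria", "Kenya", "Mozambique", "Bangladesh", "India", "Nepal"]

def leave_one_region_out(samples):
    """Yield (held_out_region, train_samples, test_samples) for each region."""
    regions = sorted({s["region"] for s in samples})
    for held_out in regions:
        train = [s for s in samples if s["region"] != held_out]
        test = [s for s in samples if s["region"] == held_out]
        yield held_out, train, test

def accuracy_drop_points(in_region_accuracy, held_out_accuracy):
    """Absolute accuracy drop, in percentage points (assumed definition)."""
    return 100.0 * (in_region_accuracy - held_out_accuracy)

# Placeholder dataset: three dummy samples per region.
samples = [{"region": r, "sample_id": i} for r in REGIONS for i in range(3)]

folds = list(leave_one_region_out(samples))
for held_out, train, test in folds:
    # The held-out region never appears in the training portion.
    assert all(s["region"] != held_out for s in train)
    assert all(s["region"] == held_out for s in test)
```

In a real run, the model would be trained on each fold's `train` portion, scored on `test`, and the per-fold drops averaged to obtain a single cross-region figure.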

Table 6. Flood class prevalence across datasets.

Region	Flood Event Ratio (%)	Non-Flood Ratio (%)
Nigeria	6.2	93.8
Kenya	5.8	94.2
Mozambique	7.1	92.9
Bangladesh	9.4	90.6
India	8.7	91.3
Nepal	5.5	94.5

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
