1. Introduction
Floods are among the most devastating natural hazards globally, displacing millions, damaging infrastructure, and threatening food and water security [1]. Their impact is especially severe in low-resource regions, where limited access to early warning systems, resilient infrastructure, and emergency response services exacerbates vulnerability [2]. With climate change driving more frequent and extreme weather events, the demand for reliable, localized flood forecasting has become increasingly urgent [3].
While large-scale climate models, such as those from the Coupled Model Intercomparison Project (CMIP), have advanced our understanding of global hydrological trends, translating these forecasts into actionable local predictions remains challenging [4]. High-resolution climate simulations are computationally intensive, requiring supercomputing resources and dense observational datasets that are often unavailable in low-income or remote regions [5]. Moreover, these models typically operate at coarse spatial and temporal resolutions, limiting their ability to capture localized rainfall-runoff processes essential for timely flood warnings [6].
Deep learning has offered promising avenues to bridge this gap. Conventional flood prediction models leverage Convolutional Long Short-Term Memory (ConvLSTM) networks, graph-based hydrological modeling, or attention mechanisms to improve local forecasts [7], but they remain constrained by computational cost, sparse data, and limited generalizability across regions. Separately, knowledge distillation has emerged as a method to compress complex teacher models into smaller, efficient student models while preserving predictive performance [8]. Recent work in climate model distillation has focused on reducing model size or inference time [9,10], yet these approaches rarely integrate multiple objectives such as feature, output, and attention alignment, nor do they explicitly handle low-resource, data-sparse environments.
Despite advances in deep flood forecasting and climate model compression, to our knowledge no existing framework simultaneously addresses the following challenges: (i) localized flood prediction under data scarcity, (ii) multi-modal integration of satellite and limited hydrological observations, (iii) efficient student models that retain interpretability, and (iv) attention mechanisms that capture spatiotemporal dependencies and highlight the most critical features.
To address these gaps, we propose a Deep Climate Model Distillation framework for Localized Flood Forecasting in Low-Resource Areas. Our technical contributions are as follows:
Multi-objective knowledge distillation: We design a distillation strategy that aligns outputs, intermediate features, and attention maps between teacher and student models to maximize predictive fidelity.
ConvLSTM–Transformer hybrid student architecture: The student model combines ConvLSTM layers for spatiotemporal feature extraction with a Transformer module to capture long-range dependencies in flood-relevant signals.
Multi-modal data fusion: Satellite imagery and limited ground-based hydrological measurements are integrated to enhance model accuracy in data-sparse regions.
Attention-guided interpretability: Spatial and temporal attention mechanisms highlight critical areas and periods influencing flood risk, improving model explainability for end-users and decision-makers.
The remainder of this paper is organized as follows.
Section 2 reviews prior work in climate model distillation and deep flood forecasting.
Section 3 details the proposed methodology, including architecture design, multi-objective loss formulation, and training strategies.
Section 4 presents experimental results across diverse low-resource case studies.
Section 5 discusses implications, limitations, and future research directions.
3. Materials and Methods
This study created a framework to simplify large, high-capacity climate models into smaller, regionally tailored forecasting models aimed at predicting floods in low-resource areas. The methodology included four main parts: building multimodal inputs, developing and fine-tuning a teacher model, designing a student model with geospatial conditioning, and implementing a knowledge distillation process led by a multi-component loss function. The entire training and evaluation process was built to enable predictions on edge platforms while keeping high accuracy.
3.1. Dataset and Preprocessing
The primary atmospheric and hydrological inputs used in this study were derived from the ERA5 reanalysis dataset and selected outputs from CMIP6 historical climate simulations. ERA5, produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), Reading, UK, provides observation-constrained hourly atmospheric reanalysis fields and served as the principal source of dynamically consistent meteorological variables, including precipitation, temperature, soil moisture, surface runoff, and wind-related variables [24].
In contrast, CMIP6 data were not used as reanalysis products but rather as auxiliary climate-model simulations employed during the teacher-model pretraining stage to expose the network to broader climatological variability and long-term hydroclimatic patterns [25]. Specifically, historical CMIP6 experiments were utilized only for large-scale representation learning prior to regional fine-tuning on observation-aligned datasets. This distinction is important because CMIP6 simulations represent model-generated climate trajectories under prescribed forcings rather than observation-constrained atmospheric reconstructions. To avoid ambiguity, we explicitly distinguish between ERA5 (observationally constrained reanalysis) and CMIP6 (climate model simulations) throughout the study.
To maintain consistency across datasets, only variables common to both ERA5 and selected CMIP6 historical simulations were retained during pretraining. The final forecasting stages and downstream flood prediction tasks relied primarily on ERA5, satellite observations, and in situ hydrological measurements.
To capture high-frequency, short-duration rainfall events that often cause flash floods, the framework included precipitation estimates from the Global Precipitation Measurement (GPM) mission, providing data at 0.1° spatial and 30 min temporal resolution [26]. Ground-based measurements, including river discharge and rainfall from national hydrological services, were used when available to improve accuracy. Global repositories such as the Global Runoff Data Centre (GRDC) and the Global Precipitation Climatology Centre (GPCC) were used not only for validation but also to ensure traceability of hydrological event signals [27,28]. Each flood event incorporated into the dataset is explicitly linked to its source metadata, including event identifiers, geographic extents, basin associations, and time intervals derived from these observational records.
For geospatial modeling, static inputs such as elevation and land cover were included. Elevation data were obtained from the Shuttle Radar Topography Mission (SRTM) and the hydrologically conditioned HydroSHEDS dataset, both enabling accurate delineation of catchment structures and runoff pathways [29,30]. Land cover and surface characteristics were defined using the MODIS MCD12Q1 dataset, which provides global annual land cover classification at 500 m resolution according to the IGBP scheme [31].
A thorough data harmonization and preprocessing pipeline was established to align heterogeneous data sources from climate reanalysis, satellite observations, and in situ hydrological measurements. A key clarification is that spatial alignment to a common grid was performed at a nominal resolution of 0.1° to enable multimodal fusion; however, this does not imply that all datasets possess intrinsic or physically meaningful resolution at this scale.
Each dataset was first processed at its native spatial resolution before transformation to the unified grid. ERA5 reanalysis fields (native resolution approximately 0.25°) and CMIP6 climate model outputs (typically 1–2° resolution) were upsampled to 0.1° using bilinear interpolation for consistency. In contrast, high-resolution static geospatial layers such as MODIS land cover (500 m), SRTM elevation (30–90 m), and HydroSHEDS products (3–15 arc-seconds) were aggregated to 0.1° using mean aggregation for continuous variables and majority voting for categorical variables. These transformations are strictly for computational alignment and do not introduce additional physical information beyond the native resolution of each dataset.
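The aggregation of fine static layers onto a coarser grid can be illustrated with a minimal NumPy sketch. It assumes the fine grid divides evenly into the coarse grid by an integer `factor`; the function names and the toy arrays are hypothetical and are not taken from the authors' pipeline.

```python
import numpy as np

def block_mean(fine, factor):
    """Aggregate a continuous 2-D field by averaging factor x factor blocks."""
    h, w = fine.shape
    assert h % factor == 0 and w % factor == 0, "grid must divide evenly"
    blocks = fine.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def block_majority(classes, factor):
    """Aggregate a categorical 2-D field by majority vote within each block."""
    h, w = classes.shape
    assert h % factor == 0 and w % factor == 0, "grid must divide evenly"
    out = np.empty((h // factor, w // factor), dtype=classes.dtype)
    for i in range(h // factor):
        for j in range(w // factor):
            block = classes[i * factor:(i + 1) * factor,
                            j * factor:(j + 1) * factor]
            values, counts = np.unique(block, return_counts=True)
            # Ties resolve to the smallest class id (an arbitrary choice here).
            out[i, j] = values[np.argmax(counts)]
    return out

# Toy inputs: a 4 x 4 elevation field and a 4 x 4 land cover map.
elevation = np.arange(16, dtype=float).reshape(4, 4)
land_cover = np.array([[1, 1, 2, 2],
                       [1, 3, 2, 2],
                       [4, 4, 5, 6],
                       [4, 7, 6, 6]])
coarse_elev = block_mean(elevation, 2)
coarse_lc = block_majority(land_cover, 2)
```

Averaging class identifiers would produce meaningless intermediate values, which is why the categorical layer is reduced by vote rather than by mean.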
Continuous variables such as precipitation and soil moisture were interpolated using bilinear resampling, while categorical data such as land cover classes were processed using nearest-neighbor interpolation to preserve discrete class structure. All time-series variables were standardized to a uniform 3-hourly temporal resolution, and missing values resulting from data gaps or sensor inconsistencies were filled using linear interpolation with controlled gap handling to avoid physically unrealistic long-range interpolation. Continuous variables were normalized using z-score standardization to stabilize training.
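One way to realize gap-limited linear interpolation followed by z-score standardization is sketched below. The maximum gap length (`max_gap`) is a hypothetical parameter, since the text does not state the threshold used, and gaps at the series edges are left unfilled as a conservative assumption.

```python
import numpy as np

def fill_short_gaps(series, max_gap):
    """Linearly interpolate NaN runs of length <= max_gap; keep longer runs as NaN."""
    x = np.asarray(series, dtype=float)
    out = x.copy()
    isnan = np.isnan(x)
    idx = np.arange(x.size)
    valid = ~isnan
    if valid.sum() < 2:
        return out
    interpolated = np.interp(idx, idx[valid], x[valid])
    start = None
    for i in range(x.size + 1):
        inside = i < x.size and isnan[i]
        if inside and start is None:
            start = i                      # a new gap begins here
        elif not inside and start is not None:
            # Only fill gaps bounded by observations on both sides.
            interior = start > 0 and i < x.size
            if interior and (i - start) <= max_gap:
                out[start:i] = interpolated[start:i]
            start = None
    return out

def zscore(x):
    """Standardize using the mean and standard deviation of non-missing values."""
    return (x - np.nanmean(x)) / np.nanstd(x)

precip = np.array([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
filled = fill_short_gaps(precip, max_gap=2)
```

In this toy series the single missing step is filled, while the three-step gap exceeds `max_gap` and remains missing.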
The multimodal inputs were then combined into a unified tensor representation capturing both static and dynamic environmental variables across space and time. This representation enables the model to learn coupled interactions between precipitation forcing, soil moisture dynamics, topographic controls, and river discharge evolution, which are essential for early flood onset prediction.
For supervised learning, output targets were constructed using a harmonized spatiotemporal labeling pipeline derived from event-based flood records and hydrological observations. Flood occurrence labels were primarily sourced from the event archive of the Dartmouth Flood Observatory (DFO; University of Colorado Boulder, Boulder, CO, USA) and complementary regional disaster databases. To ensure transparency and reproducibility, each flood event is explicitly linked to its metadata, including event identifiers, affected regions, basin boundaries, and timestamps derived from observational records. All discharge targets were standardized using basin-wise z-score normalization prior to training. The inverse transformation was not applied during evaluation, so discharge errors are reported in standardized units, which keeps comparative error analysis consistent across basins.
Since these datasets are event-based rather than grid-based, a structured conversion pipeline was implemented to transform flood events into gridded spatiotemporal training labels. Each DFO event was represented as a spatial polygon with associated start and end timestamps, which were rasterized onto the model grid at 0.1° resolution to generate binary flood-affected masks. To account for uncertainty in reporting and hydrological delay, each event was assigned a temporal propagation window spanning its reported duration with additional buffer steps. In cases of spatial or temporal overlap between events, a union-based aggregation strategy was applied to prevent duplication and ensure consistent event representation.
The resulting grid-time labels therefore represent harmonized event footprints rather than direct pixel-level measurements. We explicitly acknowledge that this introduces uncertainty due to coarse event reporting resolution and ambiguity in mapping large-scale flood documentation to fine-resolution grid cells.
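The conversion from event records to gridded labels can be sketched as follows. For brevity, each event footprint is approximated by a latitude-longitude bounding box rather than a true polygon, time is expressed as integer step indices, and the buffer length is a hypothetical parameter; none of these simplifications is taken from the authors' implementation.

```python
import numpy as np

def rasterize_events(events, lat0, lon0, res, shape, n_steps, buffer_steps=1):
    """Build a (time, lat, lon) binary mask from bounding-box events.

    Each event is (lat_min, lat_max, lon_min, lon_max, t_start, t_end).
    Overlapping events are merged by logical OR, i.e. a union.
    """
    n_lat, n_lon = shape
    lats = lat0 + res * (np.arange(n_lat) + 0.5)   # cell-centre latitudes
    lons = lon0 + res * (np.arange(n_lon) + 0.5)   # cell-centre longitudes
    mask = np.zeros((n_steps, n_lat, n_lon), dtype=bool)
    for lat_min, lat_max, lon_min, lon_max, t_start, t_end in events:
        rows = (lats >= lat_min) & (lats <= lat_max)
        cols = (lons >= lon_min) & (lons <= lon_max)
        footprint = np.outer(rows, cols)
        # Extend the reported duration by the buffer on both sides.
        t0 = max(0, t_start - buffer_steps)
        t1 = min(n_steps, t_end + buffer_steps + 1)
        mask[t0:t1] |= footprint
    return mask

# Two overlapping toy events on a 4 x 4 grid at 0.1 degree resolution.
events = [
    (0.0, 0.2, 0.0, 0.2, 2, 3),
    (0.1, 0.3, 0.1, 0.3, 3, 4),
]
labels = rasterize_events(events, lat0=0.0, lon0=0.0, res=0.1,
                          shape=(4, 4), n_steps=8, buffer_steps=1)
```

Because the union is taken with a logical OR, a cell covered by both events at the same step is counted once rather than twice.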
For flood severity estimation, depth categories (minor, moderate, severe) were constructed using regional hydrological reports, disaster impact assessments, and model-derived inundation proxies where direct measurements were unavailable. The severity bins were defined as: minor (0–0.3 m), moderate (0.3–1.0 m), and severe (>1.0 m). We acknowledge that in several regions, particularly in Sub-Saharan Africa, direct depth observations are sparse, and thus these labels should be interpreted as best-effort hydrological approximations rather than direct measurements.
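The severity binning can be expressed compactly with `numpy.digitize`. The bin edges below are the ones stated above; assigning a depth that falls exactly on an edge to the lower class is an assumption, chosen so that only depths strictly above 1.0 m count as severe.

```python
import numpy as np

SEVERITY_LABELS = ("minor", "moderate", "severe")
DEPTH_EDGES_M = (0.3, 1.0)   # upper edges of the minor and moderate bins

def severity_class(depth_m):
    """Map inundation depth in metres to a class index (0, 1, or 2)."""
    depth = np.asarray(depth_m, dtype=float)
    # right=True places a value equal to an edge in the lower bin.
    return np.digitize(depth, DEPTH_EDGES_M, right=True)

classes = severity_class([0.1, 0.3, 0.5, 1.0, 1.2])
names = [SEVERITY_LABELS[c] for c in classes]
```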
Geographically, the study focused on flood-prone low-resource countries where forecasting capacity is limited and risk exposure is high. Case studies included Nigeria, Kenya, and Mozambique in Sub-Saharan Africa, and Bangladesh, India, and Nepal in South Asia, covering diverse hydrological regimes including monsoonal river flooding, flash floods, and seasonal basin overflow events. Each country was treated as a separate forecasting zone using geospatial masks derived from global hydrological datasets, enabling both within-region training and cross-region transferability analysis.
3.2. Input Representation and Data Fusion
Attention-related representations extracted from the teacher and student networks are explicitly distinguished according to their underlying architectural mechanisms. Since the Fourier Neural Operator (FNO) teacher does not natively compute transformer-style Query, Key, and Value (QKV) self-attention, the teacher-side interaction representations are constructed from spectral channel interaction responses derived from Fourier mixing operations.
For the student model, the transformer attention matrix at layer $l$ is defined as:
$$A_S^{(l)} \in \mathbb{R}^{N_l \times N_l}$$
where $N_l = T \cdot H_l \cdot W_l$ is the total number of spatiotemporal tokens formed by flattening the temporal dimension ($T$ steps) and the spatial dimensions ($H_l \times W_l$) at layer $l$. Thus, the student attention mechanism operates over joint spatiotemporal representations rather than purely spatial tokens.
Specifically, each token corresponds to a feature vector associated with a particular temporal step and spatial grid location:
$$z_{t,i,j}^{(l)} \in \mathbb{R}^{d}$$
where
$t$ denotes temporal position,
$(i, j)$ denotes spatial coordinates,
$d$ is the latent embedding dimension.
The resulting transformer attention matrices therefore capture dependencies across both temporal evolution and spatial hydrological interactions.
For the teacher network, spectral interaction maps are extracted from Fourier-domain feature responses and represented as:
$$A_T^{(l)} \in \mathbb{R}^{M_l \times M_l}$$
where $M_l = H_l \cdot W_l$ is the size of the flattened spatial spectral representation at layer $l$. These interaction maps encode spatial correlation structures learned through spectral convolution and Fourier channel mixing.
Because the teacher and student representations operate in different domains (spectral spatial interactions vs. spatiotemporal transformer attention), a learned projection adapter is applied before alignment:
$$\tilde{A}_T^{(l)} = \phi_l\left(A_T^{(l)}\right), \qquad \phi_l : \mathbb{R}^{M_l \times M_l} \rightarrow \mathbb{R}^{N_l \times N_l}$$
where $\phi_l$ is a learned projection operator that maps teacher spectral interaction representations into the student spatiotemporal attention space.
The final attention alignment objective is therefore defined as:
$$\mathcal{L}_{att} = \sum_{l=1}^{L} \left\| \phi_l\left(A_T^{(l)}\right) - A_S^{(l)} \right\|_F^2$$
where
$L$ denotes the number of aligned layer pairs,
$\phi_l$ is the learned projection adapter,
$\|\cdot\|_F^2$ denotes the squared Frobenius norm.
This formulation enables stable cross-architecture distillation between the spectral FNO teacher and the lightweight transformer-based student model while preserving both spatial hydrological interactions and temporal flood evolution patterns.
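The project-then-compare step of the attention alignment can be sketched in NumPy as below. The text does not specify the form of the learned adapter, so the sketch assumes a bilinear lift `P @ A @ P.T` with a matrix `P` of shape (N, M) that maps an M x M teacher interaction map to the N x N student attention space. In training `P` would be a learnable parameter; here it is a fixed random matrix, and all names are hypothetical.

```python
import numpy as np

def attention_alignment_loss(teacher_maps, student_attn, projections):
    """Sum over layers of the squared Frobenius distance after projection."""
    loss = 0.0
    for a_t, a_s, p in zip(teacher_maps, student_attn, projections):
        # (N, M) @ (M, M) @ (M, N) -> (N, N): lift the spatial map to token space.
        projected = p @ a_t @ p.T
        loss += np.sum((projected - a_s) ** 2)
    return loss

rng = np.random.default_rng(0)
T, H, W = 2, 2, 2
M, N = H * W, T * H * W            # spatial positions vs. spatiotemporal tokens
a_t = rng.random((M, M))           # toy teacher interaction map
p = rng.standard_normal((N, M))    # stand-in for the learned adapter
a_s = p @ a_t @ p.T                # a student map that matches perfectly

zero_loss = attention_alignment_loss([a_t], [a_s], [p])
shifted_loss = attention_alignment_loss([a_t], [a_s + 1.0], [p])
```

Shifting every entry of the student map by 1 raises the loss to the number of entries, N x N = 64, which gives a quick sanity check on the norm.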
3.3. Teacher Model Design with High-Resolution Deep Climate Architecture
The teacher model was instantiated using a Fourier Neural Operator (FNO)–based architecture inspired by FourCastNet, a Fourier-based deep learning framework developed by NVIDIA Research and collaborators, designed to capture long-range spatiotemporal dependencies in climate fields. The model consists of 12 spectral convolution layers, each operating in the frequency domain using a truncated Fourier representation. For each layer, 32 spectral modes were retained along both spatial dimensions, balancing expressive capacity with computational efficiency.
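A single truncated spectral convolution of the kind used in FNO-style layers can be sketched with NumPy's real FFT. This is a minimal illustration, not the authors' teacher: the weight layout, the function name, and the omission of the pointwise path, nonlinearity, and residual connection are all simplifying assumptions.

```python
import numpy as np

def spectral_conv2d(x, weights, modes):
    """Apply a truncated Fourier-domain linear map to a (C_in, H, W) field.

    weights has shape (2, C_in, C_out, modes, modes): one complex block for
    the low positive row frequencies and one for the low negative ones.
    Requires 2 * modes <= H and modes <= W // 2 + 1.
    """
    c_in, h, w = x.shape
    c_out = weights.shape[2]
    x_hat = np.fft.rfft2(x, axes=(-2, -1))            # (C_in, H, W // 2 + 1)
    out_hat = np.zeros((c_out, h, w // 2 + 1), dtype=complex)
    # Mix channels mode by mode; all other frequencies are discarded.
    out_hat[:, :modes, :modes] = np.einsum(
        "ixy,ioxy->oxy", x_hat[:, :modes, :modes], weights[0])
    out_hat[:, -modes:, :modes] = np.einsum(
        "ixy,ioxy->oxy", x_hat[:, -modes:, :modes], weights[1])
    return np.fft.irfft2(out_hat, s=(h, w), axes=(-2, -1))

# With identity channel mixing the layer acts as a low-pass filter,
# so a constant field passes through unchanged.
field = np.ones((2, 8, 8))
identity = np.zeros((2, 2, 2, 2, 2), dtype=complex)
for c in range(2):
    identity[:, c, c] = 1.0
filtered = spectral_conv2d(field, identity, modes=2)
```

Retaining only the lowest modes is what keeps the parameter count independent of grid size: the weights scale with the number of retained modes, not with H x W.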
The teacher operates at a spatial resolution of 0.1°, consistent with the harmonized input data, and processes multivariate climate fields over temporal windows of length $T$. Each spectral convolution layer is followed by pointwise nonlinear transformations and residual connections to stabilize training. The hidden channel width was fixed at 256 feature maps across layers.
The teacher model was first pretrained using ERA5 reanalysis fields together with selected CMIP6 historical climate simulations to learn large-scale atmospheric and hydrological representations across diverse climatic regimes. During this stage, CMIP6 outputs were used solely as auxiliary climate-model simulations to enrich climatological variability during representation learning, while ERA5 provided the primary observation-constrained atmospheric states used for downstream regional adaptation. Subsequently, it was partially fine-tuned on region-specific data by unfreezing the final four spectral layers while keeping earlier layers fixed. This strategy preserved global climatological knowledge while allowing adaptation to localized flood-generating processes such as regional rainfall–runoff relationships and terrain-driven flow accumulation.
The final teacher outputs, denoted $\hat{y}_T$, were produced per grid cell and per forecast horizon and served as soft targets for knowledge distillation into the student model. A complete schematic of the teacher–student knowledge distillation framework is presented in Figure 1.
Although the Fourier Neural Operator (FNO)-based teacher architecture was originally developed for large-scale atmospheric forecasting, in this work it is employed as a general spatiotemporal climate representation learner rather than a standalone hydrological simulator. Specifically, the teacher model first learns multivariate atmospheric and land-surface dynamics from ERA5 reanalysis fields and selected CMIP6 historical climate simulations, including precipitation evolution, soil moisture transport, runoff accumulation, temperature variability, and large-scale circulation patterns. These latent climate representations are subsequently adapted to flood forecasting through supervised regional fine-tuning using flood occurrence labels and discharge observations.
To enable hydrological prediction, the final latent representations extracted from the spectral backbone are passed through two task-specific prediction heads. The first head consists of a fully connected classification module with sigmoid activation for binary flood occurrence and flood severity estimation. The second head is a regression module composed of linear projection layers that estimate continuous discharge values for each forecast horizon. This design allows the atmospheric representations learned by the teacher model to be transformed into flood-relevant hydrological outputs while preserving large-scale spatiotemporal dependencies.
Because the teacher and student architectures operate in different representational spaces (spectral representations in the FNO teacher and recurrent-attention representations in the ConvLSTM–Transformer student), a dimensional alignment mechanism was introduced during distillation. Specifically, intermediate teacher feature maps were projected through learned linear adapter layers before feature matching was applied. Let the projection operator be defined as:
$$P_l : \mathbb{R}^{C_T \times H_l \times W_l} \rightarrow \mathbb{R}^{C_S \times H_l \times W_l}$$
where
$C_T$ denotes the number of teacher feature channels,
$C_S$ denotes the number of student feature channels,
$H_l$ and $W_l$ represent the spatial dimensions at layer $l$.
The projection operator aligns teacher feature dimensions with the corresponding student representations at matched layers.
The feature distillation objective is therefore formulated as:
$$\mathcal{L}_{feat} = \sum_{l=1}^{L} \left\| P_l\left(F_T^{(l)}\right) - F_S^{(l)} \right\|_2^2$$
where
$L$ denotes the number of matched teacher–student layer pairs used for distillation,
$F_T^{(l)}$ represents the feature map extracted from the teacher model at layer $l$,
$F_S^{(l)}$ represents the corresponding student feature representation,
and $P_l$ is the learned projection adapter used for dimensional alignment.
Similarly, interaction representations extracted from the teacher network were reshaped and aligned with the student attention matrices prior to optimization. This projection-based alignment ensured stable knowledge transfer despite the architectural heterogeneity between the spectral FNO teacher and the lightweight recurrent-attention student model.
3.4. Student Model Design with Geospatial Embedding
The student model was designed as a lightweight hybrid architecture, presented in Figure 2, combining convolutional recurrent modeling with attention-based temporal reasoning. Specifically, the model consists of two stacked ConvLSTM layers with hidden dimensions of 64 and 128 channels, respectively. These layers capture localized spatiotemporal dependencies in precipitation and hydrological signals while maintaining a low parameter footprint.
The ConvLSTM outputs are passed to a temporal transformer module composed of 2 transformer encoder layers, each with 4 attention heads and a model dimension of 128. This module enables selective attention over temporal states, allowing the student to focus on critical rainfall accumulation and soil saturation patterns preceding flood events.
Geospatial priors derived from elevation, slope, and land-use attributes are embedded using a two-layer Multilayer Perceptron (MLP) with Rectified Linear Unit (ReLU) activations, producing a geospatial embedding of dimension . These embeddings are injected into the ConvLSTM hidden states via spatial concatenation, providing explicit terrain-aware conditioning.
In total, the student model contains approximately 1.8 million parameters, compared with over 60 million in the teacher model, a reduction of more than 30-fold. This reduction enables efficient deployment on edge devices without sacrificing predictive performance.
3.5. Knowledge Distillation Loss Formulation
To effectively transfer hydrometeorological knowledge from the high-capacity teacher model to the lightweight student model, a multi-objective distillation framework is employed. The total training objective combines direct supervision from ground-truth observations with structured knowledge transfer from the teacher’s outputs, intermediate representations, and attention patterns.
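The overall loss function is defined as:
$$\mathcal{L}_{total} = \alpha \, \mathcal{L}_{hard} + \beta \, \mathcal{L}_{soft} + \gamma \, \mathcal{L}_{feat} + \lambda \, \mathcal{L}_{att}$$
where
$\mathcal{L}_{total}$ is the total loss function;
$\mathcal{L}_{hard}$ denotes the supervised learning loss;
$\mathcal{L}_{soft}$ represents the soft-target distillation loss;
$\mathcal{L}_{feat}$ is the feature alignment loss;
$\mathcal{L}_{att}$ is the attention alignment loss;
$\alpha$, $\beta$, $\gamma$, and $\lambda$ are coefficients controlling the relative contribution of each term.
In this study, the weights are set as:
$$\alpha = 1.0, \quad \beta = 0.5, \quad \gamma = 0.3, \quad \lambda = 0.2$$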
These values are selected through validation-based tuning to balance prediction accuracy and stable knowledge transfer.
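As a small worked example, the weighted combination can be written as a plain function. The coefficients are those reported in the training configuration (α = 1.0, β = 0.5, γ = 0.3, λ = 0.2); the component loss values passed in below are hypothetical numbers chosen only to show the arithmetic.

```python
# Coefficients for the hard, soft-target, feature, and attention terms.
LOSS_WEIGHTS = {"hard": 1.0, "soft": 0.5, "feat": 0.3, "att": 0.2}

def total_loss(components, weights=LOSS_WEIGHTS):
    """Weighted sum of the four loss components."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing loss components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)

# Hypothetical component values for one training step.
example = total_loss({"hard": 0.40, "soft": 0.20, "feat": 0.10, "att": 0.05})
# 1.0 * 0.40 + 0.5 * 0.20 + 0.3 * 0.10 + 0.2 * 0.05 = 0.54
```

With these weights the supervised term dominates, and the three distillation terms act as progressively weaker regularizers.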
3.5.1. Hard Supervision Loss
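The hard loss enforces learning from ground-truth labels and is defined as follows:
For flood classification:
$$\mathcal{L}_{hard}^{cls} = \mathrm{CE}\left(\hat{y}_S, y\right)$$
For discharge regression:
$$\mathcal{L}_{hard}^{reg} = \mathrm{MSE}\left(\hat{y}_S, y\right)$$
where
$\hat{y}_S$ represents the student model prediction;
$y$ represents the ground-truth observation;
$\mathrm{CE}$ and $\mathrm{MSE}$ denote the cross-entropy and mean squared error, respectively.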
3.5.2. Soft-Target Distillation Loss
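The soft loss aligns student predictions with the teacher’s probability distribution using Kullback–Leibler divergence and temperature scaling:
$$\mathcal{L}_{soft} = \mathrm{KL}\left( \sigma\left(\hat{y}_T / \tau\right) \,\middle\|\, \sigma\left(\hat{y}_S / \tau\right) \right)$$
where
$\hat{y}_T$ is the teacher prediction;
$\hat{y}_S$ is the student prediction;
$\tau$ is the temperature parameter ($\tau = 4$ in this study);
$\sigma(\cdot)$ denotes the softmax function.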
The temperature parameter is used to smooth probability distributions, enabling better transfer of inter-class relationships.
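The temperature-scaled divergence can be sketched in NumPy as follows, using a temperature of 4 as reported for this study. Some formulations of knowledge distillation additionally multiply the term by the squared temperature; whether that scaling is used here is not stated, so it is omitted. Function names and toy logits are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_target_loss(teacher_logits, student_logits, tau=4.0):
    """Mean KL(teacher || student) between temperature-softened distributions."""
    p = softmax(np.asarray(teacher_logits, dtype=float) / tau)
    q = softmax(np.asarray(student_logits, dtype=float) / tau)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(kl.mean())

teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[0.0, 1.0, 0.5]])
matched = soft_target_loss(teacher, teacher)      # identical logits
mismatched = soft_target_loss(teacher, student)   # disagreeing logits

# Softening: the top-class probability shrinks as the temperature grows.
sharp = softmax(teacher / 1.0).max()
smooth = softmax(teacher / 4.0).max()
```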
3.5.3. Feature Distillation Loss
To transfer intermediate representations, a feature alignment loss is introduced. Since the teacher and student operate in different representation spaces (spectral representations in the teacher vs. recurrent-attention representations in the student), a learned projection operator is used.
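The feature loss is defined as:
$$\mathcal{L}_{feat} = \sum_{l=1}^{L} \left\| P_l\left(F_T^{(l)}\right) - F_S^{(l)} \right\|_2^2$$
where
$L$ is the number of matched layer pairs;
$l$ is the layer index;
$F_T^{(l)}$ is the teacher feature map at layer $l$;
$F_S^{(l)}$ is the corresponding student feature representation at layer $l$;
$P_l$ is a learned linear projection mapping teacher features into the student feature space;
$\|\cdot\|_2^2$ denotes the squared L2 norm.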
This projection ensures dimensional compatibility between heterogeneous architectures.
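A minimal NumPy sketch of the projected feature matching is given below. It assumes the linear projection acts on the channel axis only (equivalent to a 1 x 1 convolution) and pairs the 256-channel teacher width with the 128-channel student layer; which layers are actually matched is not specified, so this pairing and all names are hypothetical. The projection is a fixed random matrix here, whereas in training it would be learned.

```python
import numpy as np

def feature_distillation_loss(teacher_feats, student_feats, projections):
    """Sum over layers of the squared L2 distance after channel projection."""
    loss = 0.0
    for f_t, f_s, w in zip(teacher_feats, student_feats, projections):
        # w: (C_S, C_T) maps f_t: (C_T, H, W) to (C_S, H, W), pixel by pixel.
        projected = np.einsum("sc,chw->shw", w, f_t)
        loss += np.sum((projected - f_s) ** 2)
    return loss

rng = np.random.default_rng(0)
f_t = rng.standard_normal((256, 4, 4))        # toy teacher feature map
w = rng.standard_normal((128, 256)) / 16.0    # stand-in for the learned adapter
f_s = np.einsum("sc,chw->shw", w, f_t)        # a perfectly matching student map

perfect = feature_distillation_loss([f_t], [f_s], [w])
offset = feature_distillation_loss([f_t], [f_s + 1.0], [w])
```

An offset of 1 in every student entry yields a loss equal to the number of entries, 128 x 4 x 4 = 2048.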
3.5.4. Attention Alignment Loss
To preserve the teacher’s spatiotemporal reasoning structure, attention knowledge is distilled through a learned projection-based alignment mechanism. Since the teacher (FNO-based spectral model) and student (transformer-based model) operate in fundamentally different representational spaces, a learned adapter function is required to ensure meaningful alignment. The attention alignment loss is defined as:
$$\mathcal{L}_{att} = \sum_{l=1}^{L} \left\| \phi_l\left(A_T^{(l)}\right) - A_S^{(l)} \right\|_F^2$$
where
$A_T^{(l)}$ and $A_S^{(l)}$ are the teacher interaction map and the student attention matrix at layer $l$,
$L$ denotes the number of aligned layer pairs,
$\phi_l$ is the learned projection adapter,
$\|\cdot\|_F^2$ denotes the squared Frobenius norm.
This formulation ensures consistent cross-architecture alignment between the Fourier-based teacher and transformer-based student.
This projection-based formulation is essential due to the structural mismatch between spectral convolutional interactions in the teacher and self-attention mechanisms in the student model.
3.6. Training Pipeline and Edge Profiling
The model development process followed three stages. First, the teacher model was initialized and adapted regionally. Second, knowledge distillation was performed to train the lightweight student model. Finally, edge-device profiling was conducted. The overall workflow begins with pretraining the teacher model on large-scale global climate datasets, specifically ERA5 reanalysis fields and CMIP6 historical climate simulations. These datasets provide long-term, multivariate climate signals that enable the model to learn generalized hydrological patterns of atmospheric and land-surface dynamics. To complement global-scale learning, region-specific climate–hydrological patches were extracted using geospatial masking and combined with satellite and in situ hydrological observations, thereby enriching the representation of localized flood-prone dynamics.
To ensure rigorous evaluation and prevent temporal or spatial leakage, the dataset was partitioned using a spatiotemporal generalization strategy. Data from 2010 to 2018 were used for training, 2019 to 2020 for validation, and 2021 to 2023 for final testing. This temporal separation ensures that the model is evaluated strictly on unseen future periods, reflecting realistic forecasting conditions rather than interpolation within known time windows. In addition, a leave-one-region-out evaluation protocol was adopted to assess spatial generalization. Under this setting, each target region (Nigeria, Kenya, Mozambique, Bangladesh, India, and Nepal) was systematically excluded during training and used exclusively for testing. This design enables a strict evaluation of cross-basin transferability across heterogeneous hydrological regimes.
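The two partitioning rules can be written down directly. The sketch below uses the year ranges and region list given above; the function names are hypothetical, and how the temporal and regional splits are combined in a single experiment is left open.

```python
REGIONS = ["Nigeria", "Kenya", "Mozambique", "Bangladesh", "India", "Nepal"]

def temporal_split(year):
    """Assign a calendar year to the train, validation, or test period."""
    if 2010 <= year <= 2018:
        return "train"
    if 2019 <= year <= 2020:
        return "val"
    if 2021 <= year <= 2023:
        return "test"
    raise ValueError(f"year {year} is outside the 2010-2023 study period")

def leave_one_region_out(regions=REGIONS):
    """Yield (training regions, held-out region) for each fold."""
    for held_out in regions:
        train_regions = [r for r in regions if r != held_out]
        yield train_regions, held_out

folds = list(leave_one_region_out())
```

Each of the six folds trains on five regions and tests on the remaining one, so no grid cell from the target region is seen during training.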
Following global pretraining, the teacher model was fine-tuned on regional datasets to adapt its representations to local hydrological processes such as rainfall–runoff response, soil moisture evolution, and topographically driven flow accumulation. This fine-tuning stage was supervised using historical flood records and discharge observations, ensuring that the learned representations are explicitly aligned with flood-relevant predictive tasks. Once adapted, the fine-tuned teacher model served as the supervisory network in the knowledge distillation stage, where its outputs, intermediate feature maps, and attention patterns were transferred to the lightweight student model. The student architecture, built from ConvLSTM layers, a temporal transformer module, and geospatial embeddings, was trained to replicate the teacher’s predictive behavior while significantly reducing computational complexity and memory requirements.
To ensure robustness and fair evaluation, all experiments were conducted under a unified implementation framework using PyTorch Lightning (PyTorch 2.1.0) with identical preprocessing pipelines across models. Training was performed on an NVIDIA A100 GPU (NVIDIA Corporation, Santa Clara, CA, USA) using mixed-precision (FP16) computation to enhance efficiency while maintaining numerical stability. Input sequences consisted of eight temporal steps at three-hour intervals, representing a 24 h historical context window, and all variables were harmonized to a nominal spatial resolution of 0.1° following preprocessing alignment. A batch size of 16 was used consistently across all experiments to balance computational efficiency and convergence stability.
Model optimization was carried out using the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻⁵, configured with momentum parameters β₁ = 0.9 and β₂ = 0.999. A cosine annealing learning rate schedule with a warm-up phase was applied, in which the learning rate was gradually increased over the first five epochs before decaying to a minimum value of 1 × 10⁻⁶. Training was conducted for up to 50 epochs, with early stopping based on validation loss using a patience of 10 epochs to mitigate overfitting.
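The shape of such a schedule can be sketched with the standard library alone. The peak rate, minimum rate, warm-up length, and epoch budget follow the values above; the linear form of the warm-up and the per-epoch (rather than per-step) update are assumptions, since the text only says the rate was gradually increased.

```python
import math

def learning_rate(epoch, base_lr=1e-4, min_lr=1e-6,
                  warmup_epochs=5, total_epochs=50):
    """Learning rate used for a zero-indexed epoch."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr in the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(50)]
```

If early stopping ends training before epoch 50, the rate simply never reaches its minimum; the schedule itself is unaffected.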
For deployment benchmarking, inference experiments were conducted using ONNX Runtime with batch size fixed at 1 to simulate real-time operational deployment conditions. The benchmark input tensor dimensions were configured as C × T × H × W = 14 × 8 × 64 × 64, corresponding to 14 multimodal environmental variables, 8 temporal steps, and a 64 × 64 spatial grid. To improve inference efficiency on edge hardware, INT8 post-training quantization was applied during deployment profiling, while structural pruning was not used in order to preserve predictive stability. Latency measurements were averaged over multiple inference runs on a Raspberry Pi 4 Model B (Raspberry Pi Foundation, Cambridge, UK) and an NVIDIA Jetson Nano Developer Kit (NVIDIA Corporation, Santa Clara, CA, USA) under identical runtime settings.
To ensure reproducibility and consistency in deployment reporting, model size is defined as the serialized ONNX model footprint in FP32 precision, including computational graph metadata and runtime buffers. This definition follows standard deployment benchmarking practices in edge AI literature.
The reported parameter counts reflect learnable weights only, while model size (MB) reflects full serialized inference artifacts. Also, INT8 quantization was applied exclusively for deployment benchmarking on edge devices (Raspberry Pi 4 and NVIDIA Jetson Nano), and does not alter the reported FP32 model size values used for compression ratio analysis.
Memory profiling included both static model parameters and dynamic activation memory generated during sequential ConvLSTM processing. Consequently, reported Random Access Memory (RAM) consumption reflects not only the compact model size but also the temporary storage required for hidden-state propagation and transformer attention computations during inference.
The forecasting configuration is defined as a sequence-to-one prediction task. The model receives an input historical window of 24 h, represented as 8 time steps at 3 h intervals, and produces a forecast for flood occurrence and discharge conditions at a 24 h lead time (t + 24 h). This setup ensures short-range predictive capability suitable for early flood warning applications in low-resource environments.
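The sequence-to-one sample construction can be sketched as follows. At 3-hourly resolution the 24 h history corresponds to 8 steps and the 24 h lead time to a further 8 steps after the last input step; the function name and the toy series are hypothetical.

```python
import numpy as np

def make_windows(series, history_steps=8, lead_steps=8):
    """Pair each history window with the value lead_steps after its last step.

    series has shape (time, ...) at 3-hourly resolution.
    """
    n = series.shape[0] - history_steps - lead_steps + 1
    if n <= 0:
        raise ValueError("series is too short for the requested window and lead")
    inputs = np.stack([series[i:i + history_steps] for i in range(n)])
    # Index of the target: last input step (t) plus the lead (t + 24 h).
    target_idx = np.arange(n) + history_steps - 1 + lead_steps
    return inputs, series[target_idx]

series = np.arange(20)        # toy series whose values equal their time index
x, y = make_windows(series)
```

In the toy series the first window covers steps 0 to 7 and is paired with step 15, eight steps after its last input.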
For benchmarking against the Global Flood Awareness System (GloFAS v4.0), developed jointly by the European Commission Joint Research Centre (JRC), Ispra, Italy, and the European Centre for Medium-Range Weather Forecasts (ECMWF), Reading, United Kingdom, we align evaluation by comparing model outputs against GloFAS forecasts at overlapping lead-time windows. Since GloFAS provides ensemble forecasts at multi-day horizons (3–5 days), we aggregate and temporally align these outputs to the nearest comparable evaluation window for consistent cross-system comparison. This ensures that performance differences reflect modeling approach rather than temporal mismatch.
The knowledge distillation process employed a multi-objective loss function combining supervised learning, soft-target alignment, feature matching, and attention alignment. To address severe imbalance between flood and non-flood samples, class-weighted supervision was incorporated into the hard-loss component, with weights computed from inverse class frequency statistics derived from the training set. The weighting coefficients were fixed at α = 1.0, β = 0.5, γ = 0.3, and λ = 0.2, while the distillation temperature was set to τ = 4. These hyperparameters were selected through validation-based tuning to ensure stable convergence and balanced information transfer between teacher and student models. The teacher model was initialized through pretraining on ERA5 and CMIP6 datasets, whereas the student model was initialized using Xavier initialization for convolutional components and uniform initialization for transformer-based layers. To ensure result robustness, all experiments were repeated over five independent runs with different random seeds, and final results are reported as mean values with corresponding standard deviations.
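Inverse-frequency class weighting for the hard loss can be sketched as below. The normalization (total count divided by the number of classes times the class count) is one common convention and is an assumption here, as is the binary cross-entropy form; the text states only that weights were computed from inverse class frequencies in the training set.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes=2):
    """Class weights proportional to the inverse of each class frequency."""
    counts = np.bincount(np.asarray(labels, dtype=int).ravel(),
                         minlength=n_classes)
    if np.any(counts == 0):
        raise ValueError("every class must appear in the training labels")
    return counts.sum() / (n_classes * counts)

def weighted_bce(y_true, y_prob, weights, eps=1e-7):
    """Binary cross-entropy with per-class weights (index 1 = flood)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_sample = -(weights[1] * y_true * np.log(y_prob)
                   + weights[0] * (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(per_sample.mean())

# Toy training set: 90 non-flood samples and 10 flood samples.
train_labels = [0] * 90 + [1] * 10
class_weights = inverse_frequency_weights(train_labels)

missed_flood = weighted_bce([1], [0.1], class_weights)   # flood predicted unlikely
false_alarm = weighted_bce([0], [0.9], class_weights)    # non-flood predicted likely
```

With a 9:1 imbalance, a missed flood is penalized nine times more heavily than an equally confident false alarm.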
Finally, the distilled student model underwent detailed edge-deployment profiling to evaluate its suitability for real-time forecasting in low-resource environments. The evaluation included parameter count, inference latency, floating-point operations (FLOPs), model storage size, and peak RAM utilization during sequential inference. Profiling was performed using ONNX Runtime with INT8 quantization enabled and batch size fixed at 1 to emulate real-time deployment conditions on Raspberry Pi 4 and NVIDIA Jetson Nano devices. Reported RAM usage includes both static parameter allocation and dynamic activation memory associated with ConvLSTM temporal state retention and transformer attention operations during inference.
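A batch-size-1 latency measurement of the kind described above could be sketched as follows. This is a generic timing harness, not the profiling code used in the study: `predict` is a hypothetical callable that, in an actual deployment, would wrap the quantized model's inference call, and the warm-up and run counts are illustrative defaults.

```python
import statistics
import time
from typing import Any, Callable, Dict


def profile_latency(
    predict: Callable[[Any], Any],
    sample: Any,
    warmup: int = 5,
    runs: int = 50,
) -> Dict[str, float]:
    """Time repeated single-sample calls and report median and 95th-percentile latency."""
    # Warm-up calls are excluded so one-time setup costs do not skew the timings.
    for _ in range(warmup):
        predict(sample)
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[int(0.95 * (runs - 1))],
        "runs": float(runs),
    }
```

Reporting the median alongside a tail percentile is one way to expose the occasional slow call that matters on thermally constrained devices; memory and FLOP counts would need separate tooling.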
4. Results and Discussion
The performance of the framework was assessed using classification, regression, calibration, and efficiency metrics against the selected baseline models.
The evaluation covered both classification tasks (flood occurrence and severity prediction) and regression tasks (continuous discharge estimation). Because flood prediction at fine spatiotemporal resolution is inherently characterized by severe class imbalance, evaluation was extended beyond conventional accuracy and ROC-AUC metrics to include precision–recall (PR) curves and average precision (AP) scores, which provide more reliable assessment under rare-event conditions. Classification performance was therefore evaluated using accuracy, F1-score, recall, false alarm rate, ROC-AUC, and AP, while regression performance was assessed using mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²).
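For reference, the rare-event and regression metrics named above can be computed as in the following sketch. It uses the standard rank-based definition of average precision and the usual definitions of MAE, RMSE, and R²; it is an illustrative implementation that does not treat tied scores specially, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each positive sample."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive samples")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / n_pos


def regression_metrics(obs: Sequence[float], pred: Sequence[float]) -> Dict[str, float]:
    n = len(obs)
    errors = [p - o for o, p in zip(obs, pred)]
    mean_obs = sum(obs) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
    }
```

Unlike ROC-AUC, the AP value depends directly on how many non-flood samples are ranked above each flood sample, which is why it is more informative when positives are rare.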
Classification thresholds were optimized on the validation set to balance recall and precision for operational flood warning scenarios. During training, a class-weighted supervision strategy was employed to reduce bias toward majority non-flood samples and improve sensitivity to rare flood events.
All deep learning models were evaluated over five independent runs with different random seeds, and results are reported as mean ± standard deviation. Standard deviations remained below 1.5%, indicating stable convergence. The random forest baseline, which is deterministic once its configuration and random seed are fixed, was evaluated using a single fixed configuration.
The training dynamics of the student model, including the convergence behavior of the loss and accuracy curves, are shown in Figure 3. The model showed consistent convergence with minimal overfitting, as validation metrics closely followed the training metrics across epochs.
To comprehensively benchmark the proposed distilled framework, comparisons were conducted against multiple deep learning, operational forecasting, and classical machine learning baselines using identical input modalities and evaluation protocols. The evaluated baselines included a non-distilled ConvLSTM model, a random forest regressor using handcrafted hydrological predictors, transformer-based sequence forecasting architectures including Informer and PatchTST, and an LSTM-based hydrological forecasting model inspired by CAMELS-style regional rainfall–runoff learning frameworks proposed by Kratzert et al. [32].
Large-scale operational flood forecasting outputs from the Global Flood Awareness System (GloFAS v4.0) were incorporated as an external benchmark where compatible discharge observations and forecast horizons were available. GloFAS ensemble reforecast products were evaluated at a 3–5-day lead time and spatially regridded to the unified 0.1° evaluation framework for consistent comparison across models.
Because direct reproduction of physics-based hydrological routing systems such as the Variable Infiltration Capacity (VIC) model, the Soil and Water Assessment Tool (SWAT), and the LISFLOOD hydrological and flood forecasting model requires extensive basin-specific calibration and hydrodynamic parameterization beyond the scope of this study, GloFAS was adopted as a representative operational physics-based baseline. This enables a consistent comparison between the proposed distilled framework and established large-scale hydrological forecasting systems under harmonized spatiotemporal evaluation conditions.
The tradeoff between predictive performance and inference latency illustrates the benefit of model distillation: the student model retained 96.7% of the teacher model’s classification accuracy while achieving an 8× reduction in latency. These results support the feasibility of deploying the distilled model in real-time, low-power environments.
4.1. Quantitative Performance
After establishing the evaluation protocol, the student model was compared with several baseline approaches. The comparison included deep learning models (ConvLSTM, Informer, and PatchTST), an LSTM-based hydrological forecasting model inspired by CAMELS-style rainfall-runoff learning, a Random Forest baseline, and the full-scale FourCastNet teacher model.
As summarized in Table 2, the proposed distilled student model achieved an average classification accuracy of 0.89, F1-score of 0.88, recall of 0.90, and AUC of 0.91, while also attaining an average precision (AP) score of 0.87 under severe class imbalance conditions.
In discharge estimation, performance metrics were computed on z-score normalized discharge values applied per basin to ensure numerical stability and fair comparison across heterogeneous hydrological regimes. Accordingly, MAE and RMSE are reported in normalized discharge units rather than raw physical discharge (m³/s). This ensures that evaluation reflects model predictive error relative to basin-specific variability rather than absolute magnitude differences across catchments.
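Per-basin z-score normalization can be sketched as follows. The basin identifiers and the choice to estimate the mean and standard deviation from a reference period (for example, the training years) are assumptions made for illustration; the text does not specify which period the statistics were computed on.

```python
import statistics
from typing import Dict, List, Sequence, Tuple


def fit_basin_stats(reference: Dict[str, Sequence[float]]) -> Dict[str, Tuple[float, float]]:
    """Estimate (mean, std) of discharge for each basin from a reference period."""
    stats = {}
    for basin, values in reference.items():
        mu = statistics.fmean(values)
        sigma = statistics.pstdev(values)
        if sigma == 0:
            raise ValueError(f"basin {basin!r} has constant discharge")
        stats[basin] = (mu, sigma)
    return stats


def normalize(basin: str, values: Sequence[float],
              stats: Dict[str, Tuple[float, float]]) -> List[float]:
    """Convert raw discharge (m^3/s) to basin-relative z-scores."""
    mu, sigma = stats[basin]
    return [(v - mu) / sigma for v in values]
```

Under this scheme a normalized MAE of 0.17 corresponds to an average error of 0.17 basin standard deviations, so a large river and a small catchment contribute comparably to the aggregate metric.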
Although the teacher model achieved the highest overall predictive performance, the distilled student retained approximately 96.7% of the teacher’s classification capability while reducing inference latency by nearly 8× and model size by more than 20×. Compared with transformer-based baselines such as Informer and PatchTST, the proposed student model achieved comparable forecasting skill with significantly lower computational overhead and memory consumption, making it more suitable for deployment on low-power edge devices.
The CAMELS-style LSTM hydrology baseline demonstrated competitive discharge prediction capability but exhibited reduced generalization across geographically heterogeneous flood regimes. Similarly, GloFAS reference forecasts provided strong large-scale riverine forecasting performance but lacked localized adaptation for high-resolution regional flood prediction. Classical machine learning approaches such as Random Forest showed lower predictive accuracy and weaker temporal consistency, particularly for extreme flood events and long forecasting horizons.
Precision–recall analysis showed that the model maintained reliable rare-event detection performance despite the substantial imbalance between flood and non-flood samples. This indicates that the reported performance is not solely driven by majority-class dominance but reflects meaningful predictive skill for operational flood early-warning applications.
4.2. Generalization Gap Analysis (Train/Validation vs. Test Performance)
A consistent performance gap is observed between training/validation results and final test performance, as reported in Table 2 and Figure 3. This gap is not indicative of model instability, but rather reflects the presence of combined temporal and spatial distribution shifts inherent in the evaluation design.
The temporal split (training: 2010–2018, validation: 2019–2020, test: 2021–2023) introduces a measurable non-stationarity effect in hydroclimatic forcing variables, including changes in precipitation intensity, seasonal variability, and extreme event frequency. These shifts are particularly pronounced in monsoon-influenced regions, where interannual variability is high.
In addition to temporal shift, the evaluation incorporates a leave-one-region-out spatial generalization setting, which further increases distributional divergence between training and test domains. Under this setting, each test region represents a structurally different hydrological regime not seen during training, including Sahelian flash flood systems, South Asian monsoon basins, and Himalayan catchments.
A detailed per-region breakdown shows that performance degradation is most significant in highly non-stationary hydrological environments, particularly monsoon-dominated basins such as Bangladesh and parts of India. In contrast, regions with more stable hydrological regimes exhibit smaller performance drops, indicating that model generalization is sensitive to climatic variability rather than uniformly degraded.
Importantly, despite this expected distribution shift, the model maintains strong performance across all test regions, confirming that the observed gap reflects realistic deployment conditions rather than overfitting. This behavior is consistent with prior findings in spatiotemporal climate modeling, where cross-domain transfer typically incurs measurable but bounded degradation under strict spatial–temporal separation protocols.
4.3. Calibration and Reliability Analysis
Because flood forecasting is a high-stakes decision-support task, probability calibration was further evaluated to ensure that predicted flood probabilities correspond closely to observed event frequencies. In addition to ROC-AUC evaluation, calibration quality was assessed using reliability diagrams, Expected Calibration Error (ECE), and Brier scores before and after post hoc temperature scaling.
Temperature scaling was applied on the validation set using a single scalar temperature parameter to recalibrate the student model’s probabilistic outputs without modifying classification boundaries. Reliability analysis demonstrated that calibration substantially improved the agreement between predicted probabilities and observed flood occurrence frequencies, particularly for medium- and high-risk events.
The distilled student model achieved a post-calibration ECE of 0.031 compared to 0.084 before calibration, while the Brier score improved from 0.118 to 0.091. Similar improvements were observed for the teacher model, although the distilled student maintained lower inference cost and memory consumption. Reliability diagrams presented in Figure 4 further demonstrate that post-calibration predictions more closely follow the ideal diagonal calibration line, indicating improved probabilistic consistency across forecast thresholds.
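The calibration procedure can be illustrated with a small sketch of temperature scaling for a binary flood probability, together with the ECE and Brier score. The grid search over the temperature, the equal-width binning, and all function names are assumptions for illustration; the study's own optimizer and binning details are not given in the text.

```python
import math
from typing import List, Optional, Sequence


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits: Sequence[float], labels: Sequence[int], temperature: float) -> float:
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def fit_temperature(logits: Sequence[float], labels: Sequence[int],
                    grid: Optional[List[float]] = None) -> float:
    """Pick the single scalar T that minimizes validation NLL (grid search)."""
    grid = grid or [0.5 + 0.05 * k for k in range(91)]  # 0.5 to 5.0
    return min(grid, key=lambda t: nll(logits, labels, t))


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs: Sequence[float], labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Sample-weighted mean gap between predicted and observed event frequency per bin."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            conf = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            ece += len(members) / len(probs) * abs(freq - conf)
    return ece
```

Because dividing logits by a positive temperature preserves their sign and ordering, the recalibrated probabilities rank samples exactly as before; only their confidence changes.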
The reliability diagrams were generated using a fixed binning strategy for probability calibration analysis. Initial visual artifacts in the reliability curve (Figure 4) were attributed to the use of a coarse bin resolution (10 bins) combined with prediction concentration in high-confidence regions, a known effect in imbalanced classification problems such as flood prediction.
To improve interpretability, the visualization was recomputed using an increased bin resolution (20 bins) without altering the underlying probability estimates used for Expected Calibration Error (ECE) computation. This ensures that reported calibration metrics remain unchanged while improving the visual smoothness and interpretability of reliability plots.
These results indicate that the framework not only achieves reliable discrimination performance but also produces well-calibrated probabilistic forecasts suitable for operational flood early-warning systems, where dependable uncertainty estimation is essential for balancing false alarms and missed flood events.
4.4. Model Compression Effectiveness
To assess the efficiency of the proposed distillation framework, the student model’s computational profile was benchmarked against the baselines. As shown in Figure 5, the student model achieved an 8× reduction in inference latency compared to the teacher model and a 20× reduction in model size, with a drop in accuracy of approximately 3%. Specifically, the distilled model required 85 million FLOPs and occupied 18 MB of storage, enabling real-time prediction on devices such as the Raspberry Pi 4 and Jetson Nano.
This performance–efficiency tradeoff underscores the practical viability of the approach for low-resource settings where bandwidth, compute, and power consumption are major constraints.
In addition, the performance metrics of all models are tabulated in Table 3, offering a side-by-side comparison of predictive and computational characteristics.
4.5. Regional Generalization
To evaluate spatial generalization, the model was assessed using a structured leave-one-country-out validation protocol across the study regions, namely Nigeria, Kenya, Mozambique, Bangladesh, India, and Nepal. In this evaluation framework, each country was systematically excluded from the training set and used exclusively as a held-out test region, thereby enabling a strict assessment of cross-regional transferability under heterogeneous hydrological and climatic regimes. This design avoids reliance on geographically adjacent testing and instead provides a more robust measure of true out-of-distribution generalization across distinct flood systems.
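The leave-one-country-out protocol can be expressed as a simple fold generator. The sample representation (a dictionary with a `country` key) is a hypothetical layout used only to make the sketch self-contained.

```python
from typing import Dict, Iterator, List, Tuple

# Study regions as listed in the text.
COUNTRIES = ["Nigeria", "Kenya", "Mozambique", "Bangladesh", "India", "Nepal"]

Sample = Dict[str, object]


def leave_one_country_out(
    samples: List[Sample],
) -> Iterator[Tuple[str, List[Sample], List[Sample]]]:
    """Yield (held_out_country, train_samples, test_samples) for each country."""
    for held_out in COUNTRIES:
        # The held-out country contributes nothing to training in this fold.
        train = [s for s in samples if s["country"] != held_out]
        test = [s for s in samples if s["country"] == held_out]
        yield held_out, train, test
```

Each fold retrains on five countries and evaluates on the sixth, so the reported per-country scores measure transfer to a region whose data the model never saw during training.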
Performance is reported at the country level to ensure interpretability across diverse flood-generating mechanisms, including Sahelian flash floods in West Africa, monsoon-driven riverine flooding in South Asia, and complex catchment responses in Himalayan basins. As summarized in Figure 6, the model demonstrates consistently strong performance across both seen and held-out regions, with variability reflecting differences in hydrological complexity and climatic non-stationarity.
Across all evaluated regions, the average degradation in accuracy relative to in-domain performance remains below 4.5%, confirming strong but non-uniform transferability across climatic zones. Importantly, discharge predictions are reported in physical units (m³/s), enabling physically meaningful comparison across basins with differing hydrological scales, while normalized metrics are additionally provided where inter-basin magnitude differences may bias interpretation.
These findings suggest that the proposed student model learns transferable hydrometeorological representations, supported by its geospatial positional encodings and attention-based temporal modeling; however, the degree of transferability is influenced by regional hydrological heterogeneity, particularly in extreme or highly non-stationary flood regimes.
4.6. Ablation Studies
To understand the contribution of individual components in the distilled student architecture, a series of ablation experiments was conducted. Each experiment involved systematically removing a key module or loss term and retraining the student model under identical conditions. The results provide empirical evidence for the role of each design decision in enhancing predictive accuracy and generalization.
In the first variant, attention distillation was removed from the loss function. This modification resulted in a noticeable degradation in model performance, with the AUC dropping from 0.91 to 0.854 and the F1-score falling by 5%. This degradation highlights the significance of preserving inter-variable temporal dependencies and spatial correlation patterns learned from the teacher model.
The second variant excluded the feature alignment loss, which is responsible for bridging intermediate representations between the teacher and student networks. Although the overall accuracy remained moderately high (86%), the predictions became less calibrated, and the variance in the flood risk maps increased. This resulted in lower interpretability and less robust decision thresholds, indicating the importance of latent feature supervision.
A third ablation involved removing geospatial encodings from the student model’s architecture. This variant showed the steepest decline in cross-region generalization, with the accuracy in unseen regions dropping by over 8%. The AUC also fell to 0.837, reinforcing the need for integrating geographical priors in spatially aware environmental modeling.
Collectively, these results confirm that the full model configuration offers the best tradeoff between accuracy and generalization, with each component contributing uniquely to the overall architecture. The results of the ablation study are summarized in Table 4.
In addition, Figure 7 presents a comparative visualization of the key performance metrics (accuracy, F1-score, and AUC) across different architectural configurations of the student model, highlighting the effect of specific components through ablation.
The results of this study demonstrate the viability and effectiveness of a distilled deep learning framework for localized flood forecasting, particularly in low-resource environments. The proposed student model achieved strong performance across both classification and regression tasks, maintaining high fidelity to its teacher counterpart while operating at a fraction of the computational cost. These gains are especially meaningful in contexts where real-time decision-making is constrained by limited access to infrastructure, connectivity, or power.
To explicitly quantify the contribution of CMIP6 historical simulations during teacher model pretraining, we conducted a controlled ablation study comparing three configurations: (i) training from scratch without large-scale pretraining, (ii) ERA5-only pretraining, and (iii) combined ERA5 + CMIP6 pretraining (proposed approach). The results are summarized in Table 5.
As shown in Table 5, training from scratch yields the weakest performance across all metrics, with an accuracy of 0.81 and an AUC of 0.83, highlighting the importance of large-scale climate pretraining for learning meaningful climate dynamics.
Introducing ERA5-only pretraining significantly improves performance, increasing accuracy to 0.87 and AUC to 0.90. This indicates that observation-constrained reanalysis data provides strong supervision for learning hydrometeorological dynamics relevant to flood prediction.
The inclusion of CMIP6 historical simulations further improves performance, albeit more modestly, raising accuracy to 0.89 and AUC to 0.91, while also reducing cross-region generalization error from 5.2% to 4.3%. These gains indicate that CMIP6 primarily contributes to improving long-term climatological diversity and distributional robustness, rather than significantly altering event-scale predictive accuracy.
Importantly, the improvements introduced by CMIP6 are most pronounced in out-of-distribution regional testing, where exposure to broader climate variability helps stabilize predictions in unseen hydrological regimes. In contrast, ERA5 remains the dominant source of fine-scale predictive signal due to its observational constraint and higher temporal fidelity.
These results demonstrate that CMIP6 serves as a complementary pretraining signal that enhances generalization under domain shift, while ERA5 remains the primary driver of short-term flood forecasting accuracy.
A key outcome of this research is the model’s ability to generalize across diverse geographies with minimal degradation. The incorporation of geospatial encodings and region-aware attention mechanisms enabled the student model to transfer hydrometeorological knowledge across climatologically similar but data-scarce areas. This is of particular importance in regions like Sub-Saharan Africa and South Asia, where flood risk is high but local modeling capabilities are limited. Moreover, the distillation framework promotes not only computational efficiency but also interpretability. Despite these advantages, certain limitations must be acknowledged. The model’s predictive performance in regions with extreme or atypical flood dynamics remains less stable, and its reliance on available satellite and ground truth data introduces sensitivity to data quality and resolution. Furthermore, ethical concerns related to under- or over-prediction of flood events necessitate careful calibration before deployment in real-world early warning systems.
4.7. Rare-Event Evaluation and Class Imbalance Analysis
Given the highly imbalanced nature of flood forecasting datasets, where flood occurrences constitute between 5% and 10% of total samples depending on region (Table 6), evaluation was extended beyond ROC-AUC to include precision–recall (PR) curves and average precision (AP), which provide a more reliable measure of predictive performance under rare-event conditions.
As shown in Table 6, all study regions exhibit strong class imbalance, with non-flood events dominating the dataset. This imbalance significantly impacts evaluation metrics, as ROC-AUC may remain high even when models perform poorly on rare positive events. Therefore, PR-AUC is treated as the primary metric for assessing operational flood detection capability.
To ensure fair and operationally meaningful evaluation, classification thresholds were not fixed at 0.5 but instead selected using validation-set optimization. Specifically, the operating threshold was chosen to maximize the F1-score on the validation split while maintaining recall above a predefined minimum acceptable level for flood warning applications.
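The threshold selection rule described above can be sketched as a constrained search over candidate thresholds. The candidate grid and the `min_recall` argument are illustrative assumptions, since the text does not state the minimum acceptable recall that was used.

```python
from typing import List, Optional, Sequence, Tuple


def select_threshold(
    probs: Sequence[float],
    labels: Sequence[int],
    min_recall: float,
    candidates: Optional[List[float]] = None,
) -> Tuple[float, float]:
    """Return (threshold, F1) maximizing validation F1 subject to recall >= min_recall."""
    candidates = candidates or [k / 100 for k in range(1, 100)]
    best: Optional[Tuple[float, float]] = None  # (f1, threshold)
    for t in candidates:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        if tp == 0:
            continue  # precision and F1 are undefined or zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if recall < min_recall:
            continue  # violates the operational recall floor
        f1 = 2 * precision * recall / (precision + recall)
        if best is None or f1 > best[0]:
            best = (f1, t)
    if best is None:
        raise ValueError("no candidate threshold satisfies the recall constraint")
    return best[1], best[0]
```

The threshold found on the validation split is then frozen and applied unchanged to the test split, which keeps the reported test metrics free of threshold tuning on test data.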
In addition, full precision–recall curves are reported for all models to provide a complete characterization of performance under varying decision thresholds. These curves demonstrate that the model maintains strong precision across a wide range of recall values, confirming robustness under extreme class imbalance conditions.
While ROC-AUC remains included for comparability with prior literature, it is interpreted with caution due to its insensitivity to class prevalence. Consequently, PR-AUC and F1-score are emphasized as the primary indicators of predictive performance in this study.
4.8. Risk Management, Ethics, and Deployment Considerations
Flood forecasting systems carry significant societal responsibility, as both false positives and false negatives can lead to substantial economic and humanitarian consequences. To mitigate these risks, the model incorporates probabilistic outputs that allow for threshold calibration based on local risk tolerance. Calibration techniques, including temperature scaling, were applied to align predicted probabilities with observed frequencies.
The framework is designed to support human-in-the-loop deployment, where model outputs augment, rather than replace, expert decision-making by hydrologists and disaster management authorities. This approach enables contextual interpretation of predictions and reduces the risk of automated misjudgment in high-stakes scenarios.
From an ethical and governance perspective, particular care must be taken when deploying predictive models in vulnerable communities. Transparency, uncertainty communication, and continuous post-deployment monitoring are essential to ensure responsible use. The lightweight nature of the student model further supports equitable access by enabling deployment in regions with limited computational infrastructure.
5. Conclusions
This study presented a knowledge distillation framework for transforming high-capacity global climate models into lightweight, regionally adaptable flood forecasting systems suitable for deployment in low-resource environments. Through a teacher–student learning paradigm that integrates spatiotemporal attention, multimodal data fusion, and feature-level supervision, the proposed student model achieved strong predictive performance while significantly reducing computational demands.
Experimental evaluation across multiple regions demonstrated that the distilled student model retained approximately 96.7% of the teacher model’s classification accuracy, achieving an average accuracy of 0.89, an AUC of 0.91, and an F1-score of 0.88, while reducing inference latency by 8× and model size by over 20×. In discharge estimation tasks, the student model achieved a mean absolute error of 0.17, an RMSE of 0.24, and an R² score of 0.85 on normalized discharge, confirming close alignment with observed hydrological measurements. These results indicate that the distillation process successfully preserved both predictive fidelity and hydrometeorological reasoning from the teacher model.
The framework also demonstrated robust regional and temporal generalization, with accuracy degradation remaining below 4.5% when applied to unseen geographic regions and below 6% during historically extreme flood years. Ablation studies further confirmed the importance of attention distillation, feature alignment, and geospatial encodings, each contributing measurably to predictive accuracy, calibration stability, and cross-region transferability. Importantly, the distilled model achieved real-time inference on edge platforms such as the Raspberry Pi 4 and NVIDIA Jetson Nano, validating its practicality for deployment in bandwidth- and power-constrained settings.
Looking ahead, several promising research directions emerge. First, incorporating active learning mechanisms could improve adaptability by enabling community-level or expert feedback during deployment. Second, extending the multimodal input space with drone imagery and mobile sensor networks may enhance situational awareness in data-scarce regions. Third, embedding the forecasting framework within multilingual and culturally contextualized early warning systems could increase accessibility and trust among end users. Finally, future work may explore cross-region meta-distillation and continual learning strategies to further strengthen generalization across heterogeneous climate regimes.
Together, the framework demonstrates that knowledge distillation can bridge the gap between high-capacity climate intelligence systems and practical flood forecasting applications, enabling scalable and computationally efficient early-warning solutions for climate-vulnerable regions.