1. Introduction
The ionosphere, Earth’s upper atmospheric layer ionized by solar radiation, is integral to radio signal propagation but can introduce significant positioning errors, posing challenges for Global Navigation Satellite System (GNSS) applications [1]. In safety-critical domains such as civil aviation, operations like aircraft landing are constrained to the certified L1 frequency (~1.575 GHz), which demands high positional accuracy. Ionospheric gradients and scintillation, particularly during geomagnetic disturbances, can degrade signal accuracy and threaten positioning integrity [2]. Beyond these storm-time effects, even under quiet geomagnetic conditions, post-sunset plasma-bubble scintillation in equatorial regions has been reported to occasionally cause loss of lock [3,4]. To mitigate these effects, Ground-Based Augmentation Systems (GBAS) and Satellite-Based Augmentation Systems (SBAS) provide ionospheric corrections and enhance signal reliability. These systems rely heavily on accurate modeling of ionospheric delays, which are typically quantified using the total electron content (TEC): the electron density integrated along a signal path, measured in TEC Units (1 TECU = 10¹⁶ electrons/m²).
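For reference, the quantities involved can be written out explicitly. The delay expression below is the standard first-order relation found in GNSS textbooks, not a formula specific to this work:

```latex
\mathrm{TEC} = \int_{\mathrm{rx}}^{\mathrm{sat}} N_e(s)\,\mathrm{d}s,
\qquad 1\ \mathrm{TECU} = 10^{16}\ \mathrm{electrons/m^2},
\qquad \Delta L_{\mathrm{iono}} \approx \frac{40.3}{f^{2}}\,\mathrm{TEC}\ [\mathrm{m}],
```

where N_e is the electron density along the ray path and f is the carrier frequency in Hz. At GPS L1 (f ≈ 1575.42 MHz), one TECU corresponds to roughly 0.16 m of range delay, which is how TEC errors translate directly into positioning errors.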
As GNSS-enabled services continue to expand across sectors, TEC forecasting has become an active research area, with numerous predictive models proposed [5]. Yet, modeling the ionosphere remains inherently difficult due to its complex and highly dynamic behavior in response to solar and geomagnetic activity [6,7,8]. Existing strategies range from well-established empirical models, such as the International Reference Ionosphere (IRI) [9,10], and physics-based simulations [11,12] to more recent machine learning (ML)-based approaches. ML techniques have been reported to show improved accuracy compared to traditional models, supported by the growing availability of data and advances in artificial intelligence (AI) [13,14,15,16].
Motivated by the COSPAR 2015–2025 roadmap’s call for extended forecasting and multi-step prediction capabilities in space weather [17], recent studies have increasingly adopted recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) architectures, to capture longer temporal dependencies in ionospheric behavior. For example, Kaselimi et al. [13] implemented a sequence-to-sequence LSTM approach for vertical TEC (VTEC), outperforming traditional baselines. Kim et al. [14] tailored their LSTM network for storm conditions, reporting improved accuracy compared to SAMI2 and IRI-2016. More recently, Ren et al. [18] presented a hybrid CNN–BiLSTM design that integrates geomagnetic indices such as Kp, ap, and Dst, focusing on storm conditions.
While LSTMs effectively capture temporal dynamics, they lack inherent spatial awareness without additional modules. In parallel, researchers have therefore explored Convolutional LSTM (ConvLSTM)-based approaches to jointly model spatial and temporal structure. A foundational contribution in this direction is the work of Boulch et al. [19], who developed one of the few fully open-source AI models for ionospheric prediction. Their best-performing configuration, DCNN, stacked three ConvLSTM layers and applied a heliocentric transformation to account for Earth’s rotation. More recent studies have proposed alternative ConvLSTM-based architectures, often without referencing or benchmarking against DCNN. Liu et al. [20] refined the loss function and explored storm-specific training; Gao and Yao [21] introduced multichannel inputs, including solar and geomagnetic indices; and Yang et al. [16] showed that a TEC-only ConvLSTM could outperform both IRI-Plas and C1PG. Most recently, Liu et al. [22] focused on parameter optimization and incorporated skip connections and attention mechanisms, achieving improved predictive accuracy and generalization.
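For readers less familiar with this building block, the sketch below shows a standard ConvLSTM cell in PyTorch, following the original Shi et al. gate layout; it is purely illustrative and does not reproduce DCNN or any of the specific architectures cited above. The 71 × 73 grid in the usage example assumes the common IGS GIM layout (2.5° × 5°).

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell (Shi et al., 2015): the LSTM gates are
    computed with convolutions, so hidden states keep a spatial layout."""
    def __init__(self, in_ch: int, hid_ch: int, kernel: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        # A single convolution produces all four gates from [input, hidden].
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state                          # (B, hid_ch, H, W) each
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)         # update cell memory
        h = o * torch.tanh(c)                 # new hidden map
        return h, c

# Usage: roll the cell over a sequence of TEC maps (B, T, C, H, W),
# e.g. twelve 2-h frames on a 71 x 73 IGS GIM grid.
cell = ConvLSTMCell(in_ch=1, hid_ch=32)
seq = torch.randn(2, 12, 1, 71, 73)
h = torch.zeros(2, 32, 71, 73)
c = torch.zeros_like(h)
for t in range(seq.size(1)):
    h, c = cell(seq[:, t], (h, c))
```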
Despite these advances, the lack of common baselines makes it difficult to assess progress or establish consistent evaluation standards within the field. As a result, ionospheric ML research risks developing in isolation, without the cumulative rigor seen in more mature spatiotemporal modeling domains such as video prediction or weather forecasting. Shared evaluation practices are critical for both scientific and operational goals, as also emphasized by [23,24,25].
Toward this goal, we develop the first iteration of IonoBench, a benchmarking framework for comprehensive ionospheric ML forecasting, as an initial step toward the evaluation standards and common baselines the field needs. Within this framework we evaluate DCNN [19] alongside two representative state-of-the-art (SOTA) models adapted from recent spatiotemporal benchmarks [26]: SwinLSTM [27] and SimVPv2 [28]. The performance of these AI models is benchmarked against a climatological reference model, IRI 2020 [9]. Subsequently, the top-performing model is compared against C1PG, the CODE analysis center’s 1-day forecast product, to establish an operational reference point.
Furthermore, a recurring limitation in prior work is the narrow temporal scope of evaluation datasets, which often fail to reflect the full spectrum of ionospheric variability. Many studies have been confined to solar cycle 24 (SC24), a period of relatively low solar activity, which limits the stress-testing capacity of proposed models. In contrast, solar cycle 23 (SC23) included several of the most intense geomagnetic storms of recent decades and shows F10.7 trends more closely aligned with the ongoing solar cycle 25 (SC25) [29], making it a more operationally representative basis for model development and evaluation.
To address this, our framework proposes a carefully designed stratified data split that balances solar activity levels across training, validation, and test sets while explicitly preserving major geomagnetic storm events in the test set. This design enables robust, real-world evaluation scenarios that reflect both quiet and disturbed conditions, which is critical for forecast readiness. All datasets, split definitions, models, and configurations are released as open source on GitHub via the IonoBench repository:
https://github.com/Mert-chan/IonoBench.git (accessed on 10 July 2025).
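As a rough illustration of the split logic (the authoritative definitions are in the repository), the sketch below assumes a hypothetical samples table carrying a solar_class label and a pre-compiled list of storm windows; a production implementation would split contiguous blocks rather than shuffled samples, so that input windows never straddle subsets:

```python
import numpy as np
import pandas as pd

def stratified_split(samples: pd.DataFrame, storm_windows,
                     frac=(0.7, 0.15, 0.15), seed=0):
    """Balance solar-activity classes across train/val/test while forcing
    every catalogued storm window into the test set. `samples` is assumed
    to have a datetime index and a 'solar_class' column (e.g. F10.7 binned
    into very weak / weak / moderate / intense)."""
    in_storm = np.zeros(len(samples), dtype=bool)
    for start, end in storm_windows:             # reserve storms for testing
        in_storm |= (samples.index >= start) & (samples.index <= end)

    rng = np.random.default_rng(seed)
    train, val, test = [], [], [samples.index[in_storm].to_numpy()]
    for _, grp in samples[~in_storm].groupby("solar_class"):
        idx = grp.index.to_numpy()
        rng.shuffle(idx)                         # block-wise in practice
        n_tr, n_va = int(frac[0] * len(idx)), int(frac[1] * len(idx))
        train.append(idx[:n_tr])
        val.append(idx[n_tr:n_tr + n_va])
        test.append(idx[n_tr + n_va:])
    return tuple(np.concatenate(parts) for parts in (train, val, test))
```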
4. Investigating Future Bias in Stratified Split
A potential drawback of stratified splitting is its violation of temporal order, which may allow the model indirect access to future observations. To assess whether stratified splitting still supports generalization under operational constraints, where future data is unavailable, we compared it against a strictly chronological baseline with matched solar activity coverage.
The valid_v3 subset (a high-activity period; see Figure A2) served as the common evaluation window, as it avoids future data leakage under both splitting strategies. In the chronological setup, the model was trained on data from 2000 to 2016 and validated on data from 2019 to 2023. This configuration matches the training volume of the stratified split and approximates its solar intensity distribution, enabling a fair comparison of the two approaches (see Appendix C, Table A3).
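For clarity, the two setups compared in this section reduce to the following configuration sketch (date ranges as stated above; the stratified split identifier is a placeholder, not a name from the repository):

```python
# Chronological baseline vs. stratified split, both evaluated on the
# same leakage-free window (valid_v3).
configs = {
    "chronological": {
        "train": ("2000-01-01", "2016-12-31"),  # volume-matched to stratified training
        "val":   ("2019-01-01", "2023-12-31"),
        "eval":  "valid_v3",                    # common high-activity window
    },
    "stratified": {
        "split": "stratified_v1",               # placeholder identifier
        "eval":  "valid_v3",                    # same window, no future leakage
    },
}
```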
Both configurations were trained with SimVPv2 using an identical pipeline with hyperparameter tuning, as described in Section 3.2, and were evaluated on the held-out valid_v3 subset. As shown in Figure 3 and Table 2, the stratified split yields a consistent performance gain over the chronological baseline across all forecast horizons, with lower RMSE and higher R².
While this comparison focused on a high-activity period, extending the analysis to other solar conditions would require redefining the stratified subsets to maintain balance without introducing data leakage. Such adjustments would compromise the consistency of the comparison and were therefore left outside the scope of this study.
Overall, these results provide encouraging evidence that a stratified split, despite violating temporal order, need not preclude effective model generalization. We note, however, that for the stratified model the valid_v3 evaluation data were a subset of the larger validation set used for tuning, which may contribute to this performance margin. Consequently, for final pre-deployment validation, chronological splits remain indispensable to mitigate any potential future-information leakage and to ensure true operational generalizability. This aligns with the hybrid development–deployment strategy discussed in Section 6.
5. Results
This section presents a comprehensive evaluation of the benchmarked models using the proposed framework. The analysis begins with overall performance across the full test set with a climatological base reference (IRI 2020), followed by a breakdown of model behavior under varying solar activity levels. It then examines performance during distinct phases of selected geomagnetic storms, incorporating average results across all 16 storms. Further comparison is provided through a visual analysis of a superintense storm versus a quiet-day scenario. The evaluation concludes with a benchmarking of the top-performing model against an operational ionosphere reference (C1PG).
5.1. Overall Performance
Table 3 presents the overall test set performance of the three evaluated models. In addition to accuracy, we report training time and normalized memory usage to account for varying batch sizes due to hardware and tuning constraints.
Among the models, SimVPv2 demonstrates the best performance across all metrics, achieving the lowest RMSE, the highest coefficient of determination (R²), and the best structural similarity (SSIM). It also shows the lowest performance variance across the test set and completes training faster than the other models under the current configuration.
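For concreteness, the three reported metrics can be computed over stacked TEC maps as in the sketch below (using scikit-image’s SSIM with the data range taken from the reference maps; the exact evaluation settings used by the framework are defined in the repository and may differ):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def tec_metrics(pred: np.ndarray, ref: np.ndarray):
    """RMSE, R² and mean SSIM over stacked TEC maps of shape (N, H, W)."""
    rmse = float(np.sqrt(np.mean((pred - ref) ** 2)))
    ss_res = np.sum((ref - pred) ** 2)
    ss_tot = np.sum((ref - ref.mean()) ** 2)
    r2 = float(1.0 - ss_res / ss_tot)
    drange = float(ref.max() - ref.min())     # data range from reference maps
    ssim_mean = float(np.mean([ssim(r, p, data_range=drange)
                               for r, p in zip(ref, pred)]))
    return rmse, r2, ssim_mean
```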
Interestingly, DCNN, despite being an older and lightweight model, remains highly competitive and slightly outperforms SwinLSTM in every metric. While SwinLSTM has the largest parameter count and memory footprint, these do not yield improved predictive accuracy under the current evaluation conditions.
The IRI 2020 results, derived using default settings and updated drivers, serve as a climatological benchmark across the same test set. Its overall RMSE, R², and SSIM (Table 3) show that the model captures broad TEC patterns but, as expected of a monthly-median climatology, it underperforms on a test set designed to challenge dynamic ionospheric responses. The reported performance gap is consistent with previous findings that empirical climatologies are outperformed by data-driven models. While IRI remains valuable as a climatological reference, its nature limits its responsiveness to real-time solar–terrestrial dynamics. Subsequent analyses omit IRI 2020, as its output is constant across forecast horizons, whereas the remainder of this study focuses on evaluating time-varying AI forecasts and C1PG.
5.2. Performance Across Solar Activity Levels
Evaluating performance across the stratified solar activity classes (Section 2.3) reveals key differences in model robustness. Figure 4a reports the average RMSE per solar class, aggregated across all forecast horizons. Under very weak and weak solar conditions, all models mirror their overall test set rankings, with SimVPv2 achieving the lowest error, followed by DCNN and SwinLSTM.
However, as solar activity intensifies, the relative performance of the models begins to shift. In the moderate and intense solar classes, SwinLSTM performs slightly better than DCNN (reversing the trend seen in weak conditions), and the overall error magnitudes increase substantially. This suggests that DCNN and SwinLSTM may be more sensitive to elevated solar forcing, while SimVPv2 maintains better stability.
Figure 4b provides a horizon-wise breakdown of this trend, showing that all models exhibit rising error over time, especially under higher solar activity levels. Notably, SimVPv2 retains a more gradual error progression, with increasing performance separation from the recurrent models as the forecast horizon lengthens.
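The aggregation behind Figure 4 amounts to a grouped RMSE over a long-format error table; a minimal sketch with illustrative column names and toy values (not actual results):

```python
import numpy as np
import pandas as pd

# One row per (sample, model, horizon): squared error plus the sample's
# stratified solar class. Values below are toy numbers, not results.
errors = pd.DataFrame({
    "model":       ["SimVPv2", "SimVPv2", "DCNN", "DCNN"],
    "solar_class": ["weak", "intense", "weak", "intense"],
    "horizon_h":   [2, 24, 2, 24],
    "sq_err":      [1.2, 9.8, 1.9, 16.4],
})

def rmse_by(df: pd.DataFrame, keys):
    """RMSE aggregated over the given grouping keys."""
    return df.groupby(keys)["sq_err"].mean().pipe(np.sqrt).rename("rmse")

per_class = rmse_by(errors, ["model", "solar_class"])                        # Figure 4a
per_class_horizon = rmse_by(errors, ["model", "solar_class", "horizon_h"])  # Figure 4b
```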
5.3. Geomagnetic Storm Analysis
This analysis focuses on the 16 distinct geomagnetic storm events in the test set (including two superintense storms, identified by their extreme Dst minima). Storm windows are extracted as a symmetric ±3-day period around the minimum Dst value. Within each window, we define the main phase as the ±6 h interval surrounding the Dst minimum, based on the period of most rapid ionospheric change. This design allows phase-specific error analysis.
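The window and phase definitions above translate directly into code; a minimal sketch assuming an hourly Dst series with a datetime index (the storm catalogue itself is part of the dataset):

```python
import pandas as pd

def storm_phases(dst: pd.Series, t_min: pd.Timestamp):
    """Label a ±3-day window around a Dst minimum. The 'main' phase is the
    ±6 h interval around the minimum; 'commencement' precedes it and
    'recovery' follows it. `dst` is assumed hourly with a datetime index."""
    window = dst.loc[t_min - pd.Timedelta(days=3): t_min + pd.Timedelta(days=3)]
    phase = pd.Series("commencement", index=window.index)
    phase[window.index > t_min + pd.Timedelta(hours=6)] = "recovery"
    main = ((window.index >= t_min - pd.Timedelta(hours=6))
            & (window.index <= t_min + pd.Timedelta(hours=6)))
    phase[main] = "main"
    return window, phase

# t_min would typically be dst.idxmin() within a catalogued storm interval.
```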
5.3.1. RMSE Timeseries for Selected Storms
Figure 5 illustrates model behavior across three storm events from the test set, selected to span a range of geomagnetic severity: two superintense storms and one intense, comparatively less severe storm (ranked by the depth of their Dst minima). For each storm, the first three columns show the RMSE time series across the full storm window at 2 h, 12 h, and 24 h forecast horizons. The final column summarizes the phase-specific RMSE (commencement, main, recovery) averaged across all horizons, with standard deviations shown as error bars.
An analysis of storm phases indicates that forecasting errors peak during the main phase of geomagnetic storms. This is largely due to the rapid development of electric fields and changes in atmospheric winds that disturb ionospheric density and structure in ways that are difficult for models to capture accurately [43,44]. During the recovery phase, errors remain elevated as the ionosphere undergoes a slower return to equilibrium, influenced by lingering effects such as altered composition and residual electrodynamic forcing [45].
Across all three storms, distinct model behaviors emerge. DCNN performs relatively well during the commencement phase of storms 2 and 3 but struggles to capture the sharp dynamics of the main and recovery phases, likely due to its simpler recurrent structure. In contrast, SwinLSTM tends to trail DCNN during early phases but shows marked improvements as the phase complexity increases, potentially reflecting its capacity to model temporally evolving structures. Lastly, SimVPv2 demonstrates strong overall stability, but its underperformance during the commencement of storm 2 might suggest storm-specific challenges.
5.3.2. Average Storm-Phase Impact Compared to Quiet Conditions
To complement the event-level storm evaluation, we compute phase-wise RMSE across all 16 storms, averaged across all forecast horizons. For comparison, we define a set of ‘quiet periods’ by selecting non-storm test samples from the moderate and intense solar intensity levels, isolating the impact of storm conditions from the background solar activity. Furthermore, only samples with Dst values above the storm threshold are selected. While this range may still contain weak or moderate geomagnetic storms, such events are not included in our dataset’s storm classification and are therefore excluded from the storm category in this analysis.
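A sketch of this selection and of the relative-degradation statistic reported in Table 4; the Dst storm threshold is passed as a parameter because its exact value is defined by the dataset’s storm classification, not hard-coded here:

```python
import numpy as np

def storm_impact(rmse_phase: float, rmse_quiet: float) -> float:
    """Percentage RMSE increase of a storm phase over quiet conditions."""
    return 100.0 * (rmse_phase - rmse_quiet) / rmse_quiet

def quiet_mask(dst, solar_class, in_storm, dst_threshold):
    """Quiet periods: non-storm samples from the moderate/intense solar
    classes whose Dst stays above the dataset's storm threshold."""
    return (~np.asarray(in_storm)
            & np.isin(solar_class, ["moderate", "intense"])
            & (np.asarray(dst) > dst_threshold))
```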
Table 4 presents the RMSE values of each model during quiet periods and across the three storm phases, alongside the corresponding percentage increases relative to quiet-time performance. All models experience the greatest degradation during the main phase, where DCNN and SimVPv2 show over a 140% increase in RMSE relative to quiet conditions. SwinLSTM records a 119.1% increase, indicating a somewhat more stable response under highly disturbed conditions.
Among the models, SimVPv2 achieves the lowest absolute RMSE in the commencement, main, and recovery phases (Table 4). In contrast, SwinLSTM, although not the most accurate in absolute terms, demonstrates the smallest relative performance loss, with an average increase of 55.8% over the full storm window.
Overall, SimVPv2 is the most accurate model across storm conditions, while SwinLSTM appears to be the most robust in terms of relative degradation. DCNN performs competitively with its SOTA counterpart SwinLSTM under quiet conditions, but exhibits the highest sensitivity to storm-phase dynamics, particularly during the main and recovery periods.
5.3.3. Visual Comparison: Residual Patterns During Stormy vs. Quiet Conditions
Lastly, Figure 6 contrasts performance during two representative scenarios from a period of high solar activity: one under quiet geomagnetic conditions and one during a superintense storm. With this, we aimed to qualitatively assess spatial performance across contrasting geophysical conditions. Figure 6 displays 12 h forecast outputs from DCNN, SwinLSTM, and SimVPv2, alongside the corresponding IGS GIM reference maps.
The storm case reflects one of the highest RMSE instances in the test set and offers insight into model behavior under extreme ionospheric disturbance. All models exhibit substantial residuals, with local errors exceeding 40 TECU. Among them, SimVPv2 captures the global TEC distribution most accurately, maintaining lower spatial bias and a more coherent structure across equatorial and mid-latitude regions. In contrast, both DCNN and SwinLSTM struggle in the Southern Hemisphere, where DCNN substantially overestimates TEC at mid-to-high latitudes while SwinLSTM shows only a slight overestimation. Statistically, DCNN shows the greatest error dispersion (Max: 46.61 TECU, Std: 13.81 TECU), while SimVPv2 demonstrates the most compact residual profile (Max: 35.69 TECU, Std: 8.17 TECU).
Under quiet conditions, all models perform well, generating visually consistent TEC maps. Their residual patterns are broadly similar, though subtle differences remain. SimVPv2 again records the lowest RMSE and standard deviation for this instance, while both DCNN and SwinLSTM exhibit elevated positive residuals concentrated near the eastern equatorial region, indicating that these models may overestimate TEC there under quiet conditions. SimVPv2, in contrast, produces lower residual magnitudes and fewer spatial artifacts, suggesting improved consistency in representing quiet-time TEC structures.
5.4. Comparison with C1PG Operational Baseline
Finally, we compare SimVPv2, the top-performing model in our evaluation, against C1PG, a widely used operational reference. C1PG is a 1-day global TEC forecast produced by the CODE analysis center and updated once per day. While several other operational ionospheric models exist [46,47], C1PG remains the predominant baseline in the recent literature owing to its ease of access and consistent availability.
This comparison allows us to assess SimVPv2’s performance against a well-established operational standard across different solar activity phases. C1PG has been available since 2014, enabling comparison across the test_v3 (solar low), valid_v2 (solar descent), and valid_v3 (solar high) subsets defined in our framework (Appendix C). Since C1PG provides daily predictions, we post-processed the SimVPv2 outputs into a matching daily format for consistent evaluation. Additionally, C1PG outputs were downsampled to a 2 h cadence to align with the IGS GIM resolution.
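A sketch of this alignment step, under the assumption that both products are pandas objects indexed by forecast time and that C1PG is available at hourly cadence; the actual post-processing in the framework may differ in detail:

```python
import pandas as pd

def align_for_comparison(simvp: pd.DataFrame, c1pg: pd.DataFrame):
    """Bring both products onto a comparable once-per-day footing.
    `simvp`: 2-h multi-step outputs indexed by issue time; `c1pg`: the
    daily CODE forecast, assumed here expanded to hourly maps."""
    # Keep only SimVPv2 forecasts issued at the daily C1PG epoch,
    # mirroring C1PG's once-per-day update cycle.
    simvp_daily = simvp[simvp.index.hour == 0]
    # Downsample C1PG from hourly to the 2-h IGS GIM cadence.
    c1pg_2h = c1pg[c1pg.index.hour % 2 == 0]
    return simvp_daily, c1pg_2h
```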
Figure 7 presents residual histograms across the three subsets, corresponding to different solar activity levels, and Table 5 summarizes the corresponding performance metrics.
SimVPv2 consistently outperforms C1PG across all evaluated subsets and metrics. The residual distributions shown in Figure 7 are visibly more compact and symmetric for SimVPv2, especially during periods of elevated solar activity (solar descent and solar high), where C1PG exhibits broader and more asymmetric error tails. Quantitatively, Table 5 confirms this advantage, with SimVPv2 achieving significantly lower RMSE and higher R² and SSIM values than C1PG across all tested activity levels. Notably, SimVPv2 maintains high R² values under all conditions, while its SSIM score, although consistently higher than C1PG’s, declines from the low-activity to the high-activity period. This underscores the value of using multiple metrics: R² confirms the model’s ability to track broad TEC variations, while the SSIM trend reveals that preserving the precise spatial structure of GIMs becomes more challenging under heightened solar activity, even for a high-performing model.
The magnitude of this outperformance is particularly informative. In the demanding high-activity period (valid_v3), SimVPv2 achieves an RMSE approximately 32% lower than C1PG’s. This trend holds across all conditions, with a 29% RMSE reduction during solar descent (valid_v2) and a notable 27% reduction during the low-activity period (test_v3). While the valid_v2 and valid_v3 subsets were involved in model tuning, the substantial 27% RMSE reduction on the truly held-out test_v3 subset provides strong evidence of SimVPv2’s generalization capabilities compared to the operational baseline.
6. Discussion
A central insight from this work is the importance of test set composition in evaluating ionospheric models. By stratifying across solar intensity levels and a range of intense-to-superintense geomagnetic storms from SC23–SC25, the IonoBench framework evaluates model performance under a wider range of conditions than many current studies with narrower testing periods. This stress testing reveals performance degradation during high activity and storms in detail, particularly during the main storm phase (average RMSE increase of >140% vs. quiet conditions; Section 5.3.2). Although a few recent studies incorporate storm-phase distinctions [15,18,48,49], phase-specific evaluation remains largely overlooked, even among storm-focused ionospheric modeling works, which typically treat storms as a single block rather than decomposing them into commencement, main, and recovery phases.
Further emphasis should be placed on the number and diversity of storms included; every geomagnetic storm is unique, and each can elicit different model responses. For example, in the case of storm 2, SimVPv2’s performance was notably poor during the onset (Section 5.3). Residual pattern analysis revealed that DCNN significantly overestimated TEC in the Southern Hemisphere under certain conditions, whereas SwinLSTM showed only slight overestimation and SimVPv2 exhibited no such bias (Section 5.3.3). Although an aggregate analysis is reported here, the framework and dataset enable deeper, event-specific assessments.
The results of our comparison in Section 4 should be contextualized by the primary motivation for the stratified split. The goal is not simply to achieve better aggregate performance metrics, but to address a crucial methodological pitfall in model evaluation: the risk of performance bias from under-represented extreme events and conditions. AI-based models often predict nominal conditions adeptly, yet their performance can degrade significantly during critical events such as geomagnetic storms. A standard chronological test set may lack a sufficient number of these events, leading to an optimistic assessment of a model’s true operational readiness. While the stratified split excels at such robust testing, the nature of operational deployment favors a chronological approach for final validation. This distinction naturally leads to a practical hybrid approach to model development and deployment. We suggest using a stratified split for benchmarking and comparing architectures under a wide range of solar and geomagnetic conditions; once the most robust model is identified, it can be retrained on a purely chronological split for final deployment. This two-step process ensures that the deployed model operates under realistic, time-respecting conditions while having been validated for generalization across diverse solar–terrestrial conditions.
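The suggested two-step recipe can be summarized as a short sketch; all names here are caller-supplied placeholders rather than IonoBench API:

```python
from typing import Any, Callable, Dict

def develop_and_deploy(
    candidates: Dict[str, Any],
    splits: Dict[str, Any],
    train_fn: Callable[[Any, Any], Any],
    eval_fn: Callable[[Any, Any], float],   # e.g. returns RMSE (lower is better)
):
    """Hybrid strategy sketched above: benchmark every candidate on the
    stratified split, then retrain the winner chronologically."""
    # Step 1: stratified split -> robust architecture comparison.
    scores = {name: eval_fn(train_fn(model, splits["stratified_train"]),
                            splits["stratified_test"])
              for name, model in candidates.items()}
    best = min(scores, key=scores.get)
    # Step 2: chronological retraining -> time-respecting deployment model.
    deployed = train_fn(candidates[best], splits["chrono_train"])
    return best, deployed
```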
In terms of modeling architectures, the strong performance of SimVPv2, a non-recurrent convolutional model adapted from general spatiotemporal modeling, highlights the potential for cross-domain architectural transfer into the ionospheric domain. Its success suggests that advancements from fields like video prediction can be effectively leveraged, provided they are adapted and evaluated within a domain-specific context using appropriate benchmarks. This potential strengthens the bridge between ionospheric research and the broader machine learning community, supporting the use of shared benchmarks and modular tools.
At the same time, the results reaffirm the continued relevance of architectures developed specifically for ionospheric modeling. DCNN, despite being an earlier open-source model based on ConvLSTM layers, remained competitive, even outperforming the more complex, Transformer-based SwinLSTM in overall metrics under the tested conditions (Table 3). These results, coupled with the observation that DCNN is rarely used as a comparison point in recent studies despite its availability, point to a potential issue of knowledge fragmentation and a tendency to overlook existing AI baselines within the field. Maintaining continuity, performing comparisons, and establishing benchmarks are essential for tracking genuine progress and ensuring reproducibility.
The name IonoBench is a conscious reference to WeatherBench, which likewise began with a modest set of models [50] and grew through broader community engagement and follow-up releases [51]. This reflects a shared emphasis on accessibility, transparency, and reproducibility as the foundations of a useful and enduring scientific framework.
While this represents the initial foundation of IonoBench, several limitations are acknowledged. Due to practical constraints such as computational resources, this version includes only three AI baselines, chosen to represent distinct and widely adopted modeling paradigms: convolutional, recurrent, and hybrid architectures. An operational (C1PG) and a climatological (IRI 2020) reference are also included; however, they are not yet integrated into plug-in evaluation protocols, which will be a priority in the near future.
Currently, only multichannel inputs are supported, which limits out-of-the-box use of single-channel or scalar models without framework adaptation. Although standard deviations are reported to indicate variability, the framework currently lacks formal uncertainty quantification and metrics specifically evaluating the temporal consistency (smoothness) of forecasts. Additionally, support for explainability tools (e.g., feature attribution and SHAP value analysis) is not yet integrated.
Future iterations of IonoBench will extend support to a wider range of model architectures, informed by recent developments in Transformers and graph neural networks [24,52,53]. They will also expand the auxiliary input types with magnetic latitude and local time, enhancing the models’ ability to capture geomagnetic field-aligned structures.
Additional priorities include enabling support for scalar and single-channel input models, integrating uncertainty quantification methods, and developing metrics to assess the temporal consistency of forecasts. Improving model interpretability is also a focus, with plans to incorporate explainability tools such as feature attribution and SHAP analysis. These improvements aim to strengthen the scientific and operational utility of the framework. As with similar efforts in other domains, the evolution of IonoBench will be guided by community collaboration, and contributions from researchers are both welcomed and encouraged.