1. Introduction
Air pollution has become a serious challenge in low- and middle-income countries (LMICs), such as South Africa, which is highly dependent on coal-fired power plants for energy production. However, air pollution is not limited to LMICs [
1,
2]. High-income countries such as Switzerland also experience adverse health effects from relatively low levels of
PM2.5 concentration, particularly among vulnerable populations such as children, the elderly, and those with preexisting conditions. Therefore, studying air pollution is crucial to understanding its impact on people’s health and livelihoods [
3].
Breathing air with high pollutant levels can lead to cancers and various other diseases. Particulate matter smaller than 2.5 microns is one of the main carcinogens in air pollution. These particles are so small that they can penetrate deep into lung tissue and enter the bloodstream, with dangerous effects on the human body. This pollutant alone causes 7 million deaths annually worldwide [
4].
PM2.5 originates from a diverse range of sources, including fossil fuel combustion, mechanical abrasion processes (such as tire, brake, and road wear), agricultural activities, construction and industrial emissions, as well as natural sources like dust and wildfires. More broadly,
PM2.5 can be emitted directly from primary sources, commonly associated with combustion activities, and can also form as secondary particles through chemical reactions in the atmosphere involving precursor gases such as sulfur dioxide, nitrogen oxides, and organic compounds. These organic compounds may be natural or anthropogenic, including those emitted from automobile exhausts [
4,
5].
In addition to its profound effects on human health and air quality,
PM2.5 also poses a significant threat to the environment with its diverse and pervasive effects [
6]. When present in the atmosphere,
PM2.5 contributes to climate change, atmospheric haze, and reduced visibility by absorbing and scattering sunlight, altering temperature patterns, and intensifying the greenhouse effect. Furthermore,
PM2.5 influences the Earth’s radiation balance, affects atmospheric dynamics, and plays a role in cloud formation, thereby influencing precipitation patterns and overall weather conditions.
The multifaceted environmental consequences of
PM2.5 underscore the urgent need for comprehensive strategies to mitigate its emissions and address its detrimental impact on human health and the broader ecosystem [
4,
5,
6]. Forecasting
PM2.5 concentrations is crucial to implement measures to reduce emissions and mitigate
PM2.5 levels in the atmosphere. The study of air quality is of the utmost importance due to its serious impact on human health, the environment, and many other factors. Therefore, it is necessary to use current information to forecast pollutant concentrations and identify necessary changes to reduce
PM2.5 levels in the atmosphere [
7].
The study of pollutant concentration is inherently a spatiotemporal problem. This is because there is a spatial correlation between the surrounding areas, which affects the pollutant concentration of the studied area. The temporal correlation comes from the fact that current or future pollutant concentrations depend on previous concentration measurements [
8]. This is where spatiotemporal graph neural networks (GNNs) show their utility. A spatiotemporal GNN consists of two stages. The first stage is a graph neural network, which applies neural networks to graph-structured data. A graph is a mathematical structure consisting of nodes and edges, where the graph and its constituents can have attributes or features that can be updated. Information on the graph is exchanged in a message-passing step, in which nodes send information along the edges, aggregate the messages they receive, and update their own features. This is where spatial correlation comes into play: a network of air monitoring stations can be represented as a graph, with the nodes being the air monitoring stations and the edges representing the interactions between the stations. The second stage uses a recurrent neural network (RNN), which captures the temporal correlation and helps model the temporal diffusion of pollutants. The two-stage model used in this study is the
PM2.5-GNN, a domain-knowledge enhanced spatiotemporal GNN.
2. Literature Review
Air pollution is a significant contributor to serious environmental and health concerns, arising from industrial emissions, atmospheric contamination due to climate and traffic factors, and fossil fuel combustion. Recognizing this as a global issue, many countries have established air pollution control stations in various cities to monitor pollutants such as nitrogen dioxide (NO
2), carbon dioxide (CO
2), sulfur dioxide (SO
2), and particulate matter (
PM2.5,
PM10). These stations alert residents when pollution levels exceed the threshold. Given the increasing air pollution levels, it is crucial to develop machine learning models that capture data, model it, and predict air pollutant concentrations. Africa, in particular, faces a shortage of reliable air quality sensors for monitoring and predicting
PM2.5 compared to other regions, highlighting the potential for expanding research in air pollution control [
9].
As previously mentioned, predicting the concentration of PM2.5 presents an inherent spatiotemporal challenge. These concentrations are influenced not only by pollutant levels in neighboring areas but also by their past values. This complexity aligns with the nature of a multivariate time series regression problem, as PM2.5 concentration is impacted by various factors, including weather conditions and contributions from primary and secondary pollutant sources. While these elements are commonly employed as features, there is growing interest in utilizing spectral indices derived from remote-sensing data obtained through satellite imagery. These indices are combinations of spectral bands designed to enhance the sensitivity of satellite observations to specific environmental variables, such as vegetation health, water content, and air quality. Numerous studies have examined the correlation between spectral indices and PM2.5 concentrations. The underlying concept is based on the interaction of aerosol particles, including PM2.5, with sunlight, which induces spectral modifications in the atmosphere. These alterations are detectable via remote sensing instruments, leading to the identification of empirical relationships between spectral indices and ground-based PM2.5 measurements. Specifically, spectral indices with spectral bands positioned in the visible and near-infrared segments of the electromagnetic spectrum show a strong correlation with PM2.5 concentration.
Dobrae et al. (2020) proposed a method for assessing atmospheric levels, focusing on
PM2.5 and
PM10, using Support Vector Regression (SVR), Autoregressive Integrated Moving Average (ARIMA), and Long Short-Term Memory (LSTM) models. Their comparative analysis revealed that SVR and ARIMA were the most effective methods for predicting air pollutant concentrations [
10]. In a separate study, Tshepang et al. (2023) employed various machine learning models, including Support Vector Machine, Decision Tree, Logistic Regression, K-Nearest Neighbors, CatBoost Regressor, Extreme Gradient Boosting Regressor, and Random Forest Classifier, to evaluate and predict
PM2.5 behavior. The CatBoost Regressor stood out as the most effective for
PM2.5 predictions, while both Random Forest Classifier and Decision Tree were identified as equally successful for determining Air Quality Index (AQI) status [
11].
Sangwon et al. (2021) introduced a real-time prediction model involving data interpolation and a Convolutional Neural Network (CNN) for predicting
PM10 and
PM2.5 concentrations. They demonstrated high performance through the use of spatiotemporal information and suggested a novel approach in prediction methodology [
12]. Chen et al. (2023) proposed a novel approach to predict
PM2.5 concentrations by combining a Convolutional Neural Network (CNN) with a Random Forest (RF) model [
13]. In their study, CNN was employed to extract essential meteorological and pollution data gathered from 13 monitoring stations in Kaohsiung, while the RF algorithm was used for
PM2.5 prediction. Evaluation based on root mean square error (RMSE) and mean absolute error (MAE) demonstrated that the proposed CNN–RF hybrid model outperformed the individual CNN or RF models in terms of modeling capability [
13].
Vignesh et al. (2023) collected daily
PM2.5 observational data (from January 2015 to December 2021) from the OpenAQ air quality database and implemented various machine learning algorithms, such as Linear Regression (LR), Decision Tree (DT), Gradient Boosting Regression (GBR), AdaBoost Regression (ABR), XGBoost (XGB), K-Nearest Neighbors (K-NN), Long Short-Term Memory (LSTM), Random Forest (RF), and Support Vector Machine (SVM), to predict
PM2.5 concentrations [
14]. Zaini et al. (2022) proposed a hybrid deep learning model (EEMD-LSTM) to decompose the original sequence station data of particulate matter into several subseries and predict hour-ahead
PM2.5 concentrations. The performance of the hybrid model was impressive, achieving an
R2 of more than 90 percent [
15]. Liu et al. (2021) [
16] introduced a machine learning approach (Random Forest) to estimate ambient
PM2.5 concentrations, leveraging various meteorological and satellite-derived parameters as predictors. The study demonstrated a rigorous methodology in model development and evaluation, including feature selection techniques and cross-validation methods to ensure robustness and generalizability.
The findings from Liu et al. reveal promising results, with the proposed model achieving high accuracy in predicting
PM2.5 concentrations, offering valuable insights for air quality management strategies in the Gauteng Province, South Africa [
16]. Singh et al. (2024) proposed an Integrated spatiotemporal graph neural network (ISTGNN) that effectively captures both spatial dependencies in road networks and temporal patterns in traffic data, demonstrating that integrating GNNs with temporal models significantly enhances forecasting accuracy [
17]. Yu et al. (2023) combined spatiotemporal learning with Generative Adversarial Networks (GANs) to improve atmospheric nowcasting, showing that GAN-based frameworks can generate sharper and more realistic predictions in meteorological applications [
18]. Similarly, Bentsen et al. (2023) introduced a unified graph-based formulation for wind forecasting, illustrating that representing meteorological fields as graphs enables more flexible modeling of spatial interactions [
19]. Han et al. (2023) incorporated Ollivier–Ricci curvature into spatiotemporal GNNs to enhance traffic flow forecasting, highlighting the value of geometric graph features in strengthening model robustness [
20]. Lastly, Zhu et al. (2024) developed STDNet, a spatiotemporal decomposition network for Arctic sea ice concentration prediction, demonstrating that decomposition-based frameworks effectively capture multi-scale temporal variability in environmental data. Collectively, these studies reinforce the growing trend toward using GNNs and hybrid spatiotemporal architectures for complex environmental prediction tasks, supporting their applicability to air quality forecasting, where strong spatial–temporal dependencies are also present [
21].
Yu, Yin, and Zhu (2018) introduced one of the earliest unified spatiotemporal GNN frameworks, demonstrating that coupling graph convolutions with temporal convolutions can accurately model traffic dynamics across complex road networks [
22]. Roy et al. (2021) advanced this idea by proposing USTGCN, a model that jointly aggregates spatial and temporal dependencies while incorporating recurring daily traffic patterns [
23]. Ju et al. (2024) further expanded the modeling capacity of STGNNs through COOL, which uses heterogeneous graph construction and a self-attention decoder to capture diverse and evolving traffic relationships [
24]. Ahn et al. (2024) improved short-term traffic speed prediction by integrating STGCN with CNN modules, enabling better extraction of local spatial features [
25]. The study published in Complex and Intelligent Systems (2023) strengthened interpretability by fusing external knowledge, such as weather and points of interest, into a spatiotemporal graph neural network [
26]. Ahn et al. (2023) introduced AASTGNet, which employs adaptive graph learning and attention mechanisms to dynamically reflect changing road conditions [
27]. Liu, Shojaee, and Reddy (2023) proposed a hybrid graph neural ODE framework capable of modeling both short-term local traffic fluctuations and longer-range temporal dynamics [
28].
Choi and Kim (2022) presented Rad-cGAN, a conditional GAN-based model that enhances precipitation nowcasting by generating realistic radar-based spatiotemporal forecasts [
29]. Zhang, Song, Han, and Zhang introduced a GAN-powered remote sensing fusion method that improves the spatiotemporal consistency of satellite images, providing higher-quality inputs for earth observation and environmental monitoring tasks [
30]. One of the challenges is that
PM2.5 levels often exhibit strong spatial and temporal dependencies. The pollution level at one location can be influenced by nearby locations, making it difficult for machine learning algorithms such as LSTM and CNN to yield optimal results when predicting
PM2.5. To address this predictive challenge, Graph Neural Networks (GNNs) emerge as a prime candidate. Unlike Convolutional Neural Networks and Recurrent Neural Networks, GNNs are suited to non-Euclidean data representations, such as networks of air monitoring stations. In such a network, nodes represent the monitoring stations, while edges represent interactions between stations, collectively forming a graph structure. Both nodes and edges within the graph can carry distinct features, including meteorological attributes, measurements of other pollutant concentrations, and remote sensing spectral indices.
3. Data Acquisition and Methodological Framework
In this section, we discuss the two distinct study areas and describe the datasets used in the forecasting model to address the PM2.5 concentration prediction problem. Since we are working with two different study areas and the datasets for each are not identical, we will describe the datasets for each area in the following subsections. Firstly, we describe the air monitoring data, which is used for training and validating the model. The air monitoring dataset includes measurements of various air pollutants, including PM2.5, collected from multiple monitoring stations in each study area. This dataset serves as the primary input for our model and is essential for understanding the spatial and temporal variations in PM2.5 concentrations.
The second dataset is the meteorological dataset, which supplements the air monitoring data. This dataset includes various meteorological parameters, such as temperature, humidity, wind speed, and wind direction, all of which can influence PM2.5 levels. By incorporating meteorological data, we aim to improve the accuracy of our PM2.5 concentration predictions by accounting for the effects of weather conditions on pollutant dispersion and transformation. Each subsection will provide detailed information on the data collection methods, sources, and preprocessing steps for both the air monitoring and meteorological datasets for the two study areas. This comprehensive approach ensures that the forecasting model is well informed by a diverse set of relevant factors, thereby enhancing its predictive capabilities for PM2.5 concentrations.
3.1. Study Area
Switzerland, a landlocked country covering an area of 41,285 km
2, is one of the study areas in this research. It is bordered by Germany, France, Italy, Liechtenstein, and Austria, with a maximum length of 220 km along the north–south axis and a width of 350 km from east to west. Switzerland is divided into three main geographic regions: the Swiss Alps, which occupy the southern and eastern parts of the country; the Jura mountains in the northwest; and the central plateau. Switzerland is known for having relatively low
PM2.5 concentrations compared to many other countries because it has strict environmental regulations and policies. The Swiss National Air Pollution Monitoring Network evaluates air quality at 16 sites across Switzerland, as shown in
Figure 1. The different colors indicate the types of environments in which the sensor sites are located. These sites are strategically positioned to capture representative pollution levels across diverse settings, including urban roadsides, residential areas, and rural locations. This network provides a comprehensive overview of air quality across various geographic and demographic regions, contributing valuable data for our study on
PM2.5 concentration prediction.
South Africa, covering an area of 1,221,037 km
2, is also one of the study areas in this research. It is bordered by Namibia, Botswana, and Zimbabwe to the north, and Mozambique and Eswatini (formerly Swaziland) to the northeast and east. The country consists of nine provinces, but our study focuses on one: Gauteng. Gauteng is a densely populated province with heavy industry and traffic, which cause high levels of air pollution, including
PM2.5. The air quality also changes with the seasons due to temperature inversions, which trap pollution close to the ground. Studying
PM2.5 levels in Gauteng province can help understand pollution patterns in other cities with similar problems. The South African Air Quality Information Systems (SAAQIS) provides near real-time air quality data from over 70 monitoring stations.
Figure 2 shows the locations of these monitoring stations across South Africa; the different colors indicate the types of environments in which the sensor sites are located. Unfortunately, this source does not provide specific site names comparable to those available for Switzerland. Due to data limitations, our analysis concentrates on data from only eight stations located within Gauteng province. These stations provide the most comprehensive and reliable data for our study, allowing us to accurately predict
PM2.5 concentrations in this region.
Switzerland, a high-income country with strict environmental policies and relatively low pollution levels, contrasts strongly with Gauteng, a heavily industrialized and densely populated region facing persistent air quality challenges. Examining these regions provides a valuable opportunity to understand how the varying urban settings and regulatory frameworks impact air pollution patterns, offering insights that could support the development of more flexible and internationally applicable air quality management approaches.
3.2. Air Quality Datasets
3.2.1. Switzerland Monitoring Stations
The datasets used in this study are sourced from the National Air Pollution Monitoring Network (NABEL), which collects hourly air pollution data from 16 monitoring stations. However, only 8 of these stations have gathered sufficient
PM2.5 concentration data (see
Table 1) for the entire study period, which spans from 1 January 2016 to 31 December 2021. As a result, only these 8 monitoring stations were included in this study. The monitoring stations also record temperature, precipitation, and global radiation. This comprehensive dataset provides a solid foundation for analyzing
PM2.5 concentrations and their relationship with various environmental factors.
3.2.2. South Africa: Gauteng Province Monitoring Stations
The datasets used in this study are sourced from OpenMeteo and the South African Air Quality Information Systems (SAAQIS), which provide hourly air pollution data from the available monitoring stations. However, only 8 of these stations have gathered sufficient
PM2.5 concentration data (see
Table 2) for the entire study period, which spans from 1 January 2016 to 16 November 2022. This dataset provides the necessary information for analyzing
PM2.5 concentrations and understanding their temporal and spatial variations within the specified timeframe.
3.2.3. Meteorological and Pollutant Concentration Data
Meteorological data, such as wind speed and wind direction, are essential for the model. These data were obtained from the open-source historical weather API OpenMeteo and from SAAQIS, both of which provide data at an hourly sampling rate.
Table 3 lists the features collected from this source. It is important to note that the list of features presented here differs from those provided in the
PM2.5-GNN paper by Wang et al. (2020) [
31]. Specifically, the Planetary Boundary Layer (PBL) height and K-index were not included in this study. This represents a limitation, as research has shown that the PBL height is related to the vertical dilution of pollutants and exhibits an inverse linear relationship with pollutant concentrations.
3.2.4. Remote Sensing Hyperspectral Imaging Data
In addition to meteorological and pollutant concentration data, we included remote-sensing spectral index data. These data were sourced from the MODIS satellite sensor and sampled at a daily rate. However, because the satellite collects observations only every 1 to 2 days, periodic data gaps occur; these were addressed during preprocessing to ensure temporal consistency in the dataset. The daily sampling frequency of the spectral data also constrained the NABEL, OpenMeteo, and SAAQIS datasets, which had to be brought to the same daily resolution. All features were then standardized with Z-score normalization, which transforms each feature to have a mean of 0 and a standard deviation of 1 by subtracting the feature’s mean from its values and dividing by its standard deviation. This was crucial because features in a dataset often have different scales, and standardization ensures that each feature contributes comparably to model training. As a result, our study includes 2192 daily samples from Switzerland and 2512 daily samples from Gauteng. The remote sensing spectral indices cover the area surrounding each station, although the exact resolution is unknown. The integration of these spectral indices enriches the dataset, providing additional context for analyzing and predicting PM2.5 concentrations.
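The Z-score normalization described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `zscore` and the temperature values are hypothetical and are not taken from the study's code or data.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Standardize one feature column to mean 0 and standard deviation 1."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant feature has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical daily temperature readings (degrees Celsius) for one station.
temperature = [12.0, 15.5, 9.0, 20.0, 18.5]
standardized = zscore(temperature)
print([round(v, 3) for v in standardized])
```

A common design choice, assumed here rather than stated in the text, is to compute the mean and standard deviation on the training split only and reuse them for the validation and test splits, so that no information leaks from held-out data.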
3.3. Data Processing
The meteorological and pollutant concentration data were visualized to check for any significant gaps. Aside from the stations that did not record certain pollutant concentrations, there were minimal instances of missing data. The remaining gaps were filled using cubic-spline interpolation, which proved effective for handling missing values. Once the gaps were filled, all features were converted to SI units and resampled to daily values to match the remote sensing data. This pre-processing step ensured consistency across the datasets, enabling accurate analysis and modeling.
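As a rough illustration of the daily resampling step, the sketch below averages hourly readings per calendar day and checks the resulting sample counts against the study periods stated earlier. The use of the daily mean as the aggregate is an assumption (the text does not say which aggregate was used), and the hourly series is synthetic.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def resample_daily(timestamps, values):
    """Average hourly readings into one value per calendar day (assumed aggregate)."""
    buckets = defaultdict(list)
    for ts, v in zip(timestamps, values):
        if v is not None:  # ignore readings that are still missing
            buckets[ts.date()].append(v)
    return {day: sum(vs) / len(vs) for day, vs in sorted(buckets.items())}


# Synthetic hourly series covering the Swiss study period (1 Jan 2016 to 31 Dec 2021).
start = datetime(2016, 1, 1, 0)
end = datetime(2021, 12, 31, 23)
n_hours = int((end - start).total_seconds() // 3600) + 1
timestamps = [start + timedelta(hours=h) for h in range(n_hours)]
values = [10.0 + (h % 24) for h in range(n_hours)]  # simple diurnal pattern

daily = resample_daily(timestamps, values)
n_days_switzerland = len(daily)

# Day count for the Gauteng study period (1 Jan 2016 to 16 Nov 2022), inclusive.
n_days_gauteng = (date(2022, 11, 16) - date(2016, 1, 1)).days + 1

print(n_days_switzerland, n_days_gauteng)
```

Both counts agree with the 2192 and 2512 daily samples reported for Switzerland and Gauteng.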
3.3.1. Remote Sensing Spectral Index
The remote sensing data were less reliable than the monitoring station and meteorological data, with many gaps and anomalies present in the datasets. For the remote sensing data, linear interpolation was applied, as cubic-spline interpolation caused the data to become unstable at the endpoints. Interpolation alone was not sufficient because, even after missing values are addressed, environmental time-series data may still exhibit abnormal spikes, sensor malfunctions, or unrealistic fluctuations that do not correspond to actual atmospheric conditions. To mitigate these issues, we employed an algorithm designed to identify outliers by examining both the temporal dynamics of each variable and its underlying statistical distribution. Measurements that deviated substantially from expected temporal patterns, such as isolated extreme peaks, negative pollutant concentrations, or values beyond physically plausible thresholds, were flagged as anomalous. Once identified, these data points were removed or corrected using appropriate imputation strategies to maintain the coherence and continuity of the dataset. This additional cleaning process ensured that the final dataset used for model training was reliable, internally consistent, and free from distortions that could adversely affect the predictive performance of the ST-GNN model.
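The text does not specify the outlier-detection algorithm, so the following is only one plausible sketch of the cleaning steps described above: values outside a plausible range are flagged, isolated spikes are flagged with a rolling median and median absolute deviation (MAD) test, and flagged or missing points are then filled by linear interpolation. The function names, the thresholds (`lower`, `upper`, `window`, `k`), and the example series are all illustrative assumptions.

```python
from statistics import median


def flag_outliers(series, lower=0.0, upper=500.0, window=7, k=5.0):
    """Return a copy of `series` with implausible values replaced by None.

    All thresholds are illustrative assumptions, not values from the study.
    """
    cleaned = list(series)
    n = len(series)
    for i, v in enumerate(series):
        if v is None:
            continue
        if v < lower or v > upper:  # outside the physically plausible range
            cleaned[i] = None
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neigh = [x for j, x in enumerate(series[lo:hi], start=lo)
                 if x is not None and j != i]
        if len(neigh) < 3:
            continue
        med = median(neigh)
        mad = median(abs(x - med) for x in neigh)
        if mad > 0 and abs(v - med) > k * mad:  # isolated spike
            cleaned[i] = None
    return cleaned


def interpolate_linear(series):
    """Fill None gaps linearly; gaps at either end take the nearest known value."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        return filled
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            filled[i] = filled[a] + w * (filled[b] - filled[a])
    return filled


# Hypothetical daily series with two gaps, one spike (120.0) and one negative value.
raw = [12.0, 13.0, None, 15.0, 120.0, 14.0, -3.0, 13.0, 12.0, None, 11.0]
clean = interpolate_linear(flag_outliers(raw))
print(clean)
```

Using the median and MAD rather than the mean and standard deviation keeps the spike test from being distorted by the very outliers it is meant to detect.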
Given the large number of features used to predict our target, there was a risk of the model suffering from the curse of dimensionality. To address this, the BorutaShap feature selection method was employed to identify the most important remote sensing spectral indices for predicting
PM2.5 concentrations. The feature selection process highlighted the following indices as the most important: Fluorescence Correction Vegetation Index (FCVI), Green Atmospherically Resistant Vegetation Index (GARI), Normalized Multi-band Drought Index (NMDI), Normalized Difference Vegetation Index (NDVI), Aerosol-Free Vegetation Index at 2100 nm (AFRI2100), Green Leaf Index (GLI), and Global Vegetation Moisture Index (GVMI), as shown in Equations (1)–(7), where N is the near-infrared (NIR) reflectance, R, G, and B are the red, green, and blue reflectances, and S1 and S2 are the shortwave infrared bands (typically SWIR1 and SWIR2). These vegetation indices have shown a strong correlation with
PM2.5 concentrations, particularly those built from spectral bands in the visible and near-infrared regions.
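To make the notion of a spectral index concrete, the sketch below computes two of the listed indices, NDVI and the Green Leaf Index, from band reflectances using their widely used standard definitions, with N, R, G, and B as defined above. It does not reproduce Equations (1)–(7) of the paper, and the reflectance values are hypothetical.

```python
def ndvi(n, r):
    """Normalized Difference Vegetation Index: (N - R) / (N + R)."""
    return (n - r) / (n + r)


def gli(r, g, b):
    """Green Leaf Index: (2G - R - B) / (2G + R + B)."""
    return (2 * g - r - b) / (2 * g + r + b)


# Hypothetical surface reflectances for the pixel around one station.
bands = {"N": 0.45, "R": 0.08, "G": 0.10, "B": 0.05}

ndvi_value = ndvi(bands["N"], bands["R"])
gli_value = gli(bands["R"], bands["G"], bands["B"])
print(round(ndvi_value, 3), round(gli_value, 3))
```

Both indices are normalized differences, so for non-negative reflectances they fall between -1 and 1; a production version would also need to guard against a zero denominator.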
3.3.2. Preparation
Once the remote spectral indices were obtained, they were concatenated with the weather and pollutant concentration datasets for each station. This integration resulted in a comprehensive dataset that combined spectral, meteorological, and pollutant data, enabling a more robust analysis and prediction of PM2.5 concentrations.
3.4. Methodology
The aim of this study is to forecast
PM2.5 concentrations up to 7 days ahead at a given monitoring station. One model that has proven effective for forecasting
PM2.5 concentrations is the
PM2.5-GNN, a hybrid model that combines a Graph Neural Network (GNN) and a Recurrent Neural Network (RNN). In this section, we provide a detailed description of the model used for predicting
PM2.5 concentrations. The
PM2.5-GNN model was originally proposed by Wang et al. in 2020 [
31], demonstrating significant effectiveness in leveraging domain-specific sensitivity and capturing long-term dependencies, both of which are critical for
PM2.5 forecasting.
In our study, we adapted the model and applied it to our dataset to investigate whether the inclusion of remote-sensing hyperspectral indices as node features, along with meteorological information and pollutant concentration data, could improve the prediction accuracy of PM2.5 concentrations.
3.4.1. Model Overview
The PM2.5-GNN is a hybrid model that combines a Graph Neural Network (GNN) and a Recurrent Neural Network (RNN) to model spatial dependencies and temporal dynamics, respectively.
This two-stage model represents the data as a graph, where the nodes correspond to ground-based sensors that monitor air quality, measuring various pollutants and meteorological data. The edges represent the interactions between these sensors. The GNN employs a message-passing paradigm, where nodes exchange information along the edges, iteratively aggregating and updating their representations. This process allows the model to understand spatial correlations and capture the horizontal transport of pollutants.
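The message-passing idea can be illustrated with a deliberately simplified sketch. Here each node aggregates its incoming neighbors' features as a weighted mean and then averages the result with its own features; the actual PM2.5-GNN uses learned functions for both steps, so the fixed averaging rule, the function name, and the example values are assumptions made purely for illustration.

```python
def message_passing_step(node_feats, edges, edge_weights):
    """One round of message passing with weighted-mean aggregation.

    node_feats:   list of feature vectors, one per station
    edges:        list of directed (source, target) index pairs
    edge_weights: one non-negative weight per edge
    """
    n = len(node_feats)
    dim = len(node_feats[0])
    agg = [[0.0] * dim for _ in range(n)]
    total = [0.0] * n
    for (src, dst), w in zip(edges, edge_weights):
        for k in range(dim):
            agg[dst][k] += w * node_feats[src][k]  # message sent along the edge
        total[dst] += w
    updated = []
    for i in range(n):
        if total[i] == 0:
            updated.append(list(node_feats[i]))  # no incoming edges: keep own state
        else:
            message = [a / total[i] for a in agg[i]]
            # Update: blend the node's own features with the aggregated message.
            updated.append([0.5 * (x + m) for x, m in zip(node_feats[i], message)])
    return updated


# Three hypothetical stations with a single feature (PM2.5 in micrograms per cubic meter).
feats = [[10.0], [20.0], [40.0]]
edges = [(0, 1), (2, 1), (1, 0)]  # directed: source -> target
weights = [1.0, 3.0, 1.0]

new_feats = message_passing_step(feats, edges, weights)
print(new_feats)
```

Station 1 receives messages from stations 0 and 2, so its updated value is pulled toward the more heavily weighted neighbor, while station 2 has no incoming edges and keeps its value.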
In the second stage, a Gated Recurrent Unit (GRU) is applied to the output of the knowledge-enhanced GNN, capturing the temporal dependencies in the data. By combining the GNN’s ability to model spatial dependencies with the GRU’s capability to handle temporal sequences, the PM2.5-GNN model provides a robust framework for predicting PM2.5 concentrations. The inclusion of remote-sensing hyperspectral indices, meteorological data, and pollutant concentration data as features is intended to further enhance the model’s predictive power.
3.4.2. Model Architecture
The prediction of PM2.5 concentration is framed as a spatiotemporal sequence prediction problem. We denote the PM2.5 concentration at time step $t$ as $X^t \in \mathbb{R}^N$, where $N$ is the number of monitoring stations measuring PM2.5. A directed graph $G = (V, E)$ is constructed, where $V$ is the set of nodes representing the monitoring stations, $E$ is the set of edges representing the interactions between the monitoring stations, and $A$ is the adjacency matrix determining the potential edges.
We define $P^t \in \mathbb{R}^{N \times p}$ and $Q^t \in \mathbb{R}^{M \times q}$ as the feature matrices for the nodes and edges, respectively, at time step $t$, where $p$ and $q$ are the corresponding numbers of features, and $M$ is the number of edges. The PM2.5-GNN explicitly encodes domain knowledge such as meteorological and geographical information into the attribute matrices and the graph structure. Additionally, it leverages domain knowledge from the near future by incorporating meteorological information from weather forecasting services.
Formally, forecasting at any starting point $t$ is performed by feeding the observed PM2.5 concentrations $X^t$ at the current time step, the next $T$ steps of the attribute matrices $P^{t+1}, \dots, P^{t+T}$ and $Q^{t+1}, \dots, Q^{t+T}$, and the graph structure $G$ into the model. The framework for the PM2.5 concentration prediction problem is presented in Equation (8), with an illustration provided in Figure 3.
$$[X^{t};\ P^{t+1}, \dots, P^{t+T};\ Q^{t+1}, \dots, Q^{t+T};\ G] \xrightarrow{f(\cdot)} [\hat{X}^{t+1}, \dots, \hat{X}^{t+T}] \tag{8}$$
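One common way to realize a mapping of this kind is an autoregressive rollout, in which a one-step model is applied repeatedly and each prediction is fed back as the next input. The sketch below shows only this control flow. Whether the study's implementation follows exactly this scheme is an assumption, and `step_fn`, `toy_step`, and all values are hypothetical stand-ins for the trained model and its inputs.

```python
def forecast(x_t, node_attrs, edge_attrs, graph, step_fn, horizon):
    """Roll a one-step model forward, feeding each prediction back as input.

    x_t:        observed concentrations at the current step, one per station
    node_attrs: node attribute matrices for the next `horizon` steps
    edge_attrs: edge attribute matrices for the next `horizon` steps
    graph:      graph structure (passed through to the model unchanged)
    step_fn:    hypothetical one-step model, f(x, P, Q, G) -> next x
    """
    preds = []
    x = x_t
    for k in range(horizon):
        x = step_fn(x, node_attrs[k], edge_attrs[k], graph)
        preds.append(x)
    return preds


def toy_step(x, p, q, g):
    """Toy stand-in for the model: add a per-station drift taken from the node attributes."""
    return [xi + pi for xi, pi in zip(x, p)]


horizon = 7  # forecast up to 7 days ahead
x_now = [10.0, 20.0]
P = [[1.0, -1.0] for _ in range(horizon)]  # hypothetical future node attributes
Q = [None for _ in range(horizon)]          # edge attributes unused by the toy model

predictions = forecast(x_now, P, Q, graph=None, step_fn=toy_step, horizon=horizon)
print(predictions[-1])
```

The point of the sketch is that the future attribute matrices are known inputs at every step, while the concentrations beyond the current step are always the model's own estimates.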
3.4.3. Model Parameters
Hyperparameter tuning was performed using the Weights and Biases (WandB) platform, which assisted in identifying the optimal hyperparameters for the model. For the Switzerland datasets, the best hyperparameters were a batch size of 98, a weight decay of 0.0001, and a learning rate of 0.005. For the Gauteng, South Africa datasets, the optimal hyperparameters were a batch size of 184, a weight decay of 0.0248, and a learning rate of 0.0588. The differences in optimal hyperparameters between Switzerland and Gauteng reflect variations in the datasets. Switzerland’s lower pollution variability favors a smaller batch size and learning rate, while Gauteng’s more heterogeneous data requires a larger batch size and adjusted learning rate. These region-specific settings do not compromise the general applicability of the PM2.5-GNN framework, as the architecture remains fully transferable and adapting hyperparameters to local data is standard practice in machine learning.
The model was run with two types of datasets: one consisting of meteorological and pollutant concentration data only, and the other combining meteorological data, pollutant concentrations, and remote-sensing spectral indices. The model was trained for up to 200 epochs, with an early-stopping patience of 10 epochs to prevent overfitting. The Adam optimizer with the SmoothL1 loss was used for the Swiss dataset, and the SGD optimizer with the L1 loss was used for the Gauteng dataset. This approach ensured that the model was tuned to the specific characteristics of each dataset.
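The early-stopping rule can be sketched independently of any deep learning framework. The class below is a minimal illustration, assuming that "10 epochs" means a patience of 10 consecutive epochs without improvement in the validation loss; the class name, the `min_delta` parameter, and the simulated loss curve are hypothetical.

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated validation loss: improves for 20 epochs, then plateaus at a worse value.
def simulated_val_loss(epoch):
    return 1.0 / epoch if epoch <= 20 else 0.06


stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch in range(1, 201):  # at most 200 epochs, as in the text
    if stopper.step(simulated_val_loss(epoch)):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In a real training loop the model weights from the best epoch would typically be saved and restored after stopping, which this sketch omits.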
3.5. Adaptation to Our Datasets
Nodes. Due to the limited number of monitoring stations in both Switzerland and Gauteng, we treated the monitoring stations themselves as the nodes. This approach differs from the
PM2.5-GNN paper by Wang et al. (2020) [
31], where monitoring stations were averaged for each city. While we aimed to keep the node features as similar as possible to those in the original paper, we also incorporated additional features, such as pollutant concentrations and spectral indices. This addition provided a more detailed and comprehensive set of data for each node, thereby enhancing the model’s ability to predict
PM2.5 concentrations accurately.
Graph Construction. To adapt our dataset for use in the PM2.5-GNN model, we needed to create a graph representation of the data. This was essential since we were working with a multivariate time series regression problem across multiple nodes. We constructed the graph using the geographical coordinates of the monitoring stations, specifically their longitude and latitude. By calculating the distance and direction between each pair of nodes, we ensured that each node was aware of the positions of other stations and their spatial relationships.
We also encoded these spatial relationships into the graph by using distance and direction as edge features. Several constraints were implemented to refine the graph structure and encode geographical information effectively. First, two stations were connected only if they lay within a distance threshold of 300 km. This restricts the graph to stations that are geographically close enough to influence each other, especially regarding pollutant levels, and it avoids unnecessary complexity by keeping only meaningful spatial relationships. Second, we took the altitude of mountain ranges into account with a threshold of 1200 m: if a mountain range between two stations exceeded this altitude, the stations were considered disconnected. Third, we excluded interactions between stations that had an altitude difference greater than 450 m. These conditions were incorporated into the adjacency matrix, which structured the spatial relationships and interactions between monitoring stations and enabled the PM2.5-GNN model to process both the geographical and pollutant data.
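The graph construction rules can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's implementation: distances use the haversine formula on a spherical Earth, direction is the initial compass bearing, the station tuples and the `ridge_height_m` lookup (highest terrain between a pair of stations) are hypothetical inputs, and the coordinates in the usage example are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    east = math.sin(dlmb) * math.cos(p2)
    north = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(east, north)) % 360.0


def build_edges(stations, max_dist_km=300.0, max_alt_diff_m=450.0,
                max_ridge_m=1200.0, ridge_height_m=None):
    """Return {(i, j): (distance_km, bearing_deg)} for connected station pairs.

    stations: list of (name, lat, lon, altitude_m).
    ridge_height_m: optional {(i, j) with i < j: highest terrain in m between
    the two stations}; a hypothetical input that would come from terrain data.
    """
    edges = {}
    for i, (_, lat_i, lon_i, alt_i) in enumerate(stations):
        for j, (_, lat_j, lon_j, alt_j) in enumerate(stations):
            if i == j:
                continue
            dist = haversine_km(lat_i, lon_i, lat_j, lon_j)
            if dist > max_dist_km:
                continue  # too far apart to be connected
            if abs(alt_i - alt_j) > max_alt_diff_m:
                continue  # altitude difference exceeds the threshold
            if ridge_height_m is not None:
                ridge = ridge_height_m.get((min(i, j), max(i, j)))
                if ridge is not None and ridge > max_ridge_m:
                    continue  # blocked by a high mountain range
            edges[(i, j)] = (dist, bearing_deg(lat_i, lon_i, lat_j, lon_j))
    return edges


if __name__ == "__main__":
    # Illustrative coordinates only; not the monitoring stations used in the study.
    demo = [("A", 47.3769, 8.5417, 408.0),
            ("B", 46.9480, 7.4474, 540.0),
            ("C", 46.5475, 7.9851, 3571.0)]
    for pair, (dist, bearing) in build_edges(demo).items():
        print(pair, round(dist, 1), round(bearing, 1))
```

The resulting edge dictionary maps directly onto an adjacency matrix (an entry is nonzero where a key exists), with distance and bearing available as edge attributes.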
3.6. Experiment
Since the aim of this study is to compare the prediction accuracy of PM2.5 concentration with and without the use of spectral indices, the model was tested on two datasets. The first dataset served as the control, where the input features consisted of pollutant concentrations and weather data; the second added remote-sensing spectral indices to these inputs. As shown in Table 4, the data were split into three subsets: training, validation, and test. The training datasets for both Gauteng and Switzerland spanned four years of data, which were used to train the model.
During the training process, the termination criterion was evaluated after each epoch. If the maximum number of epochs was reached or if the validation error did not improve for a specified period, the training process was terminated. In the final step, the model’s prediction accuracy was tested using the test dataset. This approach ensures that the model is trained on a comprehensive dataset and that the results are reliable for predicting PM2.5 concentrations under different input conditions. The comparison of prediction accuracy between the two datasets will allow us to assess the impact of incorporating remote-sensing spectral indices on the model’s performance.
The model’s performance is evaluated using several metrics. The commonly used root mean squared error (RMSE) and mean absolute error (MAE), defined in Equation (9), are employed to test accuracy in time series prediction:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{9}$$
where $y_i$ and $\hat{y}_i$ are the ground truth and the prediction, respectively, and $n$ is the total number of observed data samples. In addition to these standard metrics, domain-specific meteorological metrics are used to assess the model’s performance near the pollution threshold. The threshold for both Switzerland and Gauteng is set at 10 µg/m³. These additional metrics include the probability of detection (POD), critical success index (CSI), and false alarm rate (FAR).
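As a minimal sketch of the two standard error metrics, the following computes RMSE and MAE for paired lists of observed and predicted values. The function names and the sample values are illustrative assumptions, not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over paired observations and predictions."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error over paired observations and predictions."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n


if __name__ == "__main__":
    # Made-up PM2.5 values for illustration only.
    observed = [10.0, 12.0, 8.0, 15.0]
    predicted = [11.0, 10.0, 8.0, 18.0]
    print(rmse(observed, predicted), mae(observed, predicted))
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, which is why the two are usually reported together.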
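By utilizing these metrics, we can comprehensively assess the model’s accuracy and its effectiveness in predicting PM2.5 concentrations, particularly near critical pollution thresholds. This dual approach ensures that the model’s predictions are not only statistically sound but also practically relevant for air quality monitoring and public health advisories. With 1 denoting an exceedance of the threshold, hits are counted as (prediction = 1, truth = 1), misses as (prediction = 0, truth = 1), and false alarms as (prediction = 1, truth = 0). CSI, FAR, and POD are then calculated by Equation (10):
$$\mathrm{CSI} = \frac{\text{hits}}{\text{hits} + \text{misses} + \text{false alarms}}, \qquad \mathrm{FAR} = \frac{\text{false alarms}}{\text{hits} + \text{false alarms}}, \qquad \mathrm{POD} = \frac{\text{hits}}{\text{hits} + \text{misses}} \tag{10}$$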
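The threshold-based scores can be sketched in the same way. This is an illustration under assumptions: the function name `threshold_scores` is hypothetical, an exceedance is taken to mean a value at or above the threshold (the text does not say whether the comparison is strict), FAR is computed as false alarms over all predicted exceedances, and the sample values are made up.

```python
import math


def threshold_scores(y_true, y_pred, threshold=10.0):
    """Return (pod, csi, far) for exceedances of `threshold`.

    hits:         prediction = 1, truth = 1
    misses:       prediction = 0, truth = 1
    false alarms: prediction = 1, truth = 0
    """
    hits = misses = false_alarms = 0
    for y, p in zip(y_true, y_pred):
        truth = y >= threshold  # assumption: ">=" counts as an exceedance
        pred = p >= threshold
        if pred and truth:
            hits += 1
        elif truth:
            misses += 1
        elif pred:
            false_alarms += 1

    def ratio(num, den):
        # Undefined when the denominator is zero (e.g. no observed exceedances).
        return num / den if den else math.nan

    pod = ratio(hits, hits + misses)
    csi = ratio(hits, hits + misses + false_alarms)
    far = ratio(false_alarms, hits + false_alarms)
    return pod, csi, far


if __name__ == "__main__":
    # Made-up PM2.5 values for illustration only.
    observed = [12.0, 9.0, 15.0, 8.0, 11.0, 7.0]
    predicted = [11.0, 10.5, 9.0, 7.0, 13.0, 6.0]
    print(threshold_scores(observed, predicted))
```

Higher POD and CSI and lower FAR indicate better exceedance forecasts; correct non-exceedances (prediction = 0, truth = 0) do not enter any of the three scores.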
5. Conclusions
This study aimed to enhance the prediction of PM2.5 concentrations by integrating remote-sensing spectral indices with traditional meteorological and pollutant data using spatiotemporal graph neural networks (ST-GNNs). The findings indicate that incorporating hyperspectral indices improves the accuracy of PM2.5 forecasts. The performance of the proposed model was evaluated using RMSE, MAE, POD, CSI, and FAR. The dataset including spectral indices outperformed the dataset without them on every metric for Switzerland and on every metric except FAR for Gauteng. For Switzerland, the model integrating spectral indices achieved an RMSE of 1.4591, compared to 1.4660 without the indices. Similarly, MAE improved from 1.1147 to 1.1053, CSI increased from 0.8345 to 0.8387, POD improved from 0.8961 to 0.8972, and FAR decreased from 0.0760 to 0.0719. In Gauteng, South Africa, RMSE decreased from 6.3486 to 6.2319, MAE from 4.4891 to 4.4066, while CSI increased from 0.9555 to 0.9560 and POD from 0.9699 to 0.9732. However, FAR slightly increased from 0.0154 to 0.0181. In real-life situations, especially in public alert systems such as weather warnings or disaster alerts, even a small increase in FAR can cause problems. If the FAR is too high, people may stop taking alerts seriously and ignore real warnings, which could lead to slow responses or no action at all during an actual emergency, putting lives in danger. Error analysis over the prediction length indicated that, while the one-day-ahead forecast without spectral indices had a lower error, the dataset with spectral indices performed better from the two-day-ahead forecast onwards. This suggests that spectral indices provide a more robust prediction for longer forecasting periods, particularly as the error stabilizes over time for Swiss monitoring stations.
Compared with the results reported in Ref. [31], the PM2.5-GNN achieves substantially better performance for both Gauteng and Switzerland, even though several important features could not be included due to limitations in the available data sources. The study successfully demonstrated that integrating remote-sensing hyperspectral indices with traditional meteorological and pollutant data improves the accuracy of PM2.5 concentration forecasts. Furthermore, by evaluating the proposed spatial-temporal graph neural network (ST-GNN) framework in regions with distinct environmental characteristics, such as Switzerland, with its strict regulatory context and low levels of pollution, and Gauteng, a densely populated and industrialized area with persistent challenges to air quality, we confirmed the ability of the model to generalize in spatially diverse settings.
Although the magnitude of overall improvements in standard metrics (Table 5 and Table 6) appears modest, it is important to highlight that the inclusion of spectral indices may provide more substantial benefits for rare or extreme pollution events, which are particularly critical for public health and emergency response. Accurate forecasting of such events can significantly enhance early warning systems and targeted interventions, even if improvements in average metrics are small. These findings underscore the potential of ST-GNNs not only for accurate and robust air quality prediction but also for broader application in international contexts. Despite certain limitations, the results provide a strong foundation for developing flexible and transferable forecasting tools, with significant implications for public health, policy-making, and environmental management.
Limitations
Several limitations were identified in this study. Firstly, the exclusion of PBL height and K-index from the Swiss and South African studies could have influenced the predictions, as these parameters have been shown to have significant correlations with pollutant levels in other research. Additionally, the lack of terrain data in the South African graph construction may have affected the spatial interactions modeled in the graph neural network (GNN). This was further compounded by the insufficient data for these parameters, which likely impacted the overall accuracy of the model. Another challenge was the high percentage of missing data for the remote-sensing spectral indices, which led to uncertainties in selecting the final input features. Finally, Aerosol Optical Depth (AOD), which is often considered a crucial feature in air quality modeling, was not included because of its low spatial coverage and temporal resolution.