Electronics
  • Article
  • Open Access

17 January 2025

Decoding Pollution: A Federated Learning-Based Pollution Prediction Study with Health Ramifications Using Causal Inferences

School of Computer Science and Engineering, Galgotias University, Greater Noida 201310, India
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advancing Healthcare Analytics: The Role of Federated Learning and Explainability in Ensuring Data Privacy and Security

Abstract

Unprecedented levels of air pollution in our cities due to rapid urbanization have caused major health concerns, severely affecting the population, especially children and the elderly. A steady loss of ecological balance, without remedial measures like phytoremediation, coupled with alarming vehicular and industrial pollution, has pushed the Air Quality Index (AQI) and particulate matter (PM) to dangerous levels, especially in the metropolitan cities of India. Monitoring and accurate prediction of inhalable Particulate Matter 2.5 (PM2.5) and Particulate Matter 10 (PM10) levels, which aggravate and increase the risks of asthma, respiratory inflammation, bronchitis, high blood pressure, compromised lung function, and lung cancer, have become more critical than ever. To that end, this work proposes a federated learning (FL) framework for monitoring and predicting PM2.5 and PM10 across multiple locations, with a resultant impact analysis with respect to key health parameters. The proposed FL approach encompasses four stages: client selection for processing and model updates, aggregation for global model updates, a pollution prediction model with necessary explanations, and finally, the health impact analysis corresponding to the PM levels. This framework employs a VGG-19 deep learning model and leverages causal inference for interpretability, enabling accurate impact analysis across a host of health conditions. This research has employed datasets specific to India, Nepal, and China for the purposes of model prediction, explanation, and impact analysis. The approach was found to achieve an overall accuracy of 92.33%, with the causal inference-based impact analysis producing an accuracy of 84% for training and 72% for testing with respect to PM2.5, and an accuracy of 79% for training and 74% for testing with respect to PM10.
Compared to previous studies undertaken in this field, this proposed approach has demonstrated better accuracy, and is the first of its kind to analyze health impacts corresponding to PM2.5 and PM10 levels.

1. Introduction

Recent innovations in information technology have ushered in a new paradigm in the way that healthcare-related information can be processed and analyzed, and outcomes can be predicted. It is common knowledge that the air quality in our urban spaces is rapidly deteriorating due to the rampant growth of industry, vehicular traffic, and environmental imbalances. The fact that the levels of particulate matter in the air directly affect our health has made it a key area of focus for developing cutting-edge data analysis and prediction techniques for establishing preventative measures. The steady stream of pollutant particles emitted every second has become a constant health hazard for our citizens, and high-precision monitoring and prediction of the Air Quality Index (AQI) and particulate matter have become imperative for avoiding a health crisis in big cities. To address these health issues, this work introduces a collaborative learning framework that predicts pollution particulates in different locations, explains the model predictions, and analyzes the impact of pollution on various diseases. The accurate monitoring and timely prediction of short-term and long-term AQI levels and their deviation from the standard guidelines can help in timely communications to people, in formulating preventative measures, in raising awareness, and even in influencing policy matters for better air pollution control and environmental protection. With this in mind, this research work aims to leverage recent advancements in technology for the accurate prediction of AQI particles in the atmosphere, and to study the impact of pollution on health parameters so that remedies like phytoremediation can be effectively utilized for tackling rising pollution levels.
Generally, air pollution is classified as outdoor or indoor, and it is caused by different sources such as physical, biological, and chemical agents. These air pollutants harm human well-being and the surrounding environment [1]. According to the World Health Organization, around 99% of the world's population lives in areas where air pollution exceeds air quality guideline limits, owing to various human-made sources of pollution; this exposure contributes to diseases such as stroke, heart disease, lung cancer, diabetes, respiratory issues, and chronic obstructive pulmonary disease (COPD). These diseases account for around 7 million deaths annually [2]. Based on aerodynamic diameter, particulate matter (PM) is grouped into coarse, fine, and ultra-fine particles, represented as PM10, PM2.5, and PM0.1, respectively. The sources of particulate matter include wildfires, dusty roads, pollen, agricultural waste, etc. [3]. The specific impact of PMs is presented in Table 1.
Table 1. Pollutants and impact analysis.
In addition, a case study on the impact analysis of pollution conducted in Delhi is presented in Table 2 [6]. This case study clearly illustrates the adverse impact of pollution and the number of people affected by it. It is also noteworthy that various studies have demonstrated the beneficial outcomes of phytoremediation in reducing the spread and effects of pollution.
Table 2. Impact analysis case study.
Federated learning has, of late, played a key role in improving the prediction and classification of location-specific data. It is a decentralized learning technique in which different devices collaboratively train a shared model while keeping their data private. This method is effective in handling data distributed across various locations. Some of the key benefits of using federated learning in pollution prediction systems are reduced bandwidth usage in live predictions, data privacy, fault tolerance, and scalability. Thus, federated learning helps in increasing prediction accuracy by leveraging data across diverse locations.
Another recent development in the field of artificial intelligence that enables fair and transparent prediction output is Trustworthy Artificial Intelligence. Trustworthy AI (TAI) in pollution prediction ensures that the artificial intelligence system detects pollution levels in a transparent, ethical, reliable, and robust way. Various authors have presented the primary properties of TAI and researchers [7,8] have explained how TAI is utilized in various models and how the predicted output is assessed. This work uses a federated trustworthy AI-based prediction model to predict pollution data, while the corresponding impact is analyzed using a distinct impact analysis component. A key contribution of the proposed work is examining the effects of respiratory issues in children and elders using the pollution detection model. It aims to ensure that the prediction of pollution and its effects is validated with the assistance of the federated TAI model. In summary, the important contributions of the proposed work are as follows:
I. The main novelty of the proposed work is the introduction of a federated learning framework for pollution detection and analysis. The proposed federated framework uses reward-based client selection and initially aggregates locally before using the FedProx aggregation method for global aggregation.
II. The proposed framework uses the VGG-19 model for image-based pollution prediction. The model uses the LIME tool for better explanation and causal inference for impact analysis of the predicted PM2.5 and PM10.
III. The datasets utilized for the implementation were collected from ITO-Delhi, Knowledge Park-III Greater Noida, Oragadam-Tamil Nadu, New Town-Faridabad, some locations from Nagaland, and Mumbai. Additionally, some datasets were collected from Biratnagar-Nepal, and some images from Beijing. All in all, this work utilizes a vast dataset for the prediction of AQI, PM2.5, and PM10.
IV. The federated learning framework was found to attain an accuracy of 92.33%, and the causal inference-based impact analysis was found to produce an accuracy of 84% for training and 72% for testing with respect to PM2.5, and an accuracy of 79% for training and 74% for testing with respect to PM10. Hence, the proposed federated learning method was seen to produce better accuracy than previous state-of-the-art studies.
The rest of the manuscript is organized as follows: Section 2 presents related research work on various pollution prediction models using sensor-based collected data and image-based prediction models. Impact analysis models are also summarized. Section 3 proposes the federated trustworthy AI architecture along with the pollution prediction model. Section 4 presents the results and discussion, including the impact analysis of the prediction model as well as the impact analysis of PM2.5 and PM10 corresponding to specific diseases. Finally, Section 5 presents a summary of the manuscript along with recommendations for future directions.

3. Federated Learning-Based Framework for Health Assessment Using Pollution Detection

The proposed hybrid health assessment system using a pollution detection framework consists of three parts: i. Federated Learning Framework; ii. Pollution Prediction Model; and iii. Impact Analysis Model. The proposed framework is presented in Figure 1. It consists of the initial training model, the aggregation algorithm, and the representation of different locations or clients. The initial training model includes a hybrid model for pollution detection and impact analysis, an aggregation algorithm used to combine results from varying locations or client prediction neurons, and clients for the end-user results.
Figure 1. Federated learning-based framework for pollution prediction and impact analysis.

3.1. Federated Learning Framework

The FL framework consists of a global pollution prediction and impact analysis model, a client selection process, and aggregation for a decentralized learning approach. This work does not consider the privacy between the clients and the global model; instead, a continuous upgrade is initiated from the distributed locations. The working process of the federated learning framework consists of three parts: i. client selection and aggregation of clients (Section 3.1.1 and Section 3.1.2); ii. prediction and analysis of the model with the global server (Section 3.2); and iii. the impact analysis model (Section 3.3).

3.1.1. Client Selection and Aggregation of Clients

Reward-Based Client Selection: The federated client selection process is important because effective client selection reduces communication and avoids unnecessary updates. In this work, reward-based or incentive mechanisms have been used for client selection [40]. The reward-based client selection process optimizes client utility while considering server goals. The utility of client $i$ is represented as $E_i$ based on the resource expenditure.
$$E_i = f(\delta W_i^t, E_i)$$
where $f$ is the model update contribution of $\delta W_i^t$ and $E_i$ represents resources. The model update is an important contribution to client-level processing. Numerous updates on the client are considered best for learning, and $\|\cdot\|$ is used for concatenating client updates. Here, the model update is represented as $Q$.
$$Q = \|\delta W_i^t\|$$
Based on the updates of each client, the utility function $U_i$ is represented as follows:
$$U_i = \alpha \|\delta W_i^t\| - \beta E_i$$
where $\alpha$ and $\beta$ are the hyperparameters controlling the trade-off between resource update costs and model contributions. The $\|\cdot\|$ is used to concatenate the features or parameters of clients, combining them into a larger vector for evaluation. Based on the utility $U_i$ of each client and the set of actively participating clients $C_t$, the maximum reward function $R_i$ is represented as follows:
$$R_i = \gamma \frac{U_i}{\sum_{j \in C_t} U_j}$$
where $\gamma$ represents the total rewards in the different rounds, $i$ indexes the client, and $C_t$ represents the reward distribution of various clients. The $\sum$ is used to represent the sum of different distributions of rewards. Thus, an optimized mathematical model of client selection is as follows:
$$\mathrm{ClientSelection} = \mathbb{E}\left[\sum_{i \in C_t} \|\delta W_i^t\|\right]$$
where $\mathbb{E}$ represents the expectation over different clients in different rounds $t$ and the $\sum$ represents the sum of each client’s rewards and resources. The process of client selection is shown in Algorithm 1.
Algorithm 1 Client Selection Algorithm
Input:
Status of clients, reward status
Output:
Clients selected for participation
1. Initialize the model [S = 0, R = 0] // S = set of clients, R = rewards of clients
2. S++; R++ // Status information of clients
3. for each $S_i$ and $R_i$ in range do [threshold value for clients and rewards]
4. Update $S_i$ ($S_i$++)
5. Update $R_i$ ($R_i$++)
6. End for
7. Analyze and set the threshold value for participation
8. $R_i = R_q$ ($R_q$ denotes the qualified clients)
9. for i in range do
10. if ($S_i$ and $S_n$ > threshold value) then
11. $R_q$ = marked Q (Q = qualified devices)
12. End if
13. End for
14. Qualified devices for participation
15. $Q_s$ = best Q devices
17. $Q_s$ = best $Q_s$ (0: f*n)
18. Return $R_i$ and $S_i$
19. End
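The reward-based selection described above can be sketched roughly in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the client names, the threshold, the fraction `f`, and the values of `alpha`, `beta`, and `gamma` are all hypothetical, and the utility is assumed to be the update norm minus a weighted resource cost.

```python
import math

def utility(update, cost, alpha=1.0, beta=0.5):
    """Assumed utility: alpha * ||update|| - beta * cost."""
    norm = math.sqrt(sum(w * w for w in update))
    return alpha * norm - beta * cost

def rewards(utilities, gamma=100.0):
    """Share a total round reward gamma in proportion to each client's utility.

    Assumes the utilities sum to a positive number.
    """
    total = sum(utilities.values())
    return {cid: gamma * u / total for cid, u in utilities.items()}

def select_clients(reward_by_client, threshold, f=0.5):
    """Keep clients above the threshold, then the best f*n of all n clients."""
    qualified = [c for c, r in reward_by_client.items() if r > threshold]
    qualified.sort(key=lambda c: reward_by_client[c], reverse=True)
    k = max(1, int(f * len(reward_by_client)))
    return qualified[:k]

# Hypothetical clients: (model update vector, resource cost)
clients = {
    "delhi": ([3.0, 4.0], 2.0),
    "noida": ([0.6, 0.8], 1.0),
    "mumbai": ([6.0, 8.0], 4.0),
    "faridabad": ([0.0, 1.0], 1.0),
}
utilities = {cid: utility(u, c) for cid, (u, c) in clients.items()}
round_rewards = rewards(utilities)
selected = select_clients(round_rewards, threshold=10.0)
print(selected)
```

In this toy setting, clients with large updates relative to their resource cost earn most of the round reward, and only those above the threshold are retained for aggregation.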

3.1.2. Aggregation

The second part of the framework is the aggregation model. The aggregation model is classified into two types: local and global. In this work, both local and global aggregation models have been used: local aggregation was used to obtain updates from one particular revision of information, whereas global aggregation was used to receive updates from multiple locations. This work required aggregation of a dynamic nature. Dynamic aggregation algorithms such as FedProx and FedNova support heterogeneous data and dynamic participants, while FedOpt, Adaptive FedAvg, and robust aggregation algorithms help improve data quality and client reliability. In this framework, a modified FedProx algorithm was used for aggregation. FedProx [41] combines FedAvg with a proximal term used for local client updates, and it supports dynamic changes and the heterogeneity of local clients. The modified FedProx uses FedAvg [42], the proximal term, and the normalization of the average of inconsistency. The basic global model was updated using FedAvg. The Federated Averaging (FA) representation is as follows.
$$FA = \sum_{i=1}^{n} S_i \cdot \gamma \sum_{k=i}^{r-1} g_i\left(C_i^{(t,k)}\right)$$
where $C$ denotes the client model for the $k$th to $t$th communications, $\gamma$ denotes the learning rate, $g_i$ is the stochastic gradient, and $S_i$ denotes the sample size. For consistency of averaging, the proximal term was used, and with it, the average of single- and multi-location data was determined. The local and global aggregation with geometric regions is represented using the proximal term. The basic $n$-dimensional space representation is as follows.
$$(x_i, y_i),\ (x_{t+1}, y_{t+1}),\ (x_p, y_p),\ (x_{p+1}, y_{p+1})$$
where $x$ and $y$ represent positive and negative inputs. The positive and negative aggregations of the location are represented as $A^{+}$ and $A^{-}$. Positive aggregation is used to update the changes in the number of clients at a particular threshold time from the number of locations or clients. Similarly, when there is no update in the number of clients within the threshold time, the aggregation is negative. The approximate representation of the positive and negative aggregations is as follows.
$$A^{+}: w \cdot x_0^{t} + P^{+} \quad \text{and} \quad A^{-}: w \cdot y_0^{t} + p^{-}$$
The new features or samples within new or closer parameters ($N$) are represented as follows:
$$N = w \cdot x_0^{t} + P_j (p + q)$$
where $(p + q)$ denotes the samples, $p$ denotes old samples, $P_j$ denotes new similar samples, and $q$ denotes new similar samples. $x_0^{t}$ denotes updating of the positive input in the time interval between 0 and $t$. The decentralized aggregation representation of global aggregation is as follows:
U + = C + 1 e γ ( w g n t w k t )
The decentralized aggregation was updated based on the threshold and user or administrator requirements. Using the FedProx algorithm, multiple client models exchanged updates. New raw or positive input prediction parameters were exchanged, and based on the threshold value, all eligible clients participated in the aggregation process. In the proposed framework, the aggregation ensured that all the above-mentioned entities were updated from the client models to the global model.
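To make the two building blocks named above concrete, the following Python sketch shows the standard sample-size-weighted FedAvg aggregation and a single local gradient step with the standard FedProx proximal term. It illustrates the textbook forms of FedAvg and FedProx only; the modified FedProx used in this work is not reproduced here, and the function names, learning rate, and `mu` value are assumptions.

```python
def fedavg(client_weights, sample_sizes):
    """Sample-size-weighted average of client weight vectors (FedAvg)."""
    total = sum(sample_sizes)
    dim = len(client_weights[0])
    return [
        sum(n * w[d] for w, n in zip(client_weights, sample_sizes)) / total
        for d in range(dim)
    ]

def fedprox_local_step(w_local, w_global, grad, lr=0.1, mu=0.01):
    """One local step on loss + (mu / 2) * ||w - w_global||^2.

    The proximal term adds mu * (w - w_global) to the gradient, pulling
    heterogeneous clients back toward the current global model.
    """
    return [
        w - lr * (g + mu * (w - wg))
        for w, wg, g in zip(w_local, w_global, grad)
    ]

# Toy example: three clients with different amounts of local data
local_models = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]
new_global = fedavg(local_models, sizes)

# With a zero task gradient, the proximal term alone shrinks the local
# weights toward the global model.
stepped = fedprox_local_step([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], lr=0.1, mu=1.0)
print(new_global, stepped)
```

A larger `mu` keeps local models closer to the global model, which is the usual motivation for FedProx when client data are heterogeneous.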

3.2. Pollution Prediction Model

The proposed model for predicting PM2.5, PM10, and AQI used VGG19, and an explainable AI (XAI) model was used to understand the model predictions. Transfer learning was used to accelerate training and improve prediction performance. Several well-known models were examined for the purpose of prediction, such as ResNet50, InceptionV3, VGG16, MobileNetV3, etc. Previously, a combination of VGG16 with ReLU was used for prediction; however, this yielded an accuracy of only 76.92%. In this work, VGG19 was used with transfer learning for better accuracy and XAI for better prediction understanding. Figure 2 represents the pollution prediction model, whose architecture consists of fully connected layers with updated flattening layers. Along with the model, a regression model was also used to estimate PM2.5 and PM10. The fully connected layers consisted of 512, 256, and 128 neurons, respectively. All three fully connected layers initially used ReLU for the simplicity and effectiveness of the model. Subsequently, Leaky ReLU was used, rather than simple ReLU, for better accuracy.
Figure 2. Pollution prediction model using images.
The second part of the proposed work consists of the model’s understanding of the prediction and explanation of the outcomes. Different methods, such as LIME, SHAP, and Grad-CAM, can be used to explain the outcomes. This work used LIME because it is more usable and has better prediction readability. Figure 3 represents the LIME-based explainable model for the interpretation of pollution prediction, and the step-by-step process of understanding the pollution prediction is also presented. Different factors, such as traffic density, wind speed, cloud movement, industrial emissions, and precipitation, were considered in the predictions and their explanations, and were used to explore the spikes and the reasons behind them.
Figure 3. Explainable model for pollution prediction.

3.3. Pollution and Impact Analysis Model

Building an impact analysis using causal inference is an effective way of predicting the impact of pollution on nature and living organisms. This inference model helps determine how pollution affects health. The impact analysis model used here consisted of input and processing layers, as shown in Figure 4. The input layer used a pollution dataset, geographical data, health mapping data, and age group ranges. The processing unit consisted of a prediction and estimation model based on VGG-16 and Leaky ReLU. The pollution level was identified and extracted from the model, and the pollution inputs were provided using propensity score matching (PSM). PSM matches regions or individuals whose characteristics are similar to the baseline. The outcome of the prediction model was composed of baseline health data.
Figure 4. Pollution prediction and impact analysis.
The processing steps of PSM are as follows [43]. The propensity score $e(X)$ is the probability of receiving the treatment conditional on the covariates $X$.
$$e(X) = P(T = 1 \mid X)$$
where $T = 1$ if the unit is treated and $T = 0$ otherwise. The propensity score can be estimated as
$$e(X) = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)\right)}$$
The nearest matching control unit for treated unit $i$ is denoted as
$$j(i) = \arg\min_{j \in \mathrm{Cont}} \left| e(X_i) - e(X_j) \right|$$
Based on the above equation, the propensity score is matched.
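The matching step can be illustrated with a short Python sketch: a logistic propensity score is computed for each unit, and every treated unit is paired with the control unit whose score is closest. The coefficients, covariates, and function names below are hypothetical; in practice the coefficients would be estimated by fitting a logistic regression of the treatment indicator on the covariates.

```python
import math

def propensity(x, beta0, betas):
    """Logistic propensity score e(X) = 1 / (1 + exp(-(b0 + sum b_k x_k)))."""
    z = beta0 + sum(b * xk for b, xk in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

def match_nearest(treated_scores, control_scores):
    """For each treated unit, index of the control unit with the closest score."""
    return [
        min(range(len(control_scores)), key=lambda j: abs(e_i - control_scores[j]))
        for e_i in treated_scores
    ]

# Hypothetical coefficients and covariates (e.g., scaled age, scaled PM2.5 level)
beta0, betas = -1.0, [0.5, 2.0]
treated = [[1.0, 0.9], [0.2, 0.4]]
control = [[0.1, 0.3], [1.2, 0.8], [0.5, 0.5]]

e_treated = [propensity(x, beta0, betas) for x in treated]
e_control = [propensity(x, beta0, betas) for x in control]
matches = match_nearest(e_treated, e_control)
print(matches)
```

This is one-to-one nearest-neighbor matching with replacement; other variants (caliper matching, matching without replacement) differ only in how the candidate controls are restricted.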
The entire working process is presented in Algorithm 2, and the step-by-step process of the proposed method is as follows.
i. The federated learning infrastructure uses client selection and local and global aggregation models.
ii. The PM2.5, PM10, and AQI are predicted using VGG19, and the prediction model is explained through explainable AI (XAI) using LIME.
iii. The correlation model is created and combined using the impact analysis model.
iv. The causal inference model is used to analyze the impact of pollution.
v. In causal inference, PSM is used to map the outcome.
The above steps were used for each image, and the input and output results were processed simultaneously. The proposed architecture predicted PM2.5, PM10, and AQI levels, and the model was explained. The expected model outcome and new features were initially transferred to the local aggregation, and a certain threshold was transferred to the global model using the global aggregation.
Algorithm 2 Federated learning-based pollution prediction and impact analysis with causal inference.
Input:
Client Data, Global Model, Factors influencing pollution, Outcome variable using PSM
Output:
Predicted Values of PM2.5, PM10 and AQI, Explanation Using LIME, Correlation of Pollution
Algorithm Begin:
{
Create the FL Infrastructure
Identify the participants of clients
updated model = global model //Clients initialized with updated global model
}
// Client selection Process and aggregation
{
if client update reward > previous reward: //Compare client updates to previous rewards
communication status = True //Start communication for aggregation
else:
communication status = False
}
{
Initialized to aggregation
If
{
if aggregation result == “positive”:
global model = updated model // Update global model
prediction model = global model // Update prediction model
}
then
for client in participants:
client.model = updated model
}
then
{
Updated model is transferred to the Clients
}
end
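The control flow of Algorithm 2 can be approximated by the Python sketch below: a client communicates its update only when its reward exceeds its previous reward, the communicated updates are aggregated into the global model, and the result is pushed back to all participants. The dictionary layout, the `weighted_mean` aggregator, and the reward values are illustrative assumptions rather than details given in the paper.

```python
def weighted_mean(updates, sizes):
    """Sample-size-weighted mean of weight vectors (stand-in aggregator)."""
    total = sum(sizes)
    return [
        sum(n * u[d] for u, n in zip(updates, sizes)) / total
        for d in range(len(updates[0]))
    ]

def run_round(global_model, clients, previous_rewards, aggregate):
    """One federated round: reward-gated communication, then aggregation."""
    updates, sizes = [], []
    for client in clients:
        # Communicate only if the client's reward improved on the previous round
        if client["reward"] > previous_rewards.get(client["id"], 0.0):
            updates.append(client["model"])
            sizes.append(client["num_samples"])
    if updates:  # "positive" aggregation: at least one client sent an update
        global_model = aggregate(updates, sizes)
    # The (possibly updated) global model is sent back to every participant
    for client in clients:
        client["model"] = list(global_model)
    return global_model

clients = [
    {"id": "c1", "reward": 0.8, "model": [1.0, 1.0], "num_samples": 10},
    {"id": "c2", "reward": 0.2, "model": [9.0, 9.0], "num_samples": 10},
    {"id": "c3", "reward": 0.9, "model": [3.0, 3.0], "num_samples": 30},
]
previous = {"c1": 0.5, "c2": 0.4, "c3": 0.6}
new_global = run_round([0.0, 0.0], clients, previous, weighted_mean)
print(new_global)
```

Here client `c2` is skipped because its reward fell, so its outlying weights do not influence the global model in this round.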

4. Results and Discussion

This section consists of the dataset details, AQI classes of datasets, implementation details of datasets, AQI prediction particles such as PM2.5 and PM10, analysis of AQI in different locations, and finally, an impact analysis of particulate prediction and healthcare ramifications.
The dataset images were collected from multiple places in India and Nepal, and depicted the pollution levels recorded in the datasets. The images were captured using cameras at various time intervals. The dataset consisted of 12,420 images, each of 224 × 224 pixels and saved in JPEG format. The dataset was labeled according to the city and the corresponding pollution levels at different time intervals. Initially, the datasets were collected from ITO-Delhi, Greater Noida, Faridabad, Oragadam-TN, locations in Nagaland, Mumbai, and some from Biratnagar, Nepal. In this implementation, the locations used were ITO-Delhi, Greater Noida, New Industrial Town-Faridabad, Nagaland, and Oragadam-TN. The class distribution of the dataset is shown in Figure 5 [40].
Figure 5. Dataset distribution.
The implementation environment included a GPU server with an Intel(R) 4110 CPU @ 2.10 GHz, a GeForce RTX 2080 graphics card (11 GB), 128 GB RAM, and a 64-bit Ubuntu operating system. Python 3.5, Keras 2.16, and TensorFlow 1.13.1 were used for the programming environment. Table 3 presents the hyper-parameters of the implementation environment.
Table 3. Hyperparameters of model.
All input images were resized to 224 × 224 for VGG-16. The input images were normalized for the pre-trained weights, and the batch size was 16 images for the training and validation of each epoch. The experiment was conducted over 100 epochs with a learning rate of 0.000005. The Adam optimizer was used, the model was validated using K-fold cross-validation, and training, testing, and verification were conducted in a 70:20:10 ratio for better accuracy and faster convergence. The initial model accuracy achieved using VGG-19 is presented in Figure 6. After 20 epochs, the initial model achieved an accuracy of 0.92 during training; during testing, the model accuracy was seen to decrease. Similarly, the model was applied in the federated learning environment across five locations (five clients) for initial accuracy measurement. The experiment in the federated learning environment over 20 rounds produced predictions with an initial accuracy of 0.95. Compared to the previous model [8], it produced better accuracy, but the computational cost was high. The implemented model was evaluated using accuracy, F1-score, R2, RMSE, and inference time. Table 4 shows the primary evaluation for 20 epochs. With an increase in the number of epochs, the average accuracy was also seen to increase. For 100 epochs, an accuracy of 0.95 was observed during training and 0.88 during testing. Similarly, for 200 epochs, training and testing accuracy were seen to be 0.96 and 0.89, respectively.
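As a rough illustration of the data partitioning described here, the following Python sketch splits the 12,420 image indices in a 70:20:10 ratio and then generates K-fold partitions of the training portion. The value K = 5, the random seed, and the function names are assumptions, since the text does not specify them.

```python
import random

def split_indices(n, ratios=(70, 20, 10), seed=0):
    """Shuffle indices 0..n-1 and split them by integer percentage ratios."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * ratios[0] // 100
    n_test = n * ratios[1] // 100
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

def k_fold(indices, k=5):
    """Yield (train, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, held_out

train, test, val = split_indices(12420)
folds = list(k_fold(train, k=5))
print(len(train), len(test), len(val), len(folds))
```

With these ratios, the split yields 8694 training, 2484 testing, and 1242 verification images.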
Figure 6. Model initial accuracy.
Table 4. Basic evaluation metrics and predictions.
The model accuracy without and with federated learning is presented in Figure 6 and Figure 7, respectively, which clearly show the difference in accuracy between the initial model and the model with federated learning. VGG-16, when applied in the distributed environment across five clients with 20 rounds of aggregation, showed increased accuracy. The reward-based client selection helped to increase the accuracy on the client side and reduced the energy consumption between the clients and the global model. The reward-based client selection also helped to improve data on the client side. During implementation, updates from the client to the global model were made only when a modification occurred on the client side.
Figure 7. Model accuracy using federated learning.
The base model achieved a training accuracy of 88.5% and a testing accuracy of 85.5%. In contrast, the use of federated learning with 15 rounds increased the accuracy to 92.33%. The accuracy of the FL-based model was better than the previous methods because of the datasets and randomized changes to the datasets in the client model.
Results Comparison: The predicted output was compared with different literature reviews on the basis of image-based air quality and PM predictions. The prediction results were compared using training accuracy, testing accuracy, R-square, and RMSE; another comparison was based on the accuracy of the air quality index prediction. The proposed method was evaluated using training accuracy, testing accuracy, RMSE, and R-Square. Training accuracy evaluates the model’s performance on the same dataset it was trained on, indicating how well the model has learned the patterns in the data. Testing accuracy assesses the model’s performance on new and unobserved data, evaluating its generalization capability. RMSE measures the average magnitude of prediction errors by calculating the difference between predicted values and actual values. R-Square indicates the proportion of variance in the dependent variable that can be explained by the independent variables. Table 5 shows the comparisons of R-Square and RMSE values with previous state-of-the-art models from various studies as well as the initial model and the federated learning-based model of this study [44,45,46]. Compared to the previous methods, the proposed framework was seen to produce better accuracy because of factors like better air quality parameters, color and texture properties, exact mapping of features, high resolution of images, etc.
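For reference, the two regression metrics defined above can be computed as in the following Python sketch. The PM2.5 values are made up purely for illustration and are not taken from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical PM2.5 readings and model predictions
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
error = rmse(y_true, y_pred)
score = r_squared(y_true, y_pred)
print(round(error, 4), round(score, 4))
```

A lower RMSE and an R-Square closer to 1 both indicate predictions that track the measured values more closely.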
Table 5. Comparison of models using R-Square and RMSE metrics.
Apart from the above-mentioned reasons, the accuracy of air quality prediction was seen to increase due to the number of high-resolution datasets, the collection of numerous images from different locations, and time interval values. Table 6 shows the increased accuracy of the proposed method, compared to other datasets such as NWNUM-AQI [47], Linyuan-AQI [48], and Beijing [44], due to the rich set of parameters used. The accuracy (85.5%) of the proposed base model (VGG-19) was found to be 9% better than that of the similar previous model (VGG-16) explored by Sapdu Utomo et al. [8], and that of the FL-based framework was found to be 17% better; this was due to higher-quality datasets, dynamic changes in the datasets, and the sharing of datasets in a distributed way.
Table 6. Comparison of image-based air quality prediction accuracy.
Particulate Matter Prediction: The proposed work also predicted the presence of each kind of pollutant matter across different locations and clients of the federated model. Table 7, Table 8 and Table 9 show the predictions of particulate matter. Table 7 demonstrates the prediction of AQI, PM2.5, and PM10 and the corresponding accuracy. Initially, the AQI, PM2.5, and PM10 levels were predicted using the VGG-19 model with and without federated learning. The prediction value was mapped as an average of all types of particulate matter. Based on this, the AQI, PM2.5, and PM10 accuracy was calculated and the accuracy of the PM2.5 and PM10 values was found to be similar to the AQI accuracy. Compared to the base model, the federated learning-based prediction model showed better AQI, PM2.5, and PM10 accuracy.
Table 7. AQI, PM2.5, and PM10 prediction and accuracy.
Table 8. Prediction of AQI, PM2.5, and PM10 in different locations using the base model.
Table 9. Prediction of AQI, PM2.5, and PM10 in different locations using federated learning in different clients.
Table 8 shows the PM predictions from different locations such as Knowledge Park-III-Greater Noida, ITO-Delhi, and New Industrial town, Faridabad. The prediction values and corresponding month’s sensor-based prediction data were mapped for verification. The model prediction and timeline of sensor-based prediction were mapped. Due to air movements and vehicle density, the AQI and PM levels varied across locations. The AQI level of ITO-Delhi was very high due to increased vehicle movement and limited air movement.
Similarly, Table 9 presents the corresponding AQI, PM2.5, and PM10 levels predicted across different clients using federated learning. Each client was mapped with one location, and the corresponding AQI and PM prediction in the concerned location along with the sensor-based prediction were also mapped. Clients were mapped to different places such as Knowledge Park-III-Greater Noida, ITO-Delhi, New Ind town-Faridabad, Oragadam-Tamil Nadu, and some locations of Mumbai. Each client was selected based on client density, update frequency, and locally diverse data, which were compared using different thresholds. All these parameters were considered as client rewards, and based on these rewards, clients participated in the selection process. The prediction accuracy achieved was found to be more than 93% during the validation stage.

4.1. Explainability of Model Predictions

For the model explanation, the India and Nepal datasets were used, along with the Beijing dataset [44], using 100 images. For comparison purposes, the same images that were used in the previous VGG-16 model were also used in the new proposed VGG-19 model. The pre-processing of the images used in VGG-19 resulted in increased accuracy and better prediction of AQI levels. LIME explained the PM levels using the visual features of images, such as cloud patterns and color variations, which are the factors that most influence the model and aid in better prediction and classification. Based on the features, image superpixels, and observations, changes were predicted. For example, the actual value of the PM level corresponding to a good-quality evening image is 11.0, and the expected value is 9.9. The accuracy varies slightly due to the timing and the brightness features. Similar variations in the projections have been noted in the other categories of original images and the LIME predictions. The differences between the true and predicted values of PM2.5 are presented in Figure 8. The sum and the average true value of classes such as good, moderate, worst, unhealthy, and very unhealthy were 365.3 and 73.06, respectively. Similarly, the sum and average values predicted using the LIME model across the different classes were 347.9 and 69.58, respectively. Therefore, the accuracy of the prediction was observed to be 95.23%. In summary, compared to the previous model [8], the proposed VGG-19-based prediction model with federated learning infrastructure was seen to produce better results in terms of prediction accuracy.
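The reported agreement figure appears to be consistent with the ratio of the summed LIME predictions to the summed true values. The short Python check below works through that arithmetic under this assumption; the text does not state the formula explicitly, and the variable names are illustrative.

```python
true_sum = 365.3   # sum of true PM2.5 values over the five classes (from the text)
lime_sum = 347.9   # sum of LIME-predicted values over the same classes (from the text)
n_classes = 5      # good, moderate, worst, unhealthy, very unhealthy

agreement = lime_sum / true_sum    # roughly 0.952, in line with the reported accuracy
lime_mean = lime_sum / n_classes   # matches the reported average of 69.58
print(round(agreement * 100, 1), round(lime_mean, 2))
```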
Figure 8. Difference between true and actual prediction of PM2.5 using LIME.
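The aggregate accuracy quoted above appears to be consistent with the ratio of the predicted sum to the true sum, which comes to roughly 95.2%. The text does not define the metric, so that reading is an assumption. The short sketch below reproduces the arithmetic from the reported sums and also shows the relative error for the single evening image (11.0 actual, 9.9 expected); the function names are illustrative.

```python
# Aggregate values reported in the text for the PM2.5 classes.
true_sum = 365.3       # sum of true values
predicted_sum = 347.9  # sum of LIME-based predicted values


def agreement_percent(predicted, actual):
    """Share of the true total recovered by the prediction (assumed metric)."""
    return 100.0 * predicted / actual


def relative_error_percent(predicted, actual):
    """Absolute deviation of one prediction as a percentage of the true value."""
    return 100.0 * abs(actual - predicted) / actual


agreement = agreement_percent(predicted_sum, true_sum)
evening_error = relative_error_percent(9.9, 11.0)

print(f"Aggregate agreement: {agreement:.2f}%")
print(f"Evening image relative error: {evening_error:.1f}%")
```

A ratio of sums lets over- and under-predictions in different classes cancel, so a per-class error measure would give a stricter view of the same results.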

4.2. Impact and Health Analysis of PM2.5 and PM10 Model

The main advantage and contribution of the proposed work is an impact analysis based on the PM2.5 and PM10 predictions. The impact analysis of PM2.5 and PM10 undertaken with respect to children and the elderly is presented in Table 10. The working process of the impact analysis model is presented in Figure 4. The prepared dataset was analyzed using images and embedded data, such as PM2.5 and PM10 values, in each dataset field. For training and testing purposes, a total of 100 images were used for the impact analysis. The identified PM levels were then mapped for impact identification, as illustrated in Figure 4.
Table 10. Impact analysis of PM2.5 and PM10 on children and the elderly.
For mapping purposes, causal inference techniques were used, and high levels of PM2.5 and PM10 were found to be linked to significant health factors. As per the analysis, an increase in the PM2.5 levels was accompanied by a corresponding increase in the risk of hospitalization for respiratory conditions, along with increased cardiovascular risks from long-term exposure. Similarly, an increase in the PM10 levels was accompanied by increased episodes of asthma, COPD, and chronic diseases. Figure 9 illustrates the training and testing accuracies for PM2.5 and PM10 based on the impact analysis and experimental results. The training and testing accuracies for PM2.5 were 84% and 72%, respectively, and the training and testing accuracies for PM10 were 79% and 74%, respectively. In this work, the impact analysis of image-based predictions used relatively few datasets from India and China. The impact analysis of the predicted PM2.5 and PM10 values was mapped to various locations. When the locations were changed, the impact analysis was seen to automatically decrease. Therefore, travel distance and mapping predictions are essential issues that require further consideration.
Figure 9. Impact analysis of PM2.5 and PM10.
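As an illustration of the kind of reasoning a causal impact analysis involves, the sketch below estimates the effect of high PM exposure on a health event by comparing exposed and unexposed records within each age group and then averaging over the groups. This stratified adjustment is a generic textbook technique, not necessarily the procedure used in this study, and the records are hypothetical toy data rather than values from Table 10.

```python
from collections import defaultdict

# Hypothetical toy records: (age_group, high_pm_exposure, respiratory_event).
records = (
    [("children", 1, 1)] * 3 + [("children", 1, 0)] * 1
    + [("children", 0, 1)] * 1 + [("children", 0, 0)] * 3
    + [("elderly", 1, 1)] * 2
    + [("elderly", 0, 1)] * 1 + [("elderly", 0, 0)] * 1
)


def stratified_risk_difference(rows):
    """Average the exposed-minus-unexposed event rate over strata,
    weighting each stratum by its share of all records."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for group, exposed, outcome in rows:
        strata[group][exposed].append(outcome)

    total = len(rows)
    effect = 0.0
    for arms in strata.values():
        if not arms[0] or not arms[1]:
            continue  # a stratum without both arms cannot be compared
        risk_exposed = sum(arms[1]) / len(arms[1])
        risk_unexposed = sum(arms[0]) / len(arms[0])
        weight = (len(arms[0]) + len(arms[1])) / total
        effect += weight * (risk_exposed - risk_unexposed)
    return effect


ate = stratified_risk_difference(records)
print(f"Adjusted risk difference: {ate:.2f}")
```

Stratifying on age group controls only for that one variable; any other factor that influences both exposure and health outcomes, such as location, would need to be included as an additional stratum or handled with a method such as propensity scores.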
Unlike traditional machine learning models, the proposed framework was able to accurately predict results across various locations without requiring local data sharing. The integration of VGG-19 and a causal inference approach has facilitated accurate predictions and a comprehensive health impact analysis; therefore, implementing this framework on a large scale would enable robust results in both predictive air pollution monitoring and impact analysis. It can be postulated that this research has helped bridge the gap between the applied and technical domains in the health sector, integrating multiple disciplines to produce impactful research outcomes. Last but not least, this interdomain research can help support policy decisions based on the ramifications of the health impact analysis, help build pollution detection and planning guidelines based on geographical regions, and generate actionable insights to prevent avoidable health issues while monitoring the environment.
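To make concrete how predictions can be learned across locations without sharing local data, the following sketch shows a FedProx-style update [41] in plain Python. Each client takes a gradient step on its own loss plus a proximal term that keeps its weights close to the global model, and the server averages the returned weights by sample count. The learning rate, the proximal coefficient `mu`, the weight vectors, and the client sizes are hypothetical; a real model such as VGG-19 would hold millions of parameters rather than two.

```python
def fedprox_local_step(w_local, w_global, grad, lr=0.1, mu=0.01):
    """One gradient step on the local loss plus the FedProx proximal term
    (mu / 2) * ||w_local - w_global||^2."""
    return [
        wl - lr * (g + mu * (wl - wg))
        for wl, wg, g in zip(w_local, w_global, grad)
    ]


def aggregate(client_weights, client_sizes):
    """Sample-size-weighted average of the client weight vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Illustrative values only.
stepped = fedprox_local_step(w_local=[1.0], w_global=[0.0], grad=[0.5])
new_global = aggregate(
    client_weights=[[1.0, 2.0], [3.0, 4.0]],
    client_sizes=[100, 300],
)
print(stepped, new_global)
```

Only weight vectors cross the network in this sketch; the images and sensor readings used to compute each client's gradient stay on that client.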

5. Conclusions

In recent years, air pollution has become the cause of many diseases, especially in vulnerable population groups like children and the elderly. An effective air quality prediction and recommendation system is the need of the hour and is necessary for creating public health awareness. This was a primary goal behind the introduction of this federated learning-based framework for pollution prediction and analysis of its impact on health across multiple locations. The proposed federated framework used reward-based client selection and the FedProx method for global aggregation, following an initial local aggregation. The framework used the VGG-19 model for image-based pollution prediction, the LIME tool for explanation, and causal inference for the impact analysis of the predicted PM2.5 and PM10 levels. The proposed framework was observed to produce a 92.33% prediction accuracy, and the causal inference-based impact analysis model was seen to produce an accuracy of 84% for training and 72% for testing in terms of PM2.5, and an accuracy of 79% for training and 74% for testing in terms of PM10. This impact analysis similarly predicted the possible occurrence of diseases and presented the impact thereof. Compared to previous state-of-the-art studies, the proposed federated learning method was seen to produce better accuracy. In the future, a better impact analysis model can be designed to improve the overall accuracy and to better accommodate location-based and other dynamic changes; this can be investigated in tandem with an improved hybrid model that combines PM prediction, correlation models, and impact analysis results.

Author Contributions

Conceptualization, J.A. and S.B.; Methodology, J.A.; Software, J.A.; Investigation, J.A.; Resources, S.B.; Writing—review & editing, J.A. and S.B.; Funding acquisition, J.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in the study are available to other authors who require access to this material.

Acknowledgments

The authors are grateful to Galgotias University, Greater Noida, India.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Messan, S.; Shahud, A.; Anis, A.; Kalam, R.; Ali, S.; Aslam, M.I. Air-MIT: Air Quality Monitoring Using Internet of Things. Eng. Proc. 2022, 20, 45.
  2. Beriwal, S.; John, A. A review of Various Techniques for Forecasting Pollution and Air Quality Indexing. In Proceedings of the 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 11–13 November 2021; pp. 1680–1686.
  3. Gilik, A.; Ogrenci, A.S.; Ozmen, A. Air quality prediction using CNN+LSTM-based hybrid deep learning architecture. Environ. Sci. Pollut. Res. 2022, 29, 11920–11938.
  4. Xing, Y.-F.; Xu, Y.-H.; Shi, M.-H.; Lian, Y.-X. The impact of PM2.5 on the human respiratory system. J. Thorac. Dis. 2016, 8, E69.
  5. Available online: https://www.marlborough.govt.nz/environment/air-quality/smoke-and-smog/health-effects-of-pm10 (accessed on 14 January 2025).
  6. Siddique, S.; Ray, M.R.; Lahiri, T. Effects of air pollution on the respiratory health of children: A study in the capital city of India. Air Qual. Atmosphere Health 2011, 4, 95–102.
  7. Utomo, S.; John, A.; Rouniyar, A.; Hsu, H.-C.; Hsiung, P.-A. Federated Trustworthy AI Architecture for Smart Cities. In Proceedings of the 2022 IEEE International Smart Cities Conference (ISC2), Pafos, Cyprus, 26–29 September 2022; pp. 1–7.
  8. Utomo, S.; John, A.; Pratap, A.; Jiang, Z.-S.; Karthikeyan, P.; Hsiung, P.-A. AIX Implementation in Image-Based PM2.5 Estimation: Toward an AI Model for Better Understanding. In Proceedings of the 2023 15th International Conference on Knowledge and Smart Technology (KST), Phuket, Thailand, 21–24 February 2023; pp. 1–6.
  9. McGovern, A.; Ebert-Uphoff, I.; Gagne, D.J.; Bostrom, A. Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environ. Data Sci. 2022, 1, e6.
  10. Yan, R.; Liao, J.; Yang, J.; Sun, W.; Nong, M.; Li, F. Multi-hour and multi-site air quality index forecasting in Beijing using CNN, LSTM, CNN-LSTM, and spatiotemporal clustering. Expert Syst. Appl. 2021, 169, 114513.
  11. Du, S.; Li, T.; Yang, Y.; Horng, S.J. Deep Air Quality Forecasting Using Hybrid Deep Learning Framework. IEEE Trans. Knowl. Data Eng. 2019, 33, 2412–2424.
  12. Zhao, Z.; Qin, J.; He, Z.; Li, H.; Yang, Y.; Zhang, R. Combining forward with recurrent neural networks for hourly air quality prediction in Northwest of China. Environ. Sci. Pollut. Res. 2020, 27, 28931–28948.
  13. Samal, K.K.R.; Babu, K.S.; Das, S.K. Time Series Forecasting of Air Pollution using Deep Neural Network with Multi-output Learning. In Proceedings of the 2021 IEEE 18th India Council International Conference (INDICON), Guwahati, India, 19–21 December 2021.
  14. Zou, G.; Zhang, B.; Yong, R.; Qin, D.; Zhao, Q. FDN-learning: Urban PM2.5-concentration Spatial Correlation Prediction Model Based on Fusion Deep Neural Network. Big Data Res. 2021, 26, 100269.
  15. Zhou, Y.; Chang, F.-J.; Chang, L.-C.; Kao, I.-F.; Wang, Y.-S. Explore a deep learning multi-output neural network for regional multi-step-ahead air quality forecasts. J. Clean. Prod. 2018, 209, 134–145.
  16. Beriwal, S.; A, J.; K, S.K. Spatial and Temporal based Pollution Forecasting using Hybrid Model. In Proceedings of the 2022 International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 9–11 May 2022; pp. 991–998.
  17. Huang, Y.; Ying, J.J.-C.; Tseng, V.S. Spatio-attention embedded recurrent neural network for air quality prediction. Knowledge-Based Syst. 2021, 233, 107416.
  18. Zhang, K.; Thé, J.; Xie, G.; Yu, H. Multi-step ahead forecasting of regional air quality using spatial-temporal deep neural networks: A case study of Huaihai Economic Zone. J. Clean. Prod. 2020, 277, 123231.
  19. Zhang, Q.; Han, Y.; Li, V.O.K.; Lam, J.C.K. Deep-AIR: A Hybrid CNN-LSTM Framework for Fine-Grained Air Pollution Estimation and Forecast in Metropolitan Cities. IEEE Access 2022, 10, 55818–55841.
  20. Wang, Q.; Liu, Y.; Pan, X. Atmosphere pollutants and mortality rate of respiratory diseases in Beijing. Sci. Total Environ. 2008, 391, 143–148.
  21. Abe, K.C.; Miraglia, S.G.E.K. Health impact assessment of air pollution in São Paulo, Brazil. Int. J. Environ. Res. Public Health 2016, 13, 694.
  22. Olstrup, H. An Air Quality Health Index (AQHI) with Different Health Outcomes Based on the Air Pollution Concentrations in Stockholm during the Period of 2015–2017. Atmosphere 2020, 11, 192.
  23. Xu, K.; Cui, K.; Young, L.-H.; Wang, Y.-F.; Hsieh, Y.-K.; Wan, S.; Zhang, J. Air Quality Index, Indicatory Air Pollutants and Impact of COVID-19 Event on the Air Quality near Central China. Aerosol Air Qual. Res. 2020, 20, 1204–1221.
  24. Abelsohn, A.; Stieb, D.M. Health effects of outdoor air pollution: Approach to counseling patients using the Air Quality Health Index. Can. Fam. Physician 2011, 57, 881–887.
  25. Air Quality, Health Impacts and Burden of Disease Due to Air Pollution (PM10, PM2.5, NO2 and O3): Application of AirQ+ Model to the Camp de Tarragona County (Catalonia, Spain). Available online: https://europepmc.org/article/med/31759725 (accessed on 15 January 2025).
  26. Jalili, M.; Ehrampoush, M.H.; Mokhtari, M.; Ebrahimi, A.A.; Mazidi, F.; Abbasi, F.; Karimi, H. Ambient air pollution and cardiovascular disease rate an ANN modeling: Yazd-Central of Iran. Sci. Rep. 2021, 11, 16937.
  27. Available online: http://aphekom.org/web/aphekom.org/home (accessed on 15 January 2025).
  28. Fei, Z.; Ryeznik, Y.; Sverdlov, A.; Tan, C.W.; Wong, W.K. An overview of healthcare data analytics with applications to the COVID-19 pandemic. IEEE Trans. Big Data 2021, 8, 1463–1480.
  29. Li, L.; Fan, Y.; Tse, M.; Lin, K. A review of applications in federated learning. Comput. Ind. Eng. 2020, 149, 106854.
  30. Niknam, S.; Dhillon, H.S.; Reed, J.H. Federated learning for wireless communications: Motivation, opportunities, and challenges. IEEE Commun. Mag. 2020, 58, 46–51.
  31. Jiang, J.C.; Kantarci, B.; Oktug, S.; Soyata, T. Federated Learning in Smart City Sensing: Challenges and Opportunities. Sensors 2020, 20, 6230.
  32. Nguyen, D.-V.; Zettsu, K. Spatially-distributed Federated Learning of Convolutional Recurrent Neural Networks for Air Pollution Prediction. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 3601–3608.
  33. Abimannan, S.; A, J.; Shukla, S.; Satheesh, D. Federated Learning for Improved Air Pollution Prediction: A Combined LSTM-SVR Approach. In Proceedings of the 2023 IEEE 4th Annual Flagship India Council International Subsections Conference (INDISCON), Mysore, India, 5–7 August 2023; pp. 1–7.
  34. Neo, E.X.; Hasikin, K.; Mokhtar, M.I.; Lai, K.W.; Azizan, M.M.; Razak, S.A.; Hizaddin, H.F. Towards Integrated Air Pollution Monitoring and Health Impact Assessment Using Federated Learning: A Systematic Review. Front. Public Health 2022, 10, 851553.
  35. Smuha, N.A. The EU Approach to Ethics Guidelines for Trustworthy Artificial Intelligence. Comput. Law Rev. Int. 2019, 20, 97–106.
  36. Liu, H.; Wang, Y.; Fan, W.; Liu, X.; Li, Y.; Jain, S.; Tang, J. Trustworthy ai: A computational perspective. ACM Trans. Intell. Syst. Technol. 2022, 14, 1–59.
  37. Ho, C.W.L.; Ali, J.; Caals, K. Ensuring trustworthy use of artificial intelligence and big data analytics in health insurance. Bull. World Health Organ. 2020, 98, 263.
  38. Putra, M.A.P.; Karna, N.; Alief, R.N.; Zainudin, A.; Kim, D.-S.; Lee, J.-M.; Sampedro, G.A. PureFed: An Efficient Collaborative and Trustworthy Federated Learning Framework Based on Blockchain Network. IEEE Access 2024, 1, 82413–82426.
  39. Lee, W. Reward-based participant selection for improving federated reinforcement learning. ICT Express 2022, 9, 803–808.
  40. Rouniyar, A.; Utomo, S.; John, A.; Hsiung, P.A. Air Pollution Image Dataset from India and Nepal. 2023. Available online: https://www.kaggle.com/datasets/adarshrouniyar/air-pollution-image-dataset-from-india-and-nepal (accessed on 15 January 2025).
  41. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
  42. Ayeelyan, J.; Utomo, S.; Rouniyar, A.; Hsu, H.C.; Hsiung, P.A. Federated learning design and functional models: Survey. Artif. Intell. Rev. 2025, 58, 21.
  43. Li, M. Using the propensity score method to estimate causal effects: A review and practical guide. Organ. Res. Methods 2013, 16, 188–226.
  44. Liu, C.; Tsow, F.; Zou, Y.; Tao, N. Particle Pollution Estimation Based on Image Analysis. PLoS ONE 2016, 11, e0145955.
  45. Bo, Q.; Yang, W.; Rijal, N.; Xie, Y.; Feng, J.; Zhang, J. Particle Pollution Estimation from Images Using Convolutional Neural Network and Weather Features. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 3433–3437.
  46. Wang, X.; Wang, M.; Liu, X.; Zhang, X.; Li, R. A PM2.5 concentration estimation method based on multi-feature combination of image patches. Environ. Res. 2022, 211, 113051.
  47. Zhang, Q.; Fu, F.; Tian, R. A deep learning and image-based model for air quality estimation. Sci. Total. Environ. 2020, 724, 138178.
  48. APCI. Available online: https://data.moenv.gov.tw/en/dataset/detail/aqx_p_488 (accessed on 10 December 2024).
  49. Zhang, Q.; Tian, L.; Fu, F.; Wu, H.; Wei, W.; Liu, X. Real-Time and Image-Based AQI Estimation Based on Deep Learning. Adv. Simul. 2022, 5, 2100628.
  50. Kow, P.-Y.; Hsia, I.-W.; Chang, L.-C.; Chang, F.-J. Real-time image-based air quality estimation by deep learning neural networks. J. Environ. Manag. 2022, 307, 114560.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
