Decoding Pollution: A Federated Learning-Based Pollution Prediction Study with Health Ramifications Using Causal Inferences

Beriwal, Snehlata; Ayeelyan, John

doi:10.3390/electronics14020350


by Snehlata Beriwal and John Ayeelyan *

School of Computer Science and Engineering, Galgotias University, Greater Noida 201310, India

* Author to whom correspondence should be addressed.

Electronics 2025, 14(2), 350; https://doi.org/10.3390/electronics14020350

Submission received: 10 December 2024 / Revised: 10 January 2025 / Accepted: 12 January 2025 / Published: 17 January 2025 / Corrected: 29 October 2025

(This article belongs to the Special Issue Advancing Healthcare Analytics: The Role of Federated Learning and Explainability in Ensuring Data Privacy and Security)


Abstract

Unprecedented levels of air pollution in our cities due to rapid urbanization have caused major health concerns, severely affecting the population, especially children and the elderly. A steady loss of ecological balance, without remedial measures like phytoremediation, coupled with alarming vehicular and industrial pollution, has pushed the Air Quality Index (AQI) and particulate matter (PM) to dangerous levels, especially in the metropolitan cities of India. Monitoring and accurate prediction of inhalable Particulate Matter 2.5 (PM2.5) and Particulate Matter 10 (PM10) levels, which increase the risks of asthma, respiratory inflammation, bronchitis, high blood pressure, compromised lung function, and lung cancer, have become more critical than ever. To that end, the authors of this work propose a federated learning (FL) framework for monitoring and predicting PM2.5 and PM10 across multiple locations, with a resultant impact analysis with respect to key health parameters. The proposed FL approach encompasses four stages: client selection for processing and model updates, aggregation for global model updates, a pollution prediction model with necessary explanations, and finally, the health impact analysis corresponding to the PM levels. This framework employs a VGG-19 deep learning model, and leverages Causal Inference for interpretability, enabling accurate impact analysis across a host of health conditions. This research has employed datasets specific to India, Nepal, and China for the purposes of model prediction, explanation, and impact analysis. The approach was found to achieve an overall accuracy of 92.33%, with the causal inference-based impact analysis producing an accuracy of 84% for training and 72% for testing with respect to PM2.5, and an accuracy of 79% for training and 74% for testing with respect to PM10. Compared to previous studies undertaken in this field, the proposed approach has demonstrated better accuracy, and is the first of its kind to analyze health impacts corresponding to PM2.5 and PM10 levels.

Keywords:

AQI; federated learning framework; pollutant prediction model; impact analysis of PM2.5 and PM10; health analysis

1. Introduction

Recent innovations in information technology have ushered in a new paradigm in the way that healthcare-related information can be processed and analyzed, and outcomes can be predicted. It is common knowledge that the air quality in our urban spaces is rapidly deteriorating due to the rampant growth of industry, vehicular traffic, and environmental imbalances. The fact that the levels of particulate matter in the air directly affect our health has made it a key area of focus for developing cutting-edge data analysis and prediction techniques for establishing preventative measures. The steady stream of pollutant particles emitted every second has become a constant health hazard for our citizens, and high-precision monitoring and prediction of the Air Quality Index (AQI) and particulate matter have become imperative for avoiding a health crisis in big cities. To address these health issues, this work introduces a collaborative learning framework for predicting pollution particulates in different locations, explains the model predictions, and analyzes the impact of pollution on various diseases. The accurate monitoring and timely prediction of short-term and long-term AQI levels and their deviation from the standard guidelines can help in timely communications to people, in formulating preventative measures, in raising awareness, and even in influencing policy matters for better air pollution control and environmental protection. With this in mind, this research work aims to leverage recent advancements in technology for the accurate prediction of the AQI and particulate matter in the atmosphere, and to study the impact of pollution on health parameters so that remedies like phytoremediation can be effectively utilized for tackling rising pollution levels.

Generally, air pollution is classified as outdoor or indoor, and it is caused by different sources such as physical, biological, and chemical agents. These air pollutants harm human well-being and surrounding environments [1]. According to the World Health Organization, around 99% of the world’s population lives in areas where air quality exceeds guideline limits, owing to different human-caused sources of air pollution; this contributes to diseases such as stroke, heart disease, lung cancer, diabetes, respiratory issues, and chronic obstructive pulmonary disease (COPD). These diseases account for about 7 million deaths annually [2]. In the AQI, particulate matter (PM) is grouped by aerodynamic diameter into coarse, fine, and ultra-fine particles, represented as PM10, PM2.5, and PM0.1, respectively. The sources of particulate matter include wildfires, dusty roads, pollen, and agricultural waste [3]. The specific impact of PMs is presented in Table 1.

In addition, a case study on the impact analysis of pollution conducted in Delhi is presented in Table 2 [6]. This case study clearly illustrates the adverse impact of pollution and the number of people affected by it. It is also noteworthy that various studies have demonstrated the beneficial outcomes of phytoremediation in reducing the spread and effects of pollution.

Federated learning has, of late, played a key role in improving the prediction and classification of location specific data. It is a decentralized learning technique where different devices are used to collaboratively train the relevant model while keeping the data private. This method is effective in handling data distributed across various locations. Some of the key benefits of using federated learning in pollution prediction systems are reduced bandwidth usage in live predictions, data privacy, fault tolerance, and scalability. Thus, federated learning helps in increasing prediction accuracy by leveraging data across diverse locations.

Another recent development in the field of artificial intelligence that enables fair and transparent prediction output is Trustworthy Artificial Intelligence. Trustworthy AI (TAI) in pollution prediction ensures that the artificial intelligence system detects pollution levels in a transparent, ethical, reliable, and robust way. Various authors have presented the primary properties of TAI and researchers [7,8] have explained how TAI is utilized in various models and how the predicted output is assessed. This work uses a federated trustworthy AI-based prediction model to predict pollution data, while the corresponding impact is analyzed using a distinct impact analysis component. A key contribution of the proposed work is examining the effects of respiratory issues in children and elders using the pollution detection model. It aims to ensure that the prediction of pollution and its effects is validated with the assistance of the federated TAI model. In summary, the important contributions of the proposed work are as follows:

I.: The main novelty of the proposed work is the introduction of a federated learning framework for pollution detection and analysis. The proposed federated framework uses reward-based client selection and initially aggregates locally before using the FedProx aggregation method for global aggregation.
II.: The proposed framework uses the VGG-19 model for image-based pollution prediction. The model uses the LIME tool for better explanation and causal inference for impact analysis of the predicted PM2.5 and PM10.
III.: The datasets utilized for the implementation were collected from ITO-Delhi, Knowledge Park-III Greater Noida, Oragadam-Tamil Nadu, New town—Faridabad, some locations from Nagaland, and Mumbai. Additionally, some datasets were collected from Biratnagar—Nepal, and some images from Beijing. All in all, this work utilizes a vast dataset for the prediction of AQI, PM2.5, and PM10.
IV.: The federated learning framework was found to attain an accuracy of 92.33%, and the causal inference-based impact analysis was found to produce an accuracy of 84% for training and 72% for testing with respect to PM2.5, and an accuracy of 79% for training and 74% for testing with respect to PM10. Hence, the proposed federated learning method was seen to produce better accuracy than previous state-of-the-art studies.

The rest of the manuscript is organized as follows: Section 2 presents related research work on various pollution prediction models using sensor-based collected data and image-based prediction models. Impact analysis models are also summarized. Section 3 proposes the federated trustworthy AI architecture along with the pollution prediction model. Section 4 presents the results and discussion, including the impact analysis of the prediction model as well as the impact analysis of PM2.5 and PM10 corresponding to specific diseases. Finally, Section 5 presents a summary of the manuscript along with recommendations for future directions.

2. Related Work

Most of the existing work related to this area can be divided into four categories: analysis of different pollution prediction models, federated learning for pollution prediction and how it supports prediction and better performance, how TAI has been used in prediction and analysis, and finally, phytoremediation and its use in impact analyses of pollution. S. Beriwal et al. [2] presented a variety of pollution prediction models in their study, which included metrics and future directions of research. S. Utomo et al. [5] presented AIX for pollution prediction, as well as an explainable tool (LIME) that aided the authors in interpreting the results using a CNN model. Gilik, Aysenur, et al. [9] proposed a hybrid model using a CNN and LSTM for pollution prediction using spatiotemporal features with the help of information on all pollutants and predicted data of meteorological data. In addition, the authors used transfer learning from the source city to the targeted city for better predictions.

Rui Yan et al. [10] performed multi-hour and multi-site AQI forecasting using different models such as LSTM, a CNN, and a CNN-LSTM with spatiotemporal clustering. This spatiotemporal clustering and use of different models helped to somewhat improve the forecasting. Also, the CNN-LSTM method’s performance was found to be better than that of CNN and BPNN. Du, Shengdong, et al. [11] presented a hybrid framework for forecasting air pollution using 1D-CNNs and Bi-LSTM, with different dynamic features and spatiotemporal correlation used for forecasting PM2.5. The authors used two time series datasets for implementation, and different data fusions helped address real-time dynamic variations. Zhao, Zhili, et al. [12] used the hybrid model CERL, with a forward NN and a recurrent NN, to predict the hourly AQI. This model predicted the short-term AQI from time series data.

Samal, K. et al. [13] proposed a multi-output learning approach for simultaneously forecasting pollution using multiple variable inputs. Using this multi-output, forecasting of different particulate matters such as PM2.5 and PM10 was undertaken. Zou, Guojian, et al. [14] proposed an FDN learning model based on a fusion DNN, combining three models: stacked anti-autoencoder, Gaussian fusion mode, and LSTM. This model has three layers; the final layer is used to predict and forecast pollution. The FDN learning received a better correlation compared to other methods. Yanlai et al. [15] proposed regional air quality forecasting using a deep learning multi-output neural network. In this work, the DM-LSTM model was evaluated by three time series of PM2.5, PM10, and NOx simultaneously at five air quality monitoring stations in Taipei. S. Beriwal et al. [16] presented a location and timing-based pollution forecasting model using hybrid techniques such as SGB, LSTM, and a D-compose tree for forecasting on and managing past and present data. Also, the authors used four locations from Delhi for AQI prediction. Similarly, Huang et al. [17] presented a model for regional air quality forecasting using a spatial–temporal DNN. The authors used historical air measurement data and meteorological data for forecasting. Kefei et al. [18] also used a similar spatial–temporal correlation for AQI prediction. The SPATTRN method was used to monitor the stations with the help of spatial–temporal relationships. Q. Zhang et al. [19] also used a similar pattern for AQI prediction, but the authors used more extended temporal deep learning with transfer learning for better prediction. The authors proposed a TL-BLSTM model for air quality prediction and used fewer spatial features and larger temporal features for PM2.5 prediction.

The authors of [20,21,22,23,24,25,26] presented various atmospheric pollution levels and their impact. Due to the high pollution levels, mortality rates and other health issues were found to be concentrated. Qixin Wang et al. [20] predicted the presence of particulate matters such as SO₂, CO, etc., in Beijing using the ARIMA model. The expected output, correlated using the neural network, predicted the death rate and respiratory diseases from 2005 to 2008. Similarly, Karina Camasmie Abe et al. [21] analyzed the health impact of air pollution in Brazil using the APHEKOM tool from 2009 to 2011 [27]. The authors analyzed the short-term and long-term effects of ozone and PM2.5 and the effect on mortality. Henrik Olstrup [22] analyzed the AQHI and its outcomes using pollution concentration from 2015 to 2017. The author used traditional methods to predict pollutants, such as chemiluminescence, UV absorption, and gravitation. Finally, the risk associated with NO₂, O₃, and mortality rates for three years was summarized using the prediction of pollutants. Kaijie Xu et al. [23] analyzed the impact of air pollutants and COVID-19 in central China. The authors used AQI rates to correlate six classes of AQI values. Based on the distribution of AQI values, the impact was analyzed. The authors of [24,25,26] analyzed the effects of pollutants using different methods and correlated them in various locations such as Canada, Spain, and Iran. The authors of [4,6,25] analyzed pollutants and summarized the impact of pollutants. The authors of [5] presented the effect of contaminants, as presented in Table 1. Similarly, the author of [6] conducted a case study and the relevant affected ratios are presented in Table 2. The authors of [28] presented similar research for healthcare analysis and corresponding pandemic indications. They summarized the computational techniques and methods and corresponding impact usage. 
For example, deep learning methods (CNNs, RNNs) and corresponding impacts, such as COVID-19 disease classification, are presented.

In recent times, federated learning has played an innovative role in machine learning by helping avoid and overcome issues associated with centralized data gathering and data movement. The primary purpose of federated learning has been to enable model updates while maintaining data privacy and, in this context, the main functionalities of federated learning include client selection, aggregation, data management, data confidentiality, and reduction in communication cost. More specifically, client selection and aggregations are important for model training and learning [29,30]. Ji Chu Jiang et al. [31] presented various challenges and how federated learning is useful in Smart City sensing and data analysis. The authors suggested using a centralized machine learning model based on decentralized local data. The authors also elaborated on various associated challenges, such as user incentives, data trustworthiness, data quality management, etc. The authors of [32,33] used federated learning for pollution predictions. The authors used CRNN and LSTM algorithms for pollution prediction and received more than 95% accuracy in the prediction level. Similarly, En Xin Neo et al. [34] presented the impact analysis of pollutants and suggested federated learning for monitoring pollution. In keeping with the federated learning framework, the authors proposed integrating and extending the impact analysis into medical and environmental data.

Another recent technology innovation, trustworthy artificial intelligence (TAI), helps establish trust and belief in the predicted output. The application of TAI ensures that the AI prediction is safe, secure, and transparent. The AI HLEG has summarized the requirements of TAI [35], which has helped to further evaluate the predictions. Amy McGovern et al. [9] and Liu Haochen et al. [36] elaborated why TAI is recommended for environmental analysis, and the authors held the opinion that TAI is necessary for avoiding misconceptions and misinterpretations of the predicted outcomes. The authors from [37,38,39] utilized TAI for efficient data analysis with federated learning and used federated learning-based TAI for intelligent city application platforms. Similarly, Utomo, Sapdo et al. [39] proposed a TAI-based federated learning platform for smart city applications, and integrated TAI and federated learning in their work. The requirement of TAI is fulfilled through various implementations, including explainable AI, federated learning, and a variety of model implementations. Thus, TAI is deemed crucial for ensuring the accuracy of prediction outputs and for establishing accurate correlations.

Limitation: Recently, various machine and deep learning algorithms have been proposed for the prediction of AQI and pollutant levels. One of the primary concerns in this type of prediction is verifying the accuracy of the expected output. Additionally, further analysis of the correlation between the predicted outcomes and health issues is often necessary before actionable recommendations can be made. Phytoremediation, for instance, which can be utilized for reducing the spread of pollutant particles, can be time consuming and is not without its challenges. To address these issues, an effective prediction framework needs to be established in order to improve the overall AQI prediction as well as the impact analysis of pollution.

3. Federated Learning-Based Framework for Health Assessment Using Pollution Detection

The proposed hybrid health assessment system using a pollution detection framework consists of three parts: i. Federated Learning Framework; ii. Pollution Prediction Model; and iii. Impact Analysis Model. The proposed framework is presented in Figure 1. It consists of the initial training model, the aggregation algorithm, and the representation of different locations or clients. The initial training model includes a hybrid model for pollution detection and impact analysis, an aggregation algorithm used to combine results from varying locations or client prediction neurons, and clients for the end-user results.

3.1. Federated Learning Framework

The FL framework consists of a global pollution prediction and impact analysis model, a client selection process, and aggregation for a decentralized learning approach. This work does not consider the privacy between the clients and the global model; instead, a continuous upgrade is initiated from the distributed locations. The working process of the federated learning framework consists of three parts: i. client selection and aggregation of clients (Section 3.1.1) and (Section 3.1.2); ii. prediction and analysis of the model with the global server (Section 3.2); and iii. impact analysis model (Section 3.3).

3.1.1. Client Selection and Aggregation of Clients

Reward-Based Client Selection: The federated client selection process is important because effective client selection reduces communication and avoids unnecessary updates. In this work, a reward-based (incentive) mechanism is used for client selection [40]. The reward-based client selection process optimizes client utility while considering server goals. The utility of client $i$, denoted $U_{i}$, is a function of its model update and its resource expenditure $E_{i}$:

$U_{i} = f(\delta W_{i}^{t}, E_{i})$ (1)

where $f$ captures the contribution of the model update $\delta W_{i}^{t}$ and $E_{i}$ represents resources. The model update is an important contribution to client-level processing. Numerous updates on the client are considered best for learning, and $\| \cdot \|$ is used for concatenating client updates. Here, the model update is represented as $Q$.

$Q = \| \delta W_{i}^{t} \|$ (2)

Based on the updates of each client, the utility function is modeled as follows:

$U_{i} = \alpha \| \delta W_{i}^{t} \| - \beta E_{i}$ (3)

where $\alpha$ and $\beta$ are hyperparameters that control the trade-off between model contributions and resource update costs. The $\| \cdot \|$ is used to concatenate the features or parameters of clients, combining them into a larger vector for evaluation. Based on the updates and the resource expenditure $E_{i}$ of each client, and depending on the active participating clients $C_{t}$, the maximum reward function $R_{i}$ is represented as follows:

$R_{i} = \gamma \frac{U_{i}}{\sum_{j \in C_{t}} U_{j}}$ (4)

where $\gamma$ represents the total reward in the different rounds and the sum over $j \in C_{t}$ represents the reward distribution across the various clients. Thus, an optimized mathematical model of client selection is as follows:

$\mathrm{ClientSelection} = \mathbb{E} \sum_{i \in C_{t}} \| \delta W_{i}^{t} \|$ (5)

where $\mathbb{E}$ represents the expectation over different clients in different rounds $t$ and the sum runs over each client’s rewards and resources. The process of client selection is shown in Algorithm 1.
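Equations (3) and (4) can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes that $\| \cdot \|$ is evaluated as the Euclidean norm of the flattened, concatenated update, and the function names, the values of `alpha`, `beta`, and `gamma`, and the client data are all hypothetical.

```python
import numpy as np

def client_utility(delta_w, energy, alpha=1.0, beta=0.1):
    """Eq. (3): U_i = alpha * ||delta W_i^t|| - beta * E_i.

    Assumption: ||.|| is the Euclidean norm of the update after all
    layer tensors are flattened and concatenated into one vector.
    """
    flat = np.concatenate([np.ravel(layer) for layer in delta_w])
    return alpha * np.linalg.norm(flat) - beta * energy

def client_rewards(utilities, gamma=100.0):
    """Eq. (4): R_i = gamma * U_i / sum_j U_j over the active clients."""
    utilities = np.asarray(utilities, dtype=float)
    total = utilities.sum()
    if total <= 0:
        raise ValueError("total utility must be positive to share rewards")
    return gamma * utilities / total

# Hypothetical updates (two layers each) and resource costs for three clients.
updates = [
    [np.full((2, 2), 0.5), np.zeros(2)],
    [np.full((2, 2), 1.0), np.zeros(2)],
    [np.full((2, 2), 1.5), np.zeros(2)],
]
energies = [1.0, 2.0, 3.0]

utilities = [client_utility(dw, e) for dw, e in zip(updates, energies)]
rewards = client_rewards(utilities)
print(utilities)
print(rewards)
```

Under these assumptions the rewards always sum to `gamma`, so a client's share grows with the size of its update and shrinks with the resources it consumes.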

Algorithm 1 Client Selection Algorithm

Input: Status of Clients, Reward Status;
Output: Selecting the clients for participation

1. Initialize the Model [S = 0, R = 0] //S = Set of Clients, R = Rewards of Clients
2. S = ++; R = ++; // Status information of Clients
3. for S_{i} and R_{i} ranges to do [Threshold value for Clients and Rewards]
4.     S_{i} updates (S_{i}++)
5.     R_{i} updates (R_{i}++)
6. End for
7. Analysis and setting of the threshold value of participation
8. R_{i} = R_{q} (R_{q} is the Qualified Clients)
9. for i in ranges to do
10.     if (S_{i} and S_{n} > threshold Value) then
11.         R_{q} = marked Q (Q-Qualified Devices)
12.     End if
13. End for
14. Qualified Devices for Participation
15. Q_{s} = Best Q Devices
17. Q_{s} = Best Q_{s} (0: f*n)
18. Return to R_{i} and S_{i}
19. End
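The qualification and ranking steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `ClientStatus` fields, the separate status and reward thresholds, and ranking qualified clients by reward are all hypothetical readings of steps 9 to 17.

```python
from dataclasses import dataclass

@dataclass
class ClientStatus:
    client_id: str
    status_score: float   # S_i: hypothetical availability/participation score
    reward: float         # R_i: reward accumulated so far

def select_clients(clients, status_threshold, reward_threshold, fraction):
    """Return the ids of the best qualified clients.

    Steps 9-13: keep clients whose status and reward exceed the thresholds.
    Steps 15-17: rank the qualified clients by reward and keep at most
    fraction * n of them, where n is the total number of clients.
    """
    qualified = [
        c for c in clients
        if c.status_score > status_threshold and c.reward > reward_threshold
    ]
    qualified.sort(key=lambda c: c.reward, reverse=True)
    limit = max(1, int(fraction * len(clients)))
    return [c.client_id for c in qualified[:limit]]

clients = [
    ClientStatus("delhi", 0.9, 40.0),
    ClientStatus("noida", 0.8, 25.0),
    ClientStatus("faridabad", 0.3, 30.0),   # fails the status threshold
    ClientStatus("nagaland", 0.7, 5.0),     # fails the reward threshold
    ClientStatus("oragadam", 0.95, 35.0),
]
selected = select_clients(clients, status_threshold=0.5,
                          reward_threshold=10.0, fraction=0.4)
print(selected)
```

With these hypothetical values, three clients qualify and the two with the highest rewards are kept, because `fraction * n` is 2.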

3.1.2. Aggregation

The second part of the framework is the aggregation model. The aggregation model is classified into two types: local and global. In this work, both were used: local aggregation was used to obtain updates from one particular revision of information, whereas global aggregation was used to receive updates from multiple locations. This work required aggregation of a dynamic nature. Dynamic aggregation algorithms such as FedProx and FedNova support heterogeneous data and dynamic participants, while FedOpt, Adaptive FedAvg, and robust aggregation algorithms help improve data quality and client reliability. In this framework, a modified FedProx algorithm was used for aggregation. FedProx [41] combines FedAvg with a proximal term applied to local client updates, which supports dynamic changes and the heterogeneity of local clients. The modified FedProx uses FedAvg [42], the proximal term, and the normalization of the average of inconsistency. The basic global model was updated using FedAvg. The Federated Averaging (FA) representation is as follows.

$FA = -\sum_{i=1}^{n} S_{i} \cdot \gamma \sum_{k=i}^{r-1} g_{i}(C_{i}^{(t,k)})$ (6)

where $C$ denotes the client model for the $k$th to $t$th communications, $\gamma$ denotes learning rates, $g_{i}$ is the stochastic gradient, and $S_{i}$ denotes the sample size. For consistency of averaging, the proximal term was used, and with it the average of single and multi-location data was determined. The local and global aggregation with geometric regions is represented using the proximal term. The basic n-dimensional space representation is as follows.

$(x_{i}, y_{i}), \dots, (x_{t+1}, y_{t+1}), (x_{p}, y_{p}), \dots, (x_{p+1}, y_{p+1})$ (7)

where $x$ and $y$ represent positive and negative inputs. The positive and negative aggregation of the location are represented as $A^{+}$ and $A^{-}$. Positive aggregation is used to update the changes in the number of clients at a particular threshold time from the number of locations or clients. Conversely, when there is no update in the number of clients within the threshold time, the aggregation is negative. The approximation of the positive and negative representations is as follows.

$A^{+}: w \cdot x_{0}^{t} + P^{+}$ and $A^{-}: w \cdot y_{0}^{t} + p^{-}$ (8)

The new features or samples within new or closer parameters ($N$) are represented as follows:

$N = w \cdot x_{0}^{t} + P_{j} * (p + q)$ (9)

where $(p + q)$ denotes the samples, $p$ denotes old samples, $P_{j}$ denotes new similar samples, and $q$ denotes new similar samples. $x_{0}^{t}$ denotes updating of the positive input in the time interval between 0 and $t$. The decentralized aggregation representation of global aggregation is as follows:

$U^{+} = C + 1 \frac{e \gamma}{(w_{gn}^{t} - w_{k}^{t})}$ (10)

The decentralized aggregation was updated based on the threshold and user or administrator requirements. Using the FedProx algorithm, multiple client models exchanged updates. New raw or positive input prediction parameters were exchanged, and based on the threshold value, all eligible clients participated in the aggregation process. In the proposed framework, the aggregation ensured that all the above-mentioned entities were updated from the client models to the global model.
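The two standard building blocks named in this subsection can be sketched as follows. This is not the authors' modified FedProx: it shows only sample-size-weighted averaging (FedAvg) and the standard FedProx local objective, $F_{k}(w) + \frac{\mu}{2} \| w - w^{t} \|^{2}$. The quadratic client losses, the values of `lr` and `mu`, and all function names are hypothetical.

```python
import numpy as np

def fedavg(client_weights, sample_sizes):
    """Sample-size-weighted average of client weight vectors (FedAvg)."""
    sizes = np.asarray(sample_sizes, dtype=float)
    stacked = np.stack(client_weights)            # shape: (clients, params)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

def fedprox_local_step(w, w_global, grad_fn, lr=0.1, mu=0.01):
    """One local gradient step on F_k(w) + (mu / 2) * ||w - w_global||^2.

    The extra mu * (w - w_global) term is the gradient of the proximal
    penalty, which pulls the local model toward the current global model.
    """
    return w - lr * (grad_fn(w) + mu * (w - w_global))

# Toy setting: client k minimises 0.5 * ||w - target_k||^2.
targets = [np.array([1.0, 0.0]), np.array([3.0, 2.0])]
sizes = [100, 300]
w_global = np.zeros(2)
for _ in range(50):                                # communication rounds
    local_models = []
    for target in targets:
        w = w_global.copy()
        for _ in range(5):                         # local steps per round
            w = fedprox_local_step(w, w_global, lambda v, t=target: v - t)
        local_models.append(w)
    w_global = fedavg(local_models, sizes)
print(w_global)
```

In this toy problem the global model settles at the sample-weighted mean of the client targets; a larger `mu` keeps local models closer to the global model between rounds, which is the property that makes the proximal term useful with heterogeneous clients.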

3.2. Pollution Prediction Model

The proposed model for predicting PM2.5, PM10, and AQI used VGG19, and an explainable AI (XAI) model was used to understand the model predictions. Transfer learning was used to accelerate the prediction performance. Several well-known models were examined for the purpose of prediction, such as ResNet50, InceptionV3, VGG16, and MobileNetV3. Previously, a combination of VGG16 with ReLU was used for prediction; however, this yielded an accuracy of only 76.92%. In this work, VGG19 was used with transfer learning for better accuracy and XAI for better prediction understanding. Figure 2 represents the pollution prediction model, and the architecture consists of fully connected layers with updated flattening layers. Alongside the model, a regression model was also used to estimate PM2.5 and PM10. The three fully connected layers consisted of 512, 256, and 128 neurons, respectively. All three fully connected layers initially used ReLU for the simplicity and effectiveness of the model. Subsequently, Leaky ReLU was used, rather than simple ReLU, for better accuracy.
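The fully connected head described above can be sketched as a NumPy forward pass. This is an illustration only: the weights are random rather than trained, the input is a random stand-in for flattened VGG19 features (7 × 7 × 512 values for a 224 × 224 image), and the two-value regression output and the negative slope of 0.01 are assumptions.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return np.where(x > 0, x, negative_slope * x)

def init_head(feature_dim, rng, hidden=(512, 256, 128), outputs=2):
    """Random dense layers: feature_dim -> 512 -> 256 -> 128 -> outputs."""
    dims = (feature_dim, *hidden, outputs)
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        scale = np.float32(np.sqrt(2.0 / d_in))
        weights = rng.standard_normal((d_in, d_out), dtype=np.float32) * scale
        layers.append((weights, np.zeros(d_out, dtype=np.float32)))
    return layers

def head_forward(features, layers):
    """Leaky ReLU on the hidden layers, linear output for the regression."""
    x = features
    for weights, bias in layers[:-1]:
        x = leaky_relu(x @ weights + bias)
    weights, bias = layers[-1]
    return x @ weights + bias

rng = np.random.default_rng(0)
feature_dim = 7 * 7 * 512          # flattened convolutional features
layers = init_head(feature_dim, rng)
batch = rng.standard_normal((16, feature_dim), dtype=np.float32)  # stand-in features
predictions = head_forward(batch, layers)
print(predictions.shape)
```

In a real pipeline the stand-in features would come from a pretrained VGG19 backbone with its convolutional layers frozen or fine-tuned, and the head would be trained on the labeled images.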

The second part of the proposed work consists of the model’s understanding of the prediction and explanation of the outcomes. Different methods, such as LIME, SHAP, and Grad-CAM, can be used to explain the outcomes. This work used LIME because it is more usable and has better prediction readability. Figure 3 represents the LIME-based explainable model for the interpretation of pollution prediction, and the step-by-step process of understanding the pollution prediction is also presented. Different factors, such as traffic density, wind speed, cloud movement, industrial emissions, and precipitation, were used to predict and explain the pollution levels. These factors were used for the predictions and to explore the spikes and the reasons behind them.
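The idea behind a LIME explanation of an image prediction can be sketched without the `lime` package. The version below is deliberately simplified and is not the authors' pipeline: square grid cells stand in for superpixels, hidden cells are set to zero, and a weighted least-squares fit replaces the regularized surrogate model. The function name, the kernel width, and the toy model are hypothetical.

```python
import numpy as np

def lime_style_explanation(image, predict_fn, grid=4, num_samples=300, seed=0):
    """Estimate the importance of each grid cell for a scalar prediction."""
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    cell_h, cell_w = height // grid, width // grid
    n_cells = grid * grid

    # Each row is a random on/off pattern over the grid cells.
    masks = rng.integers(0, 2, size=(num_samples, n_cells))
    masks[0] = 1                                   # keep the unperturbed image
    outputs = np.empty(num_samples)
    for s, mask in enumerate(masks):
        perturbed = image.copy()
        for cell in np.flatnonzero(mask == 0):
            r, c = divmod(cell, grid)
            perturbed[r * cell_h:(r + 1) * cell_h,
                      c * cell_w:(c + 1) * cell_w] = 0
        outputs[s] = predict_fn(perturbed)

    # Samples that hide fewer cells are closer to the original and weigh more.
    distance = 1.0 - masks.mean(axis=1)
    sample_weight = np.exp(-(distance ** 2) / 0.25)

    # Weighted least squares: one coefficient per cell plus an intercept.
    design = np.hstack([masks, np.ones((num_samples, 1))])
    root_w = np.sqrt(sample_weight)[:, None]
    coef, *_ = np.linalg.lstsq(design * root_w, outputs * root_w[:, 0], rcond=None)
    return coef[:-1].reshape(grid, grid)

image = np.ones((64, 64))
# Hypothetical stand-in for a trained model: only the top-left quadrant matters.
toy_model = lambda img: float(img[:32, :32].mean())
importance = lime_style_explanation(image, toy_model)
print(np.round(importance, 2))
```

Because the toy model depends only on the top-left quadrant, the four cells covering that quadrant receive non-zero coefficients and the rest are close to zero, which is the kind of region-level attribution LIME provides for a real image model.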

3.3. Pollution and Impact Analysis Model

Building an impact analysis using causal inference is an effective way of predicting the impact of pollution on nature and living organisms. This inference model helps determine how pollution affects health. The impact analysis model used here consisted of input and processing layers, as shown in Figure 4. The input layer used a pollution dataset, geographical data, health mapping data, and age group ranges. The processing unit consisted of a prediction and estimation model based on VGG-16 and Leaky ReLU. The identification and detection of the pollution level was extracted from the model, and the pollution inputs were provided using propensity score matching (PSM). PSM matches regions or individual characteristics that are similar to the baseline. The outcome of the prediction model was composed of baseline health data.

The processing steps of PSM are as follows [43]. The propensity score $e(X)$ is the probability of receiving the treatment conditional on the covariates $X$:

$e(X) = P(T = 1 \mid X)$ (11)

where $T = 1$ if the unit is treated and $T = 0$ otherwise. The propensity score can be estimated as

$e(X) = \frac{1}{1 + \exp(-(\beta_{0} + \beta_{1} x_{1} + \dots + \beta_{k} x_{k}))}$ (12)

The match for unit $i$ among the control units (Cont) is denoted as

$j(i) = \arg\min_{j \in \mathrm{Cont}} | e(X_{i}) - e(X_{j}) |$ (13)

Based on the above equation, the propensity score is matched.
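Equations (11) to (13) can be illustrated on synthetic data. The sketch below is not the authors' analysis: the data are simulated, the covariate stands for something like standardized age, the "treatment" stands for high pollution exposure, the true effect is fixed at 2.0 by construction, and all function names are hypothetical. The propensity score is fitted with plain gradient descent on a logistic model, and each treated unit is matched to its nearest control, with replacement.

```python
import numpy as np

def fit_propensity(X, T, lr=0.5, steps=3000):
    """Logistic regression by gradient descent; returns e(X) as in Eq. (12)."""
    design = np.hstack([np.ones((len(X), 1)), X])        # column for beta_0
    beta = np.zeros(design.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        beta -= lr * (design.T @ (p - T)) / len(T)
    return 1.0 / (1.0 + np.exp(-design @ beta))

def match_controls(scores, T):
    """Eq. (13): for each treated unit, the control with the closest score."""
    treated = np.flatnonzero(T == 1)
    controls = np.flatnonzero(T == 0)
    gaps = np.abs(scores[treated][:, None] - scores[controls][None, :])
    return treated, controls[gaps.argmin(axis=1)]

def att(Y, treated, matched):
    """Average treatment effect on the treated from the matched pairs."""
    return float(np.mean(Y[treated] - Y[matched]))

rng = np.random.default_rng(0)
n = 2000
X = rng.standard_normal((n, 1))                          # confounder
T = (rng.random(n) < 1 / (1 + np.exp(-0.8 * X[:, 0]))).astype(int)
Y = 2.0 * T + X[:, 0] + 0.1 * rng.standard_normal(n)     # true effect = 2.0

scores = fit_propensity(X, T)
treated, matched = match_controls(scores, T)
print("naive difference:", Y[T == 1].mean() - Y[T == 0].mean())
print("matched estimate:", att(Y, treated, matched))
```

The naive difference in means mixes the effect of exposure with the effect of the confounder, whereas comparing each treated unit with a control of similar propensity score removes most of that bias in this simulation.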

The entire working process is presented in Algorithm 2, and the step-by-step process of the proposed method is as follows.

i.: The federated learning infrastructure uses client selection and local and global aggregation models.
ii.: The PM2.5, PM10, and AQI are predicted using VGG19, and the prediction model is explained using explainable AI (XAI) with LIME.
iii.: The correlation model is created and combined using the impact analysis model.
iv.: The causal inference model is used to analyze the impact of pollution.
v.: In causal inference, PSM is used to map the outcome.

The above steps were used for each image, and the input and output results were processed simultaneously. The proposed architecture predicted PM2.5, PM10, and AQI levels, and the model was explained. The expected model outcome and new features were initially transferred to the local aggregation and, once a certain threshold was reached, transferred to the global model using the global aggregation.

Algorithm 2 Federated learning-based pollution prediction and impact analysis with causal inference.

Input:

Client Data, Global Model, Factors influencing pollution, Outcome variable using PSM

Output:

Predicted Values of PM2.5, PM10 and AQI, Explanation Using LIME, Correlation of Pollution

Algorithm Begin:
{
    Create the FL infrastructure
    Identify the participating clients
    updated_model = global_model    // Clients are initialized with the updated global model
}
// Client selection process and aggregation
{
    if client_update_reward > previous_reward:    // Compare client updates to previous rewards
        communication_status = True    // Start communication for aggregation
    else:
        communication_status = False
}
{
    Initialize aggregation
    if aggregation_result == "positive":
        global_model = updated_model    // Update global model
        prediction_model = global_model    // Update prediction model
        for client in participants:
            client.model = updated_model
}
{
    Updated model is transferred to the clients
}
end
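The reward-gated round in Algorithm 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: model weights are represented as plain lists of floats, a sample-size-weighted average (FedAvg-style) stands in for the aggregation rule, and all names (`select_clients`, `aggregate`, `run_round`) and values are hypothetical.

```python
from typing import Dict, List

Weights = List[float]


def select_clients(rewards: Dict[str, float], previous: Dict[str, float]) -> List[str]:
    """Keep clients whose current reward exceeds their previous reward."""
    # A client with no recorded previous reward is always selected (assumption).
    return [cid for cid, r in rewards.items() if r > previous.get(cid, float("-inf"))]


def aggregate(updates: Dict[str, Weights], sizes: Dict[str, int]) -> Weights:
    """Sample-size-weighted average of client weight vectors (FedAvg-style stand-in)."""
    total = sum(sizes[cid] for cid in updates)
    dim = len(next(iter(updates.values())))
    return [
        sum(updates[cid][k] * sizes[cid] for cid in updates) / total
        for k in range(dim)
    ]


def run_round(global_model, updates, sizes, rewards, previous) -> Weights:
    """One round: gate clients by reward, then aggregate the selected updates."""
    selected = select_clients(rewards, previous)
    if not selected:
        return global_model  # no client improved, so the global model is kept
    return aggregate({cid: updates[cid] for cid in selected}, sizes)


# Hypothetical values, for illustration only.
updates = {"c1": [1.0, 2.0], "c2": [3.0, 4.0], "c3": [10.0, 10.0]}
sizes = {"c1": 100, "c2": 300, "c3": 50}
rewards = {"c1": 0.8, "c2": 0.9, "c3": 0.4}
previous = {"c1": 0.7, "c2": 0.85, "c3": 0.6}
new_global = run_round([0.0, 0.0], updates, sizes, rewards, previous)
print(new_global)  # [2.5, 3.5]
```

In this example, client c3 is excluded because its reward fell, so only c1 and c2 contribute to the new global model, which would then be sent back to all participants.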

4. Results and Discussion

This section presents the dataset details, the AQI classes of the dataset, the implementation details, the prediction of AQI and of particulates such as PM2.5 and PM10, an analysis of AQI in different locations, and finally, an impact analysis of the particulate predictions and their healthcare ramifications.

The dataset images were collected from multiple places in India and Nepal and depict the pollution levels recorded at those places. The images were captured using cameras at various time intervals. The dataset consists of 12,420 images, each of 224 × 224 pixels and saved in JPEG format, and is labeled according to the city and the corresponding pollution level at each time interval. Initially, the images were collected from ITO-Delhi, Greater Noida, Faridabad, Oragadam-TN, locations in Nagaland, Mumbai, and some from Biratnagar, Nepal. In this implementation, the locations used were ITO-Delhi, Greater Noida, New Industrial Town-Faridabad, Nagaland, and Oragadam-TN. The class distribution of the dataset is shown in Figure 5 [40].

The implementation environment was a GPU server with an Intel(R) 4110 CPU @ 2.10 GHz, graphics cards (GeForce RTX 2080, 11 GB), 128 GB of RAM, and a 64-bit Ubuntu operating system. Python 3.5, Keras 2.16, and TensorFlow 1.13.1 were used as the programming environment. Table 3 presents the hyperparameters of the model.

All input images were resized to 224 × 224 for VGG-19. The input images were normalized to match the pre-trained weights, and the batch size was 16 images for the training and validation of each epoch. The experiment was conducted over 100 epochs with a learning rate of 0.000005 and the Adam optimizer. The model was validated using K-fold cross-validation, and the data were divided for training, testing, and verification in a 70:20:10 ratio for better accuracy and faster convergence. The initial accuracy achieved with VGG-19 is presented in Figure 6. Over 20 epochs, the initial model achieved an accuracy of 0.92 during training, while the accuracy was seen to decrease during testing. The model was then applied in the federated learning environment across five locations (five clients) for an initial accuracy measurement, where 20 rounds produced predictions with an initial accuracy of 0.95. Compared to the previous model [8], it produced better accuracy, but at a higher computational cost.

The implemented model was evaluated using accuracy, F1-Score, R2, RMSE, and inference time. Table 4 shows the primary evaluation for 20 epochs. The average accuracy increased with the number of epochs: for 100 epochs, an accuracy of 0.95 was observed during training and 0.88 during testing, and for 200 epochs, the training and testing accuracies were 0.96 and 0.89, respectively.
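As a rough illustration of the 70:20:10 division mentioned above, the following Python sketch splits a list of image identifiers into training, testing, and verification subsets. The function name, the fixed seed, and the use of integer identifiers as stand-ins for image files are assumptions made for the example; the text does not describe how the split was actually drawn.

```python
import random
from typing import List, Sequence, Tuple


def split_70_20_10(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle the items and split them into 70% / 20% / 10% subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_test = n * 20 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    verification = shuffled[n_train + n_test:]  # the remainder, about 10%
    return train, test, verification


# Integer stand-ins for the 12,420 images reported for the dataset.
image_ids = list(range(12420))
train, test, verification = split_70_20_10(image_ids)
print(len(train), len(test), len(verification))  # 8694 2484 1242
```

For the reported dataset size, this gives 8694 training, 2484 testing, and 1242 verification images.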

The model accuracy without and with federated learning is presented in Figure 6 and Figure 7, respectively, which clearly show the difference in accuracy between the initial model and the model with federated learning. VGG-19, when applied in the distributed environment across five clients with 20 rounds of aggregation, showed increased accuracy. The reward-based client selection helped to increase the accuracy and improve the data on the client side, and reduced the energy overhead between the clients and the global model. During implementation, updates were sent from a client to the global model only when a modification occurred on the client side.

The base model achieved a training accuracy of 88.5% and a testing accuracy of 85.5%. In contrast, the use of federated learning with 15 rounds increased the accuracy to 92.33%. The accuracy of the FL-based model was better than the previous methods because of the datasets and randomized changes to the datasets in the client model.

Results Comparison: The predicted output was compared with results reported in the literature on image-based air quality and PM prediction. The comparison used training accuracy, testing accuracy, R-Square, and RMSE; a further comparison was based on the accuracy of the air quality index prediction. Training accuracy evaluates the model's performance on the same dataset it was trained on, indicating how well the model has learned the patterns in the data. Testing accuracy assesses the model's performance on new and unobserved data, evaluating its generalization capability. RMSE measures the average magnitude of the prediction errors as the square root of the mean squared difference between the predicted and actual values. R-Square indicates the proportion of variance in the dependent variable that can be explained by the independent variables. Table 5 compares the R-Square and RMSE values of previous state-of-the-art models from various studies [44,45,46] with those of the initial model and the federated learning-based model of this study. Compared to the previous methods, the proposed framework was seen to produce better accuracy because of factors such as better air quality parameters, color and texture properties, exact mapping of features, and high image resolution.
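The two regression metrics described above can be computed as in the following Python sketch. It uses the standard definitions of RMSE and R-Square; the sample values are hypothetical and are not taken from Table 5.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Proportion of variance explained: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical PM values, for illustration only.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
print(rmse(actual, predicted))       # 10.0
print(r_squared(actual, predicted))  # approximately 0.968
```

A lower RMSE and an R-Square closer to 1 both indicate predictions that track the measured values more closely.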

Apart from the above-mentioned reasons, the accuracy of air quality prediction was seen to increase due to the number of high-resolution images, the collection of numerous images from different locations, and the time interval values. Table 6 shows the increased accuracy of the proposed method compared to results on other datasets such as NWNU-AQI [47], Linyuan-AQI [48], and Beijing [44], due to the rich set of parameters used. The accuracy (85.5%) of the proposed base model (VGG-19) was found to be 9% better than that of the similar previous model (VGG-16) explored by Sapdo Utomo et al. [8], and that of the FL-based framework was found to be 17% better; this was due to higher-quality datasets, dynamic changes in the datasets, and the sharing of datasets in a distributed way.

Particulate Matter Prediction: The proposed work also predicted the presence of each kind of pollutant matter across different locations and clients of the federated model. Table 7, Table 8 and Table 9 show the predictions of particulate matter. Table 7 demonstrates the prediction of AQI, PM2.5, and PM10 and the corresponding accuracy. Initially, the AQI, PM2.5, and PM10 levels were predicted using the VGG-19 model with and without federated learning. The prediction value was mapped as an average of all types of particulate matter. Based on this, the AQI, PM2.5, and PM10 accuracy was calculated and the accuracy of the PM2.5 and PM10 values was found to be similar to the AQI accuracy. Compared to the base model, the federated learning-based prediction model showed better AQI, PM2.5, and PM10 accuracy.

Table 8 shows the PM predictions for different locations: Knowledge Park-III-Greater Noida, ITO-Delhi, and New Industrial Town, Faridabad. For verification, the predicted values were mapped to the sensor-based data for the corresponding month and timeline. Due to air movements and vehicle density, the AQI and PM levels varied across locations. The AQI level of ITO-Delhi was very high due to increased vehicle movement and limited air movement.

Similarly, Table 9 presents the corresponding AQI, PM2.5, and PM10 levels predicted across different clients using federated learning. Each client was mapped with one location, and the corresponding AQI and PM prediction in the concerned location along with the sensor-based prediction were also mapped. Clients were mapped to different places such as Knowledge Park-III-Greater Noida, ITO-Delhi, New Ind town-Faridabad, Oragadam-Tamil Nadu, and some locations of Mumbai. Each client was selected based on client density, update frequency, and locally diverse data, which were compared using different thresholds. All these parameters were considered as client rewards, and based on these rewards, clients participated in the selection process. The prediction accuracy achieved was found to be more than 93% during the validation stage.

4.1. Explainability of Model Predictions

For the model explanation, the India and Nepal datasets were used along with the Beijing dataset [44], with 100 images. For comparison purposes, the same images that were used with the previous VGG-16 model were also used with the proposed VGG-19 model. The pre-processing of the images used in VGG-19 resulted in increased accuracy and better prediction of AQI levels. LIME explained the PM levels using the visual features of the images, such as cloud patterns and color variations, which are the most influential factors in the model's prediction and classification. Changes were predicted based on these features, the image superpixels, and observations. For example, for a good-quality evening image, the actual PM level is 11.0 and the expected value is 9.9; the accuracy varies slightly due to the timing and brightness features. Similar variations between the original images and the LIME predictions have been noted in the other categories. The differences between the true values and the LIME predictions of PM2.5 are presented in Figure 8. The sum and the average of the true values across classes such as good, moderate, worst, unhealthy, and very unhealthy were 365.3 and 73.6, respectively. Similarly, the sum and the average of the values predicted using LIME across the different classes were 347.9 and 69.58, respectively. Therefore, the accuracy of the prediction was observed to be 95.23%. In summary, compared to the previous model [8], the proposed VGG-19-based prediction model with federated learning infrastructure was seen to produce better results in terms of prediction accuracy.
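The text does not state how the 95.23% figure was derived. One reading that is consistent with the reported sums is the ratio of the LIME-predicted total to the true total, sketched below in Python; the function name `relative_agreement` is hypothetical.

```python
def relative_agreement(true_total: float, predicted_total: float) -> float:
    """Predicted total as a percentage of the true total."""
    return predicted_total / true_total * 100.0


true_sum = 365.3  # sum of true PM2.5 values across the classes (from the text)
lime_sum = 347.9  # sum of LIME-predicted values across the classes (from the text)
agreement = relative_agreement(true_sum, lime_sum)
print(f"{agreement:.1f}%")  # 95.2%
```

Under this assumption the ratio is approximately 95.2%, in line with the value reported above.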

4.2. Impact and Health Analysis of PM2.5 and PM10 Model

The main advantage and contribution of the proposed work is an impact analysis using the PM2.5 and PM10 predictions. The impact analysis of PM2.5 and PM10 undertaken with respect to children and the elderly is presented in Table 10. The working process of the impact analysis model is presented in Figure 4. The prepared dataset was analyzed using images and embedded data, such as PM2.5 and PM10 values, in each dataset field. For training and testing purposes, a total of 100 images were used for the impact analysis. The identified PM levels were then further mapped for impact identification, as illustrated in Figure 4.

For mapping purposes, causal inference techniques were used, and high levels of PM2.5 and PM10 were found to be linked to significant health factors. As per the analysis, an increase in PM2.5 levels was accompanied by an increased risk of hospitalization for respiratory conditions and, under long-term exposure, increased cardiovascular risk. Similarly, an increase in PM10 levels was accompanied by increased episodes of asthma, COPD, and chronic diseases. Figure 9 illustrates the training and testing accuracies for PM2.5 and PM10 based on the impact analysis and experimental results. The training and testing accuracies for PM2.5 were 84% and 72%, respectively, and those for PM10 were 79% and 74%, respectively. In this work, the impact analysis of image-based predictions used relatively few datasets, from India and China. The impact analysis of the predicted PM2.5 and PM10 values was mapped to various locations. When the locations were changed, the accuracy of the impact analysis was seen to decrease. Therefore, travel distance and the mapping of predictions are essential issues that require further consideration.

Unlike traditional machine learning models, the proposed framework was able to accurately predict results across various locations without requiring local data sharing. The integration of VGG-19 and a causal inference approach has facilitated accurate predictions and a comprehensive health impact analysis; therefore, implementing this framework on a large scale would enable robust results in both predictive air pollution monitoring and impact analysis. It can be postulated that this research has helped bridge the gap between the applied and technical domains in the health sector, seamlessly integrating multiple disciplines to produce impactful research outcomes. Last but not least, this interdomain research can help support policy decisions based on the ramifications of the health impact analysis, help build pollution detection and planning guidelines based on geographical regions, and generate actionable insights to prevent avoidable health issues and monitor the environment at the same time.

5. Conclusions

In recent years, air pollution has become the cause of many diseases, especially in vulnerable population groups like children and the elderly. An effective air quality prediction and recommendation system is the need of the hour and is necessary for creating public health awareness. This was a primary goal behind the introduction of this federated learning-based framework for pollution prediction and analysis of its impact on health across multiple locations. The proposed federated framework used reward-based client selection and the FedProx aggregation method for global aggregation, after updates were initially aggregated locally. The proposed framework used the VGG-19 model for image-based pollution prediction, the LIME tool for better explanation, and causal inference for the impact analysis of the predicted PM2.5 and PM10 levels. The proposed framework was observed to produce a 92.33% prediction accuracy, and the causal inference-based impact analysis model was seen to produce an accuracy of 84% for training and 72% for testing in terms of PM2.5, and an accuracy of 79% for training and 74% for testing in terms of PM10. This impact analysis similarly predicted the possible occurrence of diseases and presented the impact thereof. Compared to previous state-of-the-art studies, the proposed federated learning method was seen to produce better accuracy. In the future, a better impact analysis model can be designed to improve the overall accuracy and to better accommodate location-based and other dynamic changes; this can be investigated in tandem with an improved hybrid model enabling the combination of prediction of PMs, correlation models, and impact analysis results.

Author Contributions

Conceptualization, J.A. and S.B.; Methodology, J.A.; Software, J.A.; Investigation, J.A.; Resources, S.B.; Writing—review & editing, J.A. and S.B.; Funding acquisition, J.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in the study are available to other authors who require access to this material.

Acknowledgments

The authors are grateful to Galgotias University, Greater Noida, India.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Messan, S.; Shahud, A.; Anis, A.; Kalam, R.; Ali, S.; Aslam, M.I. Air-MIT: Air Quality Monitoring Using Internet of Things. Eng. Proc. 2022, 20, 45.
2. Beriwal, S.; John, A. A review of Various Techniques for Forecasting Pollution and Air Quality Indexing. In Proceedings of the 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 11–13 November 2021; pp. 1680–1686.
3. Gilik, A.; Ogrenci, A.S.; Ozmen, A. Air quality prediction using CNN+LSTM-based hybrid deep learning architecture. Environ. Sci. Pollut. Res. 2022, 29, 11920–11938.
4. Xing, Y.-F.; Xu, Y.-H.; Shi, M.-H.; Lian, Y.-X. The impact of PM2.5 on the human respiratory system. J. Thorac. Dis. 2016, 8, E69.
5. Available online: https://www.marlborough.govt.nz/environment/air-quality/smoke-and-smog/health-effects-of-pm10 (accessed on 14 January 2025).
6. Siddique, S.; Ray, M.R.; Lahiri, T. Effects of air pollution on the respiratory health of children: A study in the capital city of India. Air Qual. Atmosphere Health 2011, 4, 95–102.
7. Utomo, S.; John, A.; Rouniyar, A.; Hsu, H.-C.; Hsiung, P.-A. Federated Trustworthy AI Architecture for Smart Cities. In Proceedings of the 2022 IEEE International Smart Cities Conference (ISC2), Pafos, Cyprus, 26–29 September 2022; pp. 1–7.
8. Utomo, S.; John, A.; Pratap, A.; Jiang, Z.-S.; Karthikeyan, P.; Hsiung, P.-A. AIX Implementation in Image-Based PM2.5 Estimation: Toward an AI Model for Better Understanding. In Proceedings of the 2023 15th International Conference on Knowledge and Smart Technology (KST), Phuket, Thailand, 21–24 February 2023; pp. 1–6.
9. McGovern, A.; Ebert-Uphoff, I.; Gagne, D.J.; Bostrom, A. Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environ. Data Sci. 2022, 1, e6.
10. Yan, R.; Liao, J.; Yang, J.; Sun, W.; Nong, M.; Li, F. Multi-hour and multi-site air quality index forecasting in Beijing using CNN, LSTM, CNN-LSTM, and spatiotemporal clustering. Expert Syst. Appl. 2021, 169, 114513.
11. Du, S.; Li, T.; Yang, Y.; Horng, S.J. Deep Air Quality Forecasting Using Hybrid Deep Learning Framework. IEEE Trans. Knowl. Data Eng. 2019, 33, 2412–2424.
12. Zhao, Z.; Qin, J.; He, Z.; Li, H.; Yang, Y.; Zhang, R. Combining forward with recurrent neural networks for hourly air quality prediction in Northwest of China. Environ. Sci. Pollut. Res. 2020, 27, 28931–28948.
13. Samal, K.K.R.; Babu, K.S.; Das, S.K. Time Series Forecasting of Air Pollution using Deep Neural Network with Multi-output Learning. In Proceedings of the 2021 IEEE 18th India Council International Conference (INDICON), Guwahati, India, 19–21 December 2021.
14. Zou, G.; Zhang, B.; Yong, R.; Qin, D.; Zhao, Q. FDN-learning: Urban PM2.5-concentration Spatial Correlation Prediction Model Based on Fusion Deep Neural Network. Big Data Res. 2021, 26, 100269.
15. Zhou, Y.; Chang, F.-J.; Chang, L.-C.; Kao, I.-F.; Wang, Y.-S. Explore a deep learning multi-output neural network for regional multi-step-ahead air quality forecasts. J. Clean. Prod. 2018, 209, 134–145.
16. Beriwal, S.; A, J.; K, S.K. Spatial and Temporal based Pollution Forecasting using Hybrid Model. In Proceedings of the 2022 International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 9–11 May 2022; pp. 991–998.
17. Huang, Y.; Ying, J.J.-C.; Tseng, V.S. Spatio-attention embedded recurrent neural network for air quality prediction. Knowledge-Based Syst. 2021, 233, 107416.
18. Zhang, K.; Thé, J.; Xie, G.; Yu, H. Multi-step ahead forecasting of regional air quality using spatial-temporal deep neural networks: A case study of Huaihai Economic Zone. J. Clean. Prod. 2020, 277, 123231.
19. Zhang, Q.; Han, Y.; Li, V.O.K.; Lam, J.C.K. Deep-AIR: A Hybrid CNN-LSTM Framework for Fine-Grained Air Pollution Estimation and Forecast in Metropolitan Cities. IEEE Access 2022, 10, 55818–55841.
20. Wang, Q.; Liu, Y.; Pan, X. Atmosphere pollutants and mortality rate of respiratory diseases in Beijing. Sci. Total Environ. 2008, 391, 143–148.
21. Abe, K.C.; Miraglia, S.G.E.K. Health impact assessment of air pollution in São Paulo, Brazil. Int. J. Environ. Res. Public Health 2016, 13, 694.
22. Olstrup, H. An Air Quality Health Index (AQHI) with Different Health Outcomes Based on the Air Pollution Concentrations in Stockholm during the Period of 2015–2017. Atmosphere 2020, 11, 192.
23. Xu, K.; Cui, K.; Young, L.-H.; Wang, Y.-F.; Hsieh, Y.-K.; Wan, S.; Zhang, J. Air Quality Index, Indicatory Air Pollutants and Impact of COVID-19 Event on the Air Quality near Central China. Aerosol Air Qual. Res. 2020, 20, 1204–1221.
24. Abelsohn, A.; Stieb, D.M. Health effects of outdoor air pollution: Approach to counseling patients using the Air Quality Health Index. Can. Fam. Physician 2011, 57, 881–887.
25. Air Quality, Health Impacts and Burden of Disease Due to Air Pollution (PM10, PM2.5, NO2 and O3): Application of AirQ+ Model to the Camp de Tarragona County (Catalonia, Spain). Available online: https://europepmc.org/article/med/31759725 (accessed on 15 January 2025).
26. Jalili, M.; Ehrampoush, M.H.; Mokhtari, M.; Ebrahimi, A.A.; Mazidi, F.; Abbasi, F.; Karimi, H. Ambient air pollution and cardiovascular disease rate an ANN modeling: Yazd-Central of Iran. Sci. Rep. 2021, 11, 16937.
27. Available online: http://aphekom.org/web/aphekom.org/home (accessed on 15 January 2025).
28. Fei, Z.; Ryeznik, Y.; Sverdlov, A.; Tan, C.W.; Wong, W.K. An overview of healthcare data analytics with applications to the COVID-19 pandemic. IEEE Trans. Big Data 2021, 8, 1463–1480.
29. Li, L.; Fan, Y.; Tse, M.; Lin, K. A review of applications in federated learning. Comput. Ind. Eng. 2020, 149, 106854.
30. Niknam, S.; Dhillon, H.S.; Reed, J.H. Federated learning for wireless communications: Motivation, opportunities, and challenges. IEEE Commun. Mag. 2020, 58, 46–51.
31. Jiang, J.C.; Kantarci, B.; Oktug, S.; Soyata, T. Federated Learning in Smart City Sensing: Challenges and Opportunities. Sensors 2020, 20, 6230.
32. Nguyen, D.-V.; Zettsu, K. Spatially-distributed Federated Learning of Convolutional Recurrent Neural Networks for Air Pollution Prediction. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 3601–3608.
33. Abimannan, S.; A, J.; Shukla, S.; Satheesh, D. Federated Learning for Improved Air Pollution Prediction: A Combined LSTM-SVR Approach. In Proceedings of the 2023 IEEE 4th Annual Flagship India Council International Subsections Conference (INDISCON), Mysore, India, 5–7 August 2023; pp. 1–7.
34. Neo, E.X.; Hasikin, K.; Mokhtar, M.I.; Lai, K.W.; Azizan, M.M.; Razak, S.A.; Hizaddin, H.F. Towards Integrated Air Pollution Monitoring and Health Impact Assessment Using Federated Learning: A Systematic Review. Front. Public Health 2022, 10, 851553.
35. Smuha, N.A. The EU Approach to Ethics Guidelines for Trustworthy Artificial Intelligence. Comput. Law Rev. Int. 2019, 20, 97–106.
36. Liu, H.; Wang, Y.; Fan, W.; Liu, X.; Li, Y.; Jain, S.; Tang, J. Trustworthy ai: A computational perspective. ACM Trans. Intell. Syst. Technol. 2022, 14, 1–59.
37. Ho, C.W.L.; Ali, J.; Caals, K. Ensuring trustworthy use of artificial intelligence and big data analytics in health insurance. Bull. World Health Organ. 2020, 98, 263.
38. Putra, M.A.P.; Karna, N.; Alief, R.N.; Zainudin, A.; Kim, D.-S.; Lee, J.-M.; Sampedro, G.A. PureFed: An Efficient Collaborative and Trustworthy Federated Learning Framework Based on Blockchain Network. IEEE Access 2024, 1, 82413–82426.
39. Lee, W. Reward-based participant selection for improving federated reinforcement learning. ICT Express 2022, 9, 803–808.
40. Rouniyar, A.; Utomo, S.; John, A.; Hsiung, P.A. Air Pollution Image Dataset from India and Nepal. 2023. Available online: https://www.kaggle.com/datasets/adarshrouniyar/air-pollution-image-dataset-from-india-and-nepal (accessed on 15 January 2025).
41. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
42. Ayeelyan, J.; Utomo, S.; Rouniyar, A.; Hsu, H.C.; Hsiung, P.A. Federated learning design and functional models: Survey. Artif. Intell. Rev. 2025, 58, 21.
43. Li, M. Using the propensity score method to estimate causal effects: A review and practical guide. Organ. Res. Methods 2013, 16, 188–226.
44. Liu, C.; Tsow, F.; Zou, Y.; Tao, N. Particle Pollution Estimation Based on Image Analysis. PLoS ONE 2016, 11, e0145955.
45. Bo, Q.; Yang, W.; Rijal, N.; Xie, Y.; Feng, J.; Zhang, J. Particle Pollution Estimation from Images Using Convolutional Neural Network and Weather Features. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 3433–3437.
46. Wang, X.; Wang, M.; Liu, X.; Zhang, X.; Li, R. A PM2.5 concentration estimation method based on multi-feature combination of image patches. Environ. Res. 2022, 211, 113051.
47. Zhang, Q.; Fu, F.; Tian, R. A deep learning and image-based model for air quality estimation. Sci. Total. Environ. 2020, 724, 138178.
48. APCI. Available online: https://data.moenv.gov.tw/en/dataset/detail/aqx_p_488 (accessed on 10 December 2024).
49. Zhang, Q.; Tian, L.; Fu, F.; Wu, H.; Wei, W.; Liu, X. Real-Time and Image-Based AQI Estimation Based on Deep Learning. Adv. Simul. 2022, 5, 2100628.
50. Kow, P.-Y.; Hsia, I.-W.; Chang, L.-C.; Chang, F.-J. Real-time image-based air quality estimation by deep learning neural networks. J. Environ. Manag. 2022, 307, 114560.

Figure 1. Federated learning-based framework for pollution prediction and impact analysis.

Figure 2. Pollution prediction model using images.

Figure 3. Explainable model for pollution prediction.

Figure 4. Pollution prediction and impact analysis.

Figure 5. Dataset distribution.

Figure 6. Model initial accuracy.

Figure 7. Model accuracy using federated learning.

Figure 8. Difference between true and actual prediction of PM2.5 using LIME.

Figure 9. Impact analysis of PM2.5 and PM10.

Table 1. Pollutants and impact analysis.

S. No.	Pollutants	Impact of Pollutants
1	PM2.5	Asthma, respiratory inflammation, jeopardized lung function, and lung cancers [4].
2	PM10	Asthma, bronchitis, high blood pressure, heart attacks, and strokes [5].

Table 2. Impact analysis case study.

Participants	Affected by Respiratory Problem
11,628 (7757 Boys and 3871 Girls) School Children	4536 (2950 Boys and 1586 Girls) [4]

Table 3. Hyperparameters of model.

S. No	Hyperparameters of Model	Options
1	Image Size	224 × 224
2	Batch Size	16
3	Epochs	20
4	Optimizer	Adam
5	Dropout	0.5
6	Filter Size	64,128,256,512
7	Global Server	1
8	Federated Client	5

Table 4. Basic evaluation metrics and predictions.

Parameters	Training Prediction	Testing Prediction
Epochs	20	20
RMSE	1.86	15.06
R2	0.87	0.83
F1-Score	0.67	0.87
Accuracy	0.92	0.84
Inference Time	0.180 ms	0.250 ms

Table 5. Comparison of models using R-Square and RMSE metrics.

Reference Number and Year	R-Square	RMSE
Liu et al. (2016) [44]	0.68	40.43
Bo et al. (2018) [45]	0.60	56.03
Wang et al. (2022) [46]	0.80	33.07
Sapdo Utomo et al. (2022) [8]	0.83	30.10
Our Model without FL	0.84	28.10
Our Model with FL	0.85	25.02

Table 6. Comparison of image-based air quality prediction accuracy.

Reference Number and Year	Dataset Details	Accuracy
Zhang and Y. Zou et al. (2020) [47]	NWNU-AQI [47]	74.00
Zhang and Y. Xie et al. (2022) [49]	NWNU-AQI [47]	75.15
Kow and X. Zhang et al. (2022) [50]	Linyuan [48]	76.00
Sapdo Utomo et al. (2022) [8]	Beijing [44]	76.15
Our model without FL	Indian and Nepal	85.15
Our Model with FL	Indian and Nepal	92.22

Table 7. AQI, PM2.5, and PM10 prediction and accuracy.

Model	AQI	PM2.5	PM10
Model Training Accuracy	167	142	145
Model Prediction without FL	143	110	121
Model with FL	150	123	128
Accuracy of Model without FL	85.6	84.01	83.12
Accuracy of Model with FL	92.2	91.0	93.12

Table 8. Prediction of AQI, PM2.5, and PM10 in different locations using the base model.

Model	AQI	PM2.5	PM10
Knowledge Park-III-Greater Noida	210	194	223
ITO-Delhi	240	170	130
New Ind town—Faridabad	230	188	190

Table 9. Prediction of AQI, PM2.5, and PM10 in different locations using federated learning in different clients.

Pollutants	Client-1	Client-2	Client-3	Client-4	Client-5
AQI	225	248	240	180	210
PM2.5	185	248	190	160	190
PM10	240	140	198	150	175

Table 10. Impact analysis of PM2.5 and PM10 on children and the elderly.

Pollutant	Common Impact	Impact on Children	Impact on Elderly
PM2.5	Asthma exacerbations, bronchitis, reduced lung function, heart attacks	Asthma attacks, bronchitis, reduced immune efficiency	Heart attacks, arrhythmia, and reduced oxygen supply
PM10	Coughing, wheezing, cardiovascular stress	Coughing, wheezing, and asthma attacks	Bronchitis symptoms and cardiovascular stress

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

