Assessment of Fine Particulate Matter for Port City of Eastern Peninsular India Using Gradient Boosting Machine Learning Model

Abstract: An assessment and prediction of PM2.5 for a port city of eastern peninsular India is presented. Fifteen machine learning (ML) regression models were trained, tested and implemented to predict the PM2.5 concentration. The predictive ability of the regression models was validated using air pollutants and meteorological parameters as input variables, collected from sites located at Visakhapatnam, a port city on the eastern side of peninsular India, for the assessment period 2018-2019. The air pollutants and meteorological parameters that were highly correlated with PM2.5 concentration during the period under study were evaluated and presented. It was found that the CatBoost regression model outperformed all other employed regression models in predicting PM2.5 concentration, with an R² score (coefficient of determination) of 0.81, median absolute error (MedAE) of 6.95 µg/m³, mean absolute percentage error (MAPE) of 0.29, root mean square error (RMSE) of 11.42 µg/m³ and mean absolute error (MAE) of 9.07 µg/m³. Prediction results for high PM2.5 concentrations, relative to Indian standards, are also presented. In-depth seasonal assessments of PM2.5 concentration are presented to show the variance in PM2.5 concentration during the dominant seasons.


Introduction
The World Health Organization (WHO) has reported that 9 out of 10 people breathe air containing high levels of pollutants, and it estimates that air pollution is responsible for approximately 7 million deaths every year [1]. The burden of ill-health is not equally distributed, as approximately two-thirds of deaths occur in the developing countries of Asia. Air pollution in Asian countries is mainly due to increasing trends in economic and social development. In India, rapidly increasing industrialization, urbanization, and demand for transportation influence air pollution in many cities [2]. The major health impacts of the air pollutant PM2.5 are shown in Figure 1, which indicates that PM2.5 is the main air pollutant responsible for the most significant health problems, viz. impacts on the central nervous system, pulmonary disease, lung cancer, cardiovascular diseases, impacts on the reproductive system, etc. [3,4]. Therefore, it is of utmost importance to determine and forecast the PM2.5 concentration with more reliable forecasting models.
Many studies have reported on pollutant forecasting models using ML approaches because it is very difficult to estimate originator pollutants with varying environmental conditions by traditional methods. ML approaches are capable of determining the causes of high pollutants, along with better forecasting of the increase and decrease in pollutants in the environment. Forecasting can help regulatory agencies or the government to control emission levels of pollutants such as nitric oxide (NO), nitrogen dioxide (NO2), nitrogen oxides (NOx), ammonia (NH3), sulphur dioxide (SO2), carbon monoxide (CO), benzene (C6H6), toluene (C7H8), etc., and alert residents to avoid outdoor activity, especially patients with respiratory problems. ML approaches are based on data collected through various sensors located in different parts of the city. ML algorithms have advanced over the past few years, and their prediction is based on the quality of the data collection, i.e., data required for training the models. In our study, we considered the data collected at Visakhapatnam, a port city on the eastern side of peninsular India, for the assessment period 2018-2019. This paper is organized as follows: Section 2 provides information about the literature review; Section 3 introduces the study area, methods of data collection, and descriptions of data and the machine learning regression model. Section 4 describes performance measurement indicators. Section 5 describes the results related to Visakhapatnam's three predominant seasons and a comparison of prediction results of fifteen ML regression models. Section 6 presents the conclusion of the proposed work.

Literature Review
A number of studies have reported the implementation of ML approaches for urban air pollution, including PM2.5 concentration, in recent years. Masood and Ahmad [5] presented a comparison of support vector machine (SVM) and artificial neural network (ANN) for the prediction of anthropogenic fine particulate matter (or PM2.5) for Delhi, based on various meteorological and pollutant parameters corresponding to a 2 year period from 2016 to 2018. ANN provided faster prediction and better accuracy, compared with SVM. Deters et al. [6] carried out research on the prediction of PM2.5 with the help of ML regression models considering selected meteorological parameters, and reported a high correlation between estimated data and real data. Moisan et al. [7] compared the performances of a dynamic multiple equation (DME) model in forecasting PM2.5 concentrations, with an ANN model and an ARIMAX model. They concluded that, although ANN in very few instances showed more significant and accurate results than the DME model, the overall performance of the DME model was slightly better than the ANN. Jiang et al. [8] reported the prediction of PM2.5 with the help of a long short-term memory (LSTM) model, with better accuracy. Suleiman et al. [9] compared three air quality control strategies, including SVM, ANN and boosted regression trees (BRT), for forecasting PM2.5 and PM10 concentrations in the roadside environment. They found that regression models with neural networks had better prediction performance, compared with SVM. A regression model which used extra trees regression and AdaBoost, for further boosting, was proposed for estimating PM2.5 for Delhi by Kumar S. et al. [10]. Regression and statistical methods were proposed by Chandu and Dasari [11] to study the causal relationship between PM2.5 and gaseous pollutants for the city of Visakhapatnam, but no forecasting model was proposed.
Atmosphere 2022, 13, 743
We chose Visakhapatnam, a port city of east peninsular India, which is surrounded on three sides by mountains and the Bay of Bengal on the fourth. This city is studded with major industries, including Hindustan Petroleum Corporation Limited (HPCL), Hindustan Zinc Limited (HZL), Bharat Heavy Plates and Vessels (BHPV), Hindustan Polymers Limited (HPL), Visakhapatnam Steel Plant (SP), Coastal Chemicals (CC), Andhra Cement Company (ACC) and Simhadri Thermal Power Corporation (STPC) [11]. Approximately 200 ancillary industries operate to supplement these main industries thereby causing air pollution, hence air quality research becomes a priority.
There are many ML regression-based models available in the literature; of these, fifteen ML regression models (presented in Section 3.3) were tested and implemented to predict the PM2.5 concentrations for Visakhapatnam. Among them, gradient boosting models (CatBoost, XGBoost and LightGBM) and voting regression models provided significant prediction results. Overall, the CatBoost model (which utilizes oblivious decision trees as the base learner), as proposed by Yandex Company [12], resulted in excellent prediction performance. The CatBoost model was able to achieve results comparable to those obtained by other state-of-the-art regression models such as RNN-LSTM [13], a multivariate linear regression model [14] and LSTM [15]. The prediction accuracy of these regression models was validated using air pollutants (NO, NO2, NOx, NH3, SO2, CO, C6H6, C7H8) and meteorological parameters (relative humidity (RH), wind speed (WS), wind direction (WD), solar radiation (SR), barometric pressure (BP) and temperature) as input variables.
This research work is different from previously reported studies on PM2.5 in multiple ways. In this study, in-depth seasonal and yearly variations in PM2.5 concentration were analyzed for two years. Fifteen machine learning methods were analyzed for optimal performance, compared with only two to three methods in most other studies, offering a more robust analysis for comparing models for real world implementation [6-10].
The main highlights of this manuscript are: (1) Analysis of the concentration of PM2.5 and air pollutants in the air of the eastern peninsular port city of India, on yearly and seasonal bases; (2) Evaluation of the correlation between PM2.5 concentration, air pollutants and meteorological parameters for the port city of Visakhapatnam; (3) Observation of the CatBoost prediction model as the most efficient prediction model for assessing and predicting the concentration of PM2.5 in the air; (4) Analysis of high PM2.5 concentration prediction results for the period under observation.

Study Area
The study area, Visakhapatnam, a port city on the eastern side of peninsular India, with latitude 17°42′15″ N and longitude 83°17′52″ E, is located in the Indian state of Andhra Pradesh. The monitoring station for data collection was set up by the Andhra Pradesh Pollution Control Board (APPCB) [16]. Visakhapatnam is the largest city of Andhra Pradesh and the second largest east coast city in India, lying between the coast of the Bay of Bengal and the Eastern Ghats. According to the 2011 Indian census, Visakhapatnam had a population of 1,728,128 with a population density of 18,480/km² [17]. Figure 2 shows the area under observation and the wind rose diagram of the observation period.
Visakhapatnam observes tropical wet and dry climates during the year, with three dominant seasons: a summer season from March to June, a monsoon season from July to September, and a winter season from October to February. Though the summer extends from March to June, the maximum temperature is observed mainly in the month of May. The monsoon season extends from July to September, where an annual average rainfall of 44.05 inches was witnessed [18]. During winter, the minimum temperature is observed mainly in the month of January. The annual mean temperature of the city varies from 24.7 to 30.6 °C and observes an RH in the range of 68-80% [18].
During the period of observation, wind in the city blew with a mean speed of 2.32 m/s in the direction of southwest (213.80 degrees, mean direction). The wind rose diagram in Figure 2 (based on a 24 h mean value) represents the direction of the wind, speed of wind and wind frequency of the location under observation for 2018-2019. The extent of the spoke determines the frequency of wind blowing in a specific direction. It is noted that the current PM2.5 concentration in Visakhapatnam air is 8 times above the WHO annual air quality guideline value [2]. Due to the extreme variation in climatic conditions and a very high concentration of PM2.5, there is a need to analyze ambient air quality, in addition to the impact of climatic factors, on the air quality of the city.

Data Description
In the present work, the data for air pollutants and PM2.5 concentration in the air, along with meteorological parameters, were collected for two consecutive years, 2018-2019. Visakhapatnam has nine monitoring stations, located at Industrial Estate Marripalem, Parawada, GVMC, Raitu Bajar, Police Barracks, Pedagantyada Gajuwada, Naval Area, Seethammadhara and Ganapuram Area, to measure the air quality index of the city. The original readings of air pollutant concentration and meteorological parameters were provided by the Central Pollution Control Board (CPCB) website [2] and APPCB [16] on 30 min, 1 h, 4 h, 8 h, 24 h and annual bases. For the present study, 24 h mean values of the concentration of air pollutants and meteorological parameters were used as the primary variables for our prediction models. The raw dataset contained the record of air pollutants and meteorological parameters for 730 days. Of these days, the records for 32 days (19 days in 2018, and 13 days in 2019) were missing entirely (due to power failure, device failure or other reasons). After removing these 32 days, approximately 5% to 6% of the values of various parameters were still missing. The missing values were imputed using the K-Nearest Neighbour (KNN) imputation method, in which a missing value is replaced by the mean value from the k nearest neighbors. A value of k = 1, with the heterogeneous Euclidean overlap metric (HEOM) as the distance, was chosen for imputing the missing values of the presented dataset. After careful processing and missing value imputation, the observational record of air pollutants and meteorological parameters for 698 days was considered for our PM2.5 concentration analysis. The statistical descriptions of the primary variables considered for analysis are presented in Table 1. For the period under observation, the city observed a mean PM2.5 concentration of 48.63 µg/m³ with a standard deviation of 30.05 µg/m³. Similarly, Table 1 shows the mean, standard deviation, minimum and maximum values of the air pollutants and meteorological parameters.
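The imputation step can be illustrated with a minimal sketch, assuming purely numeric attributes, for which HEOM reduces to a range-normalised difference per attribute and assigns the maximal distance of 1 whenever either value is missing. The function names and the sample records below are hypothetical and are not taken from the study's dataset or code.

```python
import math

def heom_distance(a, b, ranges):
    """HEOM distance between two numeric records; None marks a missing value."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            d = 1.0                  # maximal per-attribute distance if a value is unknown
        elif r == 0:
            d = 0.0                  # a constant attribute cannot separate records
        else:
            d = abs(x - y) / r       # range-normalised difference for numeric attributes
        total += d * d
    return math.sqrt(total)

def knn_impute(rows, k=1):
    """Replace each missing entry by the mean of that attribute over the k nearest donors."""
    n_cols = len(rows[0])
    ranges = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        ranges.append(max(observed) - min(observed))
    completed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for c in range(n_cols):
            if row[c] is not None:
                continue
            # Donors are the other rows in which attribute c was actually observed.
            donors = sorted(
                (heom_distance(row, other, ranges), j)
                for j, other in enumerate(rows)
                if j != i and other[c] is not None
            )
            nearest = donors[:k]
            if nearest:
                completed[i][c] = sum(rows[j][c] for _, j in nearest) / len(nearest)
    return completed

# Hypothetical daily records: [PM2.5, CO, wind speed]; None marks a missing reading.
records = [
    [40.0, 0.8, 2.5],
    [90.0, 1.6, 1.2],
    [42.0, None, 2.4],
    [88.0, 1.5, None],
]
filled = knn_impute(records, k=1)
```

With k = 1 the imputed value is simply copied from the single most similar day, so no averaging across dissimilar days takes place.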
However, despite producing good predictions, each regression model has its own advantages and limitations. Table 2 presents the pros and cons of the regression models considered for the prediction of PM2.5 concentration. The entries of Table 2 that survive in this text are:
• Good for heterogeneous data, but may not be the optimal learner for homogeneous data.
• Voting regression combines the performance of many models, so poor performance of one model can be offset by strong performance of other models; its performance is not largely affected by one strong/weak model.
• Sensitive to outliers, noisy and missing data; finding the optimal value of k is challenging.

CatBoost (Based on Gradient Boosting Algorithm) Model Description
Gradient boosting is an effective machine learning technique for dealing with noisy data, diverse features and complex correlated information. Iteratively, the technique combines weak machine learning models with the aid of gradient descent in function space [34]. The gradient boosting based CatBoost (CB) model was proposed by Yandex Company, and the model utilizes oblivious decision trees as the base learner [12]. The decision trees are implemented for regression. Each tree represents a division of the feature space together with an output value. Decision rules (splitting criteria) are used to divide the feature space. An individual splitting criterion is a pair $p = (q, m)$ consisting of a feature indicator $q \in \{1, 2, \ldots, n\}$ and a threshold value $m \in \mathbb{R}$. On implementing the splitting criterion, a set of feature vectors $X$ is divided into two disjoint subsets $X^C$ and $X^D$, so that every $x = (x_1, x_2, \ldots, x_n) \in X$ is assigned to one of the two subsets by comparing its component $x_q$ with the threshold $m$. After implementing the splitting criterion on $e$ disjoint sets $X_1, X_2, \ldots, X_e \subset \mathbb{R}^n$, we obtain $2e$ disjoint sets $X_1^C, X_1^D, X_2^C, X_2^D, \ldots, X_e^C, X_e^D$. For a specified collection of sets $N = \{X_1, X_2, \ldots, X_e\}$ and the target variable $Y : \mathbb{R}^n \to \mathbb{R}$, the splitting criterion is selected by means of a function $G$, which estimates the optimality of the splitting criterion $p$ on the collection $N$ with respect to the target variable $Y$. For an oblivious decision tree, $G$ is defined in terms of $Y(X_a)$, the target variable score with respect to the sample $X_a$ [12].
In contrast to other regression models, the CB model has the following advantages:
(a) Categorical features: The model is capable of handling categorical features. In conventional gradient boosting decision tree-based algorithms, categorical features are replaced by their mean label value. If mean values are used to characterize features, this gives rise to a conditional shift effect [35]. In CB, however, an approach known as greedy target statistics is employed, and the model incorporates prior values into the greedy target statistics. The employed technique reduces overfitting with minimum information loss;
(b) Combining features: CB uses a greedy approach to combine the multiple categorical features, and their combinations already used in the current tree, during the formation of a new split. All the splits in the decision tree are considered as categories with two disjoint values and are employed during the combination;
(c) Fast scoring: The CB models are fast scorers. They are based on oblivious decision trees, which are balanced and less inclined to overfitting.
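How an oblivious decision tree scores a sample can be sketched as follows. This is an illustrative sketch rather than CatBoost's implementation: the same pair (q, m) is applied to every node of a level, so each comparison contributes one bit of the leaf index. The direction of the comparison, the 0-based feature indices, the feature choice and the leaf values are all assumptions made for the example.

```python
def leaf_index(x, splits):
    """Map a feature vector to a leaf of an oblivious decision tree.

    `splits` holds one (feature index q, threshold m) pair per tree level.
    Because every node on a level shares the same pair, the path through the
    tree is fully described by one bit per level.
    """
    index = 0
    for q, m in splits:
        bit = 1 if x[q] > m else 0   # assumption: values above m take the "1" branch
        index = (index << 1) | bit
    return index

def predict(x, splits, leaf_values):
    """Return the output value stored in the leaf reached by x."""
    return leaf_values[leaf_index(x, splits)]

# Hypothetical depth-2 tree over [wind speed (m/s), CO (mg/m3)].
splits = [(0, 2.0), (1, 1.0)]
leaf_values = [70.0, 110.0, 30.0, 55.0]   # 2**depth leaves, illustrative values only

calm_polluted = predict([1.5, 1.4], splits, leaf_values)
windy_clean = predict([3.0, 0.5], splits, leaf_values)
```

Because the leaf is found with a fixed number of comparisons and no branching on earlier outcomes, scoring is fast, which is the property referred to in advantage (c).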

Performance Measurement Indicators
Evaluation metrics for verification of high PM2.5 concentration: The National Ambient Air Quality Standards (NAAQS) [36] were developed by the Central Pollution Control Board, Ministry of Environment and Forests (Government of India), to regulate pollutant emissions into the air. According to the standards, a mean 24 h PM2.5 concentration of 60 µg/m³ was classified as a high concentration. For evaluating the high PM2.5 concentration, the following evaluation parameters were used: hit rate (HR), false alarm rate (FAR), threat score or critical success index (CSI), and true skill statistics (TSS). These parameters were evaluated using the contingency table shown in Table 3, and are presented in Table 4. The parameters were defined in terms of "Hit", "Miss", "False Alarm" and "Correct Rejection". "Hit" and "Correct Rejection" were the possible cases when the prediction was accurate, and "False Alarm" and "Miss" were the possible cases when the prediction was not accurate. The classification accuracy of the forecasting model was assumed to be good if "Hit" and "Correct Rejection" cases were predominant, with very few cases of "False Alarm" and "Miss".
Evaluation metrics for ML regression models: The prediction performance of the regression models has been evaluated in terms of R² score, also known as the "coefficient of determination", MedAE, MAPE, RMSE, and MAE. Table 4 provides the performance measurement indicators utilized to validate the proposed prediction model. The entries of Table 4 that survive in this text are:
• FAR: Measures the ratio of false alarms and gives an indication of the occurrence of an event when there is no event. Its value ranges between 0 and 1; a value close to 0 indicates better prediction.
• CSI: Takes hits, misses and false alarms into account together. Its value ranges between 0 and 1; a value close to 1 indicates excellent prediction performance. $\mathrm{CSI} = \frac{a}{a + b + c}$
• TSS: Determines the ability of the model to distinguish between "Yes" and "No" cases. Its value ranges between −1 and 1, with 1 indicating a perfect forecast, 0 defining a standard forecast, and a negative value indicating a below-standard forecast.
• R² score: Provides a degree of discrepancy in dependent variables.
• MedAE: Provides the median value of the absolute difference between forecasted and true values. The MedAE is least influenced by outliers.
Here $A_j$ and $\hat{A}_j$ are the true and predicted values of the dependent variable for the jth sample, respectively; $s$ is the total number of samples; $\emptyset$ is a very small positive number that keeps the result defined if $A_j = 0$; and the mean $\bar{A}$ is given by $\bar{A} = \frac{1}{s} \sum_{j=1}^{s} A_j$.
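The verification scores can be sketched as below. Since Table 3 is not reproduced in this text, the layout a = hits, b = false alarms, c = misses, d = correct rejections is an assumption (it is consistent with the CSI expression a/(a + b + c)). The HR, FAR and TSS expressions are the commonly used forecast-verification definitions, with FAR taken here as the false alarm ratio b/(a + b); the study may instead use b/(b + d). Treating only values strictly above 60 µg/m³ as "high" is also an assumption, and the sample values are hypothetical.

```python
def contingency_counts(observed, predicted, threshold=60.0):
    """Count hits (a), false alarms (b), misses (c) and correct rejections (d)."""
    a = b = c = d = 0
    for obs, pred in zip(observed, predicted):
        event = obs > threshold     # assumption: "high" means strictly above the limit
        alarm = pred > threshold
        if event and alarm:
            a += 1
        elif alarm:
            b += 1
        elif event:
            c += 1
        else:
            d += 1
    return a, b, c, d

def skill_scores(a, b, c, d):
    """Standard verification scores; denominators are assumed to be non-zero."""
    return {
        "HR": a / (a + c),                       # share of observed events that were predicted
        "FAR": b / (a + b),                      # share of alarms that were false
        "CSI": a / (a + b + c),                  # matches the CSI expression in Table 4
        "TSS": a / (a + c) - b / (b + d),        # hit rate minus probability of false detection
    }

# Hypothetical 24 h mean PM2.5 values in µg/m³.
observed = [75.0, 40.0, 90.0, 55.0, 30.0, 65.0]
predicted = [70.0, 62.0, 58.0, 50.0, 35.0, 80.0]
counts = contingency_counts(observed, predicted)
scores = skill_scores(*counts)
```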

Analysis of Results
In this section, we report a comprehensive analysis of the results, which comprises Sections 5.1-5.4. Section 5.1 presents the correlation of PM2.5 concentration with air pollutants and meteorological parameters. Seasonal variation in PM2.5 concentration, along with other parameters, is presented in Section 5.2. PM2.5 concentration prediction using machine learning based regression models is reported in Section 5.3, and Section 5.4 presents an evaluation of the results for verification of higher concentration.

Correlation of PM2.5 Concentration with Air Pollutants and Meteorological Parameters
To statistically explore the relation between the concentration of PM2.5 and the concentrations of air pollutants and meteorological parameters, the correlation coefficients were calculated using Pearson's correlation method, and are presented in Table 5. The Pearson's correlation coefficient for two variables x and y can be evaluated as:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of x and y, and n is the number of paired observations. Positive and negative correlations between the concentration of PM2.5 and the concentrations of air pollutants and meteorological parameters were observed.
The air pollutants CO, NO2 and NOx showed a strong correlation with PM2.5 concentration. Among the meteorological parameters, WS, BP, and RH attained significant correlation with PM2.5 concentration. The air pollutant C7H8 showed a very low correlation value with PM2.5 concentration, which may have occurred due to fewer C7H8 emission sources in the city. An increase or decrease in the concentration of the air pollutants was accompanied by a corresponding increase or decrease in PM2.5 concentration. All the meteorological parameters except BP and temperature exhibited a negative correlation with PM2.5 concentration. Temperature showed a very low value of Pearson's coefficient, which was probably due to insignificant variation in mean temperature values during the major seasons. The city observed an approximate mean temperature of 29 °C during the major seasons. It was noticed that WS showed a strong negative correlation with PM2.5 concentration.
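Pearson's correlation coefficient can be computed directly from its definition, as in the following minimal sketch. The wind speed and PM2.5 values are hypothetical and chosen to be perfectly linearly related, so that the sign convention is easy to see; they are not values from Table 5.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (numerator).
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (denominator terms).
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(spread_x * spread_y)

# Hypothetical daily means: PM2.5 falls as wind speed rises.
wind_speed = [1.0, 2.0, 3.0, 4.0]
pm25 = [80.0, 60.0, 40.0, 20.0]
r = pearson_r(wind_speed, pm25)
```

A value near −1, as in this constructed example, corresponds to the kind of strong negative relation reported between WS and PM2.5 concentration.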

Seasonal and Annual Behaviour of PM2.5 Concentration with Air Pollutants and Meteorological Parameters
An in-depth analysis of the seasonal and annual behavior of PM2.5 concentration is presented in Sections 5.2.1 and 5.2.2.

Seasonal Behavior of PM2.5 Concentration with Air Pollutants and Meteorological Parameters
To assess the seasonal impact of air pollutants and meteorological parameters on PM2.5 concentration, a detailed statistical analysis is presented in Table 6. It was observed that the air pollutant concentration showed significant seasonal variations. A rise in PM2.5 concentration during December-February, and a decrease from March to August, was observed for the study period. The seasonal variations in PM2.5 and air pollutant concentration were primarily due to the varying speed and direction of the wind, seasonal variation in SR, and the lack of precipitation in winter, which reduces surface vertical mixing and can lead to limited dilution and dispersion [37].
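A seasonal summary of this kind can be sketched by grouping daily values according to the three seasons defined for the study area (summer: March-June, monsoon: July-September, winter: October-February). The helper name and the sample records below are hypothetical and do not reproduce Table 6.

```python
from statistics import mean

# Month-to-season mapping following the season definitions given for Visakhapatnam.
SEASON_BY_MONTH = {
    3: "summer", 4: "summer", 5: "summer", 6: "summer",
    7: "monsoon", 8: "monsoon", 9: "monsoon",
    10: "winter", 11: "winter", 12: "winter", 1: "winter", 2: "winter",
}

def seasonal_summary(records):
    """Group (month, PM2.5) pairs by season and report mean, minimum and maximum."""
    groups = {}
    for month, pm25 in records:
        groups.setdefault(SEASON_BY_MONTH[month], []).append(pm25)
    return {
        season: {"mean": mean(values), "min": min(values), "max": max(values)}
        for season, values in groups.items()
    }

# Hypothetical (month, 24 h mean PM2.5 in µg/m³) records.
records = [(1, 120.0), (2, 80.0), (5, 30.0), (6, 40.0), (8, 20.0), (12, 100.0)]
summary = seasonal_summary(records)
```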
It was found that the air pollutant and PM2.5 concentrations exhibit maximum variation in the winter season. The maximum value of PM2.5 concentration recorded during the winter season was 202.52 µg/m³, which was 106.62 µg/m³ higher than the summer maximum of 95.90 µg/m³. During the winter season, a higher air pollutant concentration was noticed. The maximum values observed for the air pollutants CO, NO2 and C6H6 were 2.18 mg/m³, 131.53 µg/m³ and 11.69 µg/m³, respectively. Due to an increase in the concentration of CO and oxides of nitrogen, a rise in PM2.5 concentration was observed. The rise in air pollutant concentration was probably due to the slow WS and the temperature inversion effect in the winter season. As a result, the air pollutants and PM2.5 particles were trapped near the earth's surface which, in turn, increased the PM2.5 concentration [38,39].
During the summer season, the PM2.5 concentration (24 h) varied substantially, from 9.09 µg/m³ to 95.90 µg/m³, over the period of observation. The season reported high SR (mean value = 167.18 W/m²; standard deviation = 56.02 W/m²) and high WS (mean value = 2.79 m/s; standard deviation = 0.73 m/s). Because of the high SR received by the earth's near-surface atmosphere, the near-surface temperature increased, promoting upward movement and eventually diffusing the PM2.5 concentration. In addition to higher SR, the high WS and high precipitation diluted the air pollutant concentration at the surface and caused a significant decrease in air pollutant and PM2.5 concentrations during the summer season [38,39].
It was noted that the SO2 concentration showed little seasonal variability and remained approximately constant throughout the year. This was possibly due to SO2 emission sources (sulphur-containing fuels such as oil, coal and diesel) which constantly emit SO2 in the city. The primary sources of CO, SO2 and nitrate aerosols for the city are presumed to be power generation plants, engines in vehicles and ships, ship-yard industries and steel plants. The key source of C6H6 emission in the city is probably the heavy chemical and petroleum industries, as C6H6 is a natural constituent of petrol and crude oil and is produced as a by-product during the oil refining process. In addition to major industries, the city is surrounded by many small and medium-size industries which add to the concentration of air pollutants. Though heavy, medium and small industries contribute to increasing air pollution, the city has the advantage of sea breezes, by which most of the air pollutant emissions are dispersed to sea and the impact of air pollutants on air quality is reduced. A relatively constant NH3 concentration was also observed during the assessment period, which was probably due to the continuous emission of ammonia gases from industrial processes and vehicular emissions.
It was found that the city experiences greater humidity during the monsoon season, i.e., from 63.06% to 85.23%, with a mean humidity of 74.68%, whereas mean humidities of 70% and 71.89% were observed in the winter and summer seasons, respectively. As noted from Table 5, the meteorological parameter 'humidity' exhibits a negative correlation with PM2.5 concentration. In the highly humid season, raindrops influence gaseous air pollutants through absorption and collision. This leads to wet deposition and reduces the PM2.5 concentration [40,41].
As observed from the wind rose diagram (Figure 3) and from Table 6, slow and infrequent winds blow during the winter season. To present a detailed analysis of WD, the wind rose diagram is plotted in sixteen directions from N to NNE (counterclockwise). The concentric circles in the wind rose diagram represent the probability percentage of wind blow, and are labeled with percentages increasing outward. As shown in Figure 3, the probability percentage concentric rings are placed at 5% intervals. For analysis, the WS is divided into nine bins, and the bins are differentiated with colors ranging from red to brown. The length of a spoke around the circle is related to the frequency of time that the wind blows from a particular direction. The dominant wind directions during the winter season were found to be SSE and S, with a small secondary lobe in the SSW direction and minor lobes in the SW and SE directions, indicating less frequent wind blow in these minor lobe directions. The winter season observed a mean WS of 1.85 m/s, and approximately 65% of the time the wind blew in the direction of SSW to SSE (230-305 degrees). During the season, high-speed winds were infrequent, accounting for only 1-2% of the observations.
During the monsoon season, 45% of the time the wind blew with a speed greater than 3.18 m/s, and approximately 8-10% of the time the wind blew with a speed of 4.23 m/s to 4.75 m/s. However, infrequent WS greater than 4.75 m/s blew for approximately 1-2% of the entire season. Similar high-speed winds were observed for the summer season, with SSW as the prominent wind direction. For the summer and monsoon seasons, the high-speed winds swept the air pollutants away from the city; hence, a low concentration of PM2.5 was observed during these months.
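The directional binning behind a sixteen-sector wind rose can be sketched as follows, assuming directions are given in degrees measured clockwise from north and that each sector spans 22.5 degrees centred on its compass bearing. The function names and the sample directions are hypothetical; only the 213.80 degree mean direction is taken from the text.

```python
SECTORS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def sector_of(direction_deg):
    """Return the 16-point compass sector containing a direction in degrees."""
    # Shift by half a sector (11.25 degrees) so that each sector is centred on its bearing.
    index = int((direction_deg % 360.0 + 11.25) // 22.5) % 16
    return SECTORS[index]

def direction_frequencies(directions):
    """Percentage of observations per sector, i.e. the spoke lengths of a wind rose."""
    counts = {name: 0 for name in SECTORS}
    for direction in directions:
        counts[sector_of(direction)] += 1
    total = len(directions)
    return {name: 100.0 * count / total for name, count in counts.items()}

mean_direction_sector = sector_of(213.80)   # mean direction reported for 2018-2019
# Hypothetical daily mean wind directions for a few winter days.
frequencies = direction_frequencies([170.0, 175.0, 200.0, 160.0])
```

Splitting each sector further by wind speed bin would give the colored segments along each spoke.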

Annual Behavior of PM2.5 Concentration, Air Pollutants and Meteorological Parameters
Annual variation in PM 2.5 concentration, air pollutant and meteorological parameters for 2018 and 2019 are presented in Table 7. As observed, a minor increase in PM 2.

Machine Learning-Based PM2.5 Concentration Estimation
In the present study, the performance of the machine learning based regression models employed to estimate PM2.5 concentration in the air was validated using data on eight air pollutants and six meteorological parameters collected for the years 2018-2019 for Visakhapatnam. The air pollutants and meteorological parameters were utilized as independent input variables to train 15 distinct machine learning regression models to predict PM2.5 concentration. The model parameters were tuned using a grid search optimization technique. The dataset was randomly divided into training and test datasets with a ratio of 80-20%, namely, 80% of the observations were used to train the model and 20% of the observations were used to test the model. The experiments were run using Python 3.8 open-source software on an IBM PC with an Intel Core i7-6700 CPU @ 3.40 GHz and 8 GB RAM. Table 8 presents the performance metrics of the 15 regression models for the prediction of PM2.5 concentration. It was observed that the VR and gradient boosting regression models (CB, LGBM and XGB) showed notable performance, in contrast to the other presented regression models.
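The random 80-20 split and an exhaustive grid search can be sketched with the standard library alone. This is a sketch under stated assumptions, not the study's pipeline: a one-dimensional nearest-neighbour regressor stands in for the fifteen models, the grid is hypothetical, the data are synthetic, and each parameter setting is scored on a validation part held out from the training data, a detail the text does not specify.

```python
import itertools
import math
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def knn_regress(train, x, k):
    """Stand-in model: mean target of the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(target for _, target in nearest) / len(nearest)

def rmse(train, held_out, k):
    """Root mean square error of the stand-in model on held-out rows."""
    errors = [(knn_regress(train, x, k) - y) ** 2 for x, y in held_out]
    return math.sqrt(sum(errors) / len(errors))

def grid_search(train, grid, seed=42):
    """Try every parameter combination and keep the one with the lowest validation RMSE."""
    fit_part, valid_part = train_test_split(train, test_fraction=0.25, seed=seed)
    names = list(grid)
    best_score, best_params = float("inf"), None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = rmse(fit_part, valid_part, **params)
        if score < best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Synthetic (wind speed, PM2.5) rows, one per day of the 698-day record.
rng = random.Random(0)
rows = [(x, 80.0 - 15.0 * x + rng.gauss(0.0, 5.0))
        for x in (rng.uniform(0.5, 5.0) for _ in range(698))]

train_rows, test_rows = train_test_split(rows)
best_params, validation_rmse = grid_search(train_rows, {"k": [1, 5, 15]})
test_rmse = rmse(train_rows, test_rows, **best_params)
```

Keeping the test rows out of the grid search means the final test error is not influenced by the parameter selection.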

Performance of Regression Models
From Table 8, it was noted that the gradient boosting and VR models achieved notable R² scores (0.71 to 0.81). A higher R² score signifies that the model fits the dataset values well. Furthermore, the gradient boosting and VR prediction models achieved low error scores in terms of RMSE (11.42 µg/m³ to 14.03 µg/m³) and MAE (9.07 µg/m³ to 10.34 µg/m³). The CB prediction model outperformed the other models in terms of R² score, RMSE, MAE, MAPE and MedAE. The model yielded an R² score of 0.81, RMSE of 11.42 µg/m³, MAPE of 0.29, and MAE of 9.07 µg/m³. The LGBM model performed worse than the CB model, with an R² score of 0.76, RMSE of 12.94 µg/m³, MAPE of 0.29 and MAE of 9.85 µg/m³. Comparable accuracies were observed for the VR and XGB models; however, these models showed lower prediction performance than the CB and LGBM regression models.
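For reference, the five scores reported above can be computed from paired observed and predicted values as in the following minimal sketch. It assumes the standard definitions of R², RMSE, MAE, MedAE and MAPE, with MAPE expressed as a fraction rather than a percentage, which is consistent with the reported value of 0.29. The sample values are hypothetical and are not taken from Table 8.

```python
import math
import statistics


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, MedAE and MAPE for paired value lists."""
    errors = [o - p for o, p in zip(observed, predicted)]
    abs_errors = [abs(e) for e in errors]
    mean_obs = statistics.fmean(observed)
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total sum of squares
    rel_errors = [a / abs(o) for a, o in zip(abs_errors, observed)]
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / len(errors)),
        "MAE": statistics.fmean(abs_errors),
        "MedAE": statistics.median(abs_errors),
        "MAPE": statistics.fmean(rel_errors),  # fraction, not percent
    }


# Hypothetical PM2.5 values in µg/m³, for illustration only.
observed = [40.0, 60.0, 80.0, 100.0]
predicted = [45.0, 55.0, 90.0, 100.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

MedAE uses the median of the absolute errors, so a few large errors affect it far less than they affect RMSE.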
As observed, in addition to a high R², the CB regression model predicted the PM2.5 concentration with the minimum error amongst all the presented models. It was found that the CB model attained the minimum MedAE (6.95 µg/m³) and MAPE (0.29), suggesting that the model was robust to the outliers present in the dataset. Very low performance was observed for the QR and KNN regression models. The low R² scores and high errors attained by these models indicated their inability to fit the dataset values for predicting PM2.5 concentration. However, the RF model showed slightly improved performance in comparison to the KNN model. The lowest prediction performance was observed for the QR model, with a low R² score (0.47) and unfavorably high error in terms of RMSE. It was observed that the penalized-regularization model showed enhanced prediction performance in contrast to MLP, and diminished performance in comparison to the voting and gradient boosting models. Comparable prediction performance with slightly increased MedAE was noted for the PLS prediction model.
Figures 4 and 5 present the regression plots and residual error plots for the test dataset. Figure 4 shows the regression plots mapped between observed and predicted PM2.5 concentrations. As observed from Figure 4, compared with the other models, the data points for the CB model were highly concentrated on the 'fitting line' in the regression curve, indicating that the values were well fitted by the model. A lower concentration of data points was observed on the 'fitting line' for the LGBM, XGB and VR regression models, indicating that the values were not as well fitted by these models.
Figure 5 presents the residual error plots with their histograms, showing the residuals of the regression models evaluated on the test dataset. For the XGB and VR regression models, the residuals and histogram peaks lie at around 0 to −20, yielding negatively biased results. This indicates that the predictions of these models were too high, i.e., the models tend to predict a higher PM2.5 concentration than was observed. The LGBM model shows a satisfactory improvement, with a random and dispersed distribution of residuals. However, its residuals remain slightly negatively biased, with a moderate difference between predicted and observed PM2.5 concentrations. Amongst all fifteen implemented models, the CB model shows the strongest prediction performance, with minimum residual error. As observed from the residual error plot in Figure 5, the CB model is the least biased towards positive or negative residuals and shows a random residual distribution. The residual error in CB lies in the range of −20 to 30, with most residuals in the range of −10 to 10.
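The sign convention behind this residual analysis can be made explicit with a short sketch. It assumes that a residual is defined as the observed value minus the predicted value, so that a negative mean residual corresponds to over-prediction, as described above. The function names, the bin width of 10 and the sample values are hypothetical.

```python
import statistics
from collections import Counter


def residuals(observed, predicted):
    """Residual = observed - predicted (negative means over-prediction)."""
    return [o - p for o, p in zip(observed, predicted)]


def residual_summary(res, bin_width=10):
    """Summarize bias and spread, with a coarse histogram of residuals."""
    # Each residual is assigned to the bin [k * width, (k + 1) * width).
    bins = Counter(int(r // bin_width) * bin_width for r in res)
    return {
        "mean": statistics.fmean(res),
        "min": min(res),
        "max": max(res),
        "histogram": dict(sorted(bins.items())),
    }


# Hypothetical PM2.5 values in µg/m³, for illustration only.
observed = [50, 62, 71, 45, 90]
predicted = [58, 70, 69, 55, 88]
summary = residual_summary(residuals(observed, predicted))
print(summary)
```

A histogram whose peak sits below zero, as in this toy sample, is the pattern described for the XGB and VR models. A peak centered on zero with a symmetric spread is the pattern described for the CB model.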
Time series and scatter plots of true and predicted PM2.5 concentrations for the test dataset are presented in Figure 6. As marked in Figure 6a, the models show adequate prediction performance and, compared with the other models, the CB model follows the true PM2.5 concentration most closely. The scatter plot in Figure 6b shows that, in contrast to the VR, XGB and LGBM models, the CB model predicts PM2.5 concentrations close to the observed/true values at most time instants. For the CB regression model, Figures 5 and 6 show good agreement between the observed/true and predicted PM2.5 concentrations.

Impact of Input Variables (Air Pollutant Concentration and Meteorological Parameters) on the CB Model
To interpret the CB model, the impact of the feature variables (air pollutants and meteorological parameters) on PM2.5 concentration was evaluated and is presented in Figure 7. The influence of the feature variables was measured using Shapley additive explanations (SHAP) values [42].
The SHAP framework expresses the prediction as a linear combination of binary variables that describe whether an input feature is present in the model or not. The framework reports results in terms of Shapley values, which define the importance and impact of features on the prediction model while satisfying three required properties: (a) local accuracy, (b) missingness, and (c) consistency [42]. In Figure 7, the y-axis lists the input variables, arranged according to their importance. The values on the x-axis are SHAP values, and each point on the plot is the Shapley value of an input variable for one instance. The color gradient (blue to red) indicates the feature value from low to high. The larger the magnitude of the SHAP value, the greater the variable's impact on the model. As shown in Figure 7, the variables NH3, BP, CO and NO2 significantly influenced the predicted PM2.5 concentration with positive correlation, i.e., the predicted value increased with high feature values of NH3, BP, CO and NO2 and, conversely, decreased with low feature values of these variables.
The variables SO2, NOx and C6H6 also positively influenced the predicted results, but with less impact. The variables WS, WD, SR, temperature, C7H8 and RH influenced the predicted PM2.5 concentration with negative correlation, i.e., higher values of WS, WD, SR, temperature, C7H8 and RH tended to decrease the predicted value, and vice versa.
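The local accuracy property mentioned above can be illustrated with a brute-force computation of exact Shapley values. The sketch below does not use the SHAP library or the CB model. It averages marginal contributions over all feature orderings for a hypothetical three-feature linear model, with "absent" features set to a single assumed baseline point. The coefficients, feature values and baseline are invented for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values by averaging over all feature orderings."""
    names = list(instance)
    totals = {n: 0.0 for n in names}
    for order in permutations(names):
        current = dict(baseline)        # start with every feature "absent"
        previous = model(current)
        for name in order:
            current[name] = instance[name]   # switch this feature "on"
            value = model(current)
            totals[name] += value - previous  # marginal contribution
            previous = value
    n_orders = factorial(len(names))
    return {n: t / n_orders for n, t in totals.items()}


def toy_model(x):
    # Hypothetical model: NH3 and CO raise the prediction, WS lowers it.
    return 30.0 + 0.8 * x["NH3"] + 20.0 * x["CO"] - 5.0 * x["WS"]


instance = {"NH3": 25.0, "CO": 1.2, "WS": 1.5}   # one hypothetical day
baseline = {"NH3": 15.0, "CO": 0.8, "WS": 3.0}   # assumed background values
phi = shapley_values(toy_model, instance, baseline)
print(phi)

# Local accuracy: the contributions add up to prediction minus baseline output.
gap = toy_model(instance) - toy_model(baseline)
print(sum(phi.values()), gap)
```

In this toy case a wind speed below the baseline yields a positive contribution, which mirrors the negative correlation reported for WS. The brute-force enumeration grows factorially with the number of features, so it is only practical for small examples.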

Evaluation of High PM2.5 Concentration
During the period of observation, the city was exposed to higher PM2.5 concentrations in 2019 than in 2018. Among the three dominant seasons of the city, the winter season had the maximum number of days with PM2.5 concentrations above the NAAQS limit. The NAAQS classify a 24 h mean PM2.5 concentration exceeding 60 µg/m³ as a high concentration. In contrast, the summer and monsoon seasons had very few days with PM2.5 concentrations above the NAAQS limit. Table 9 presents the prediction results for high PM2.5 concentration using the CB regression model for the period of observation. The high-concentration prediction performance was evaluated on the test dataset using the HR, FAR, CSI, TSS and OR measurement indices. The results showed that the model achieved a high hit rate of 0.85 for accurate predictions and a low score of 0.02 for false alarms. The model also achieved high CSI and TSS scores, indicating the model's excellent performance and its ability to correctly distinguish between "Yes" and "No" cases. Moreover, the CSI represents the model's sensitivity to correct forecasts of high concentration, and the high CSI value indicated that the high-concentration cases of PM2.5 were generally predicted correctly.
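The indices named above can be derived from a 2 × 2 contingency table of observed and predicted exceedances, as in the sketch below. The formulas are the definitions commonly used in forecast verification and are an assumption here, since this section does not state them. In particular, FAR is taken as the false alarm ratio, i.e., false alarms divided by all predicted exceedances. The sample values are hypothetical.

```python
def exceedance_scores(observed, predicted, threshold=60.0):
    """Score predictions of 'concentration above threshold' events."""
    hits = misses = false_alarms = correct_negatives = 0
    for o, p in zip(observed, predicted):
        obs_high, pred_high = o > threshold, p > threshold
        if obs_high and pred_high:
            hits += 1
        elif obs_high:
            misses += 1
        elif pred_high:
            false_alarms += 1
        else:
            correct_negatives += 1
    # The ratios below assume each denominator is non-zero.
    hit_rate = hits / (hits + misses)
    pofd = false_alarms / (false_alarms + correct_negatives)
    odds_ratio = (
        (hits * correct_negatives) / (misses * false_alarms)
        if misses and false_alarms else float("inf")
    )
    return {
        "HR": hit_rate,
        "FAR": false_alarms / (hits + false_alarms),   # assumed: ratio form
        "CSI": hits / (hits + misses + false_alarms),
        "TSS": hit_rate - pofd,
        "OR": odds_ratio,
    }


# Hypothetical 24 h mean PM2.5 values in µg/m³, for illustration only.
observed = [75, 82, 40, 55, 90, 30, 65, 48, 58, 20]
predicted = [70, 78, 45, 63, 85, 35, 55, 50, 52, 25]
scores = exceedance_scores(observed, predicted)
print(scores)
```

With these definitions, the CSI ignores correct "No" cases, so it reflects only how well the exceedance days themselves are captured.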
It was found that, for the period under observation, the winter season encountered 166 days with PM2.5 concentrations greater than the NAAQS limit. Approximately 55-60% of days during the winter season witnessed PM2.5 concentrations higher than the prescribed NAAQS standard. In contrast, only approximately 4-5% of days witnessed PM2.5 concentrations higher than the prescribed NAAQS standard in the summer and monsoon seasons. For more than approximately 90 days of the winter and monsoon seasons, the city was under the impact of C6H6 concentrations higher than the prescribed standards.

Conclusions
In the present study, fifteen machine learning models were presented for the analysis and prediction of PM2.5 concentration based on time series data. This study provided detailed insight into the air pollutants and meteorological parameters contributing to PM2.5. We targeted the statistical behavior of 24 h average air pollutant and PM2.5 concentrations, along with meteorological parameters, observed at the eastern coastal city of Visakhapatnam, India. The correlation of PM2.5 with air pollutants and meteorological parameters was determined using Pearson's correlation coefficient. The seasonal behavior of air pollutants, PM2.5 concentration and meteorological parameters was studied by extracting significant information from raw data collected from the Central Pollution Control Board and APPCB. It was deduced that the summer and monsoon seasons showed lower PM2.5 and air pollutant concentrations compared with the winter season. The results revealed that the CB machine learning model is an efficient predictive model. For comparative analysis with other prediction models, we used R² score, RMSE, MAE, MedAE and MAPE as performance parameters. The performance of the CB model was not only better than that of traditional models, such as the linear regression model and MLP, but also better than that of voting and other boosting models such as XGBoost and LightGBM. The prediction of PM2.5 concentration is a challenging task, owing to changing meteorological conditions and pollutant concentrations, yet a degree of credibility in prediction results can be attained using copious data. The present study was carried out over a short period and notable results were achieved; however, significant improvement in forecasting results can be expected by examining the information over a longer period and, in addition, by considering nearby geographical locations.