Air Quality Prediction and Ranking Assessment Based on Bootstrap-XGBoost Algorithm and Ordinal Classification Models

Abstract: Along with the rapid development of industry and the acceleration of urbanisation, air pollution is becoming more serious. Exploring the factors that affect air quality and accurately predicting the air quality index (AQI) are important for improving overall environmental quality and realising green economic development. Machine learning algorithms and statistical models have been widely used in air quality prediction and ranking assessment. In this paper, based on daily air quality data for the city of Xi’an, China, from 1 October 2022 to 30 September 2023, we construct support vector regression (SVR), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), random forest (RF), neural network (NN) and long short-term memory (LSTM) models to analyse the factors influencing the air quality index for Xi’an and to conduct comparative tests. The predicted values and 95% prediction intervals of the AQI for the next 15 days for Xi’an are given based on the Bootstrap-XGBoost algorithm. Further, ordinal logit and ordinal probit regression models are constructed to evaluate and predict the AQI ranks from 1 October 2023 to 15 October 2023 for Xi’an. Finally, some suggestions and policy measures are proposed based on these findings.


Introduction
As industrialisation and urbanisation advance and urban energy consumption increases, the world faces regional and compound air pollution problems. The deterioration of the ecological environment caused by atmospheric pollution has become increasingly evident and is closely related to human health, making it a significant public health problem. Relevant studies [1][2][3] have shown that severe air pollution can cause a variety of respiratory diseases, such as chronic pharyngitis, chronic bronchitis, and bronchial asthma, and that there is a significant correlation between exposure to atmospheric particulate matter and near-surface ozone and the number of morbidities and deaths due to chronic cardiovascular diseases. In addition, severe air pollution exacerbates haze and reduces near-surface visibility, leading to problems such as traffic congestion and flight delays [4]. Addressing this issue is consistent with the Sustainable Development Goals (SDGs) of "good health and well-being" and "sustainable cities and communities", which emphasise the importance of addressing environmental health risks and promoting sustainable urban development. Therefore, it is of great significance to explore how air pollutant concentrations and other factors influence air quality and to establish an efficient and accurate urban air quality prediction model to promote the prevention and control of air pollution in the country.
Currently, methods for analysing and predicting air quality are divided into two main categories: statistical methods and machine learning algorithms. Statistical methods are mostly linear and mainly use traditional modelling strategies to predict air quality [5]. For example, Niu et al. [6] used the auto-regressive moving average model to study the composite air quality index for Chengdu. The results showed that the air quality in Chengdu had no improving trend in the coming period. Jian et al. [7] predicted airborne particulate matter concentrations via the auto-regressive integrated moving average model. Abedi et al. [8] used the auto-regressive distributed lag (ARDL) model to study the relationship between air pollution and respiratory and cardiovascular diseases. The results showed that for every 10-unit increase in the AQI, the number of hospitalisations for cardiovascular diseases increased by 7.3%. Woldu [9] used the ARDL model to study the bidirectional causal relationship between urban globalisation and CO2 emissions in Mozambique.
With its rapid development, new-generation information technology such as big data analytics and machine learning can perform analysis and prediction by learning data patterns and optimising algorithms [10][11][12][13]. Because of their high accuracy and reliability, many scholars use different machine learning algorithms to study and predict air quality. Biancofiore et al. [14] used a recursive neural network model to obtain real-time information on PM10 and PM2.5 in Adriatic coastal cities. Yang et al. [15] used a space-time support vector regression model to predict hourly PM2.5 concentrations. Pawul [16] developed a neural network model for air pollutant prediction based on Polish National Environmental Monitoring System data and meteorological data. Ma et al. [17] performed an impact analysis of air quality based on an extreme gradient boosting model. The results showed that six factors, such as personal income and power plant density, have the greatest impact on air quality in the United States. Zhao et al. [18] integrated forward neural networks and recurrent neural networks to predict air quality hourly in northwestern China. Bekkar et al. [19] performed hour-by-hour forecasting of PM2.5 concentration in Beijing based on a neural network with a long short-term memory network model. Huang et al. [20] used spatio-temporal information to forecast air quality through deep convolutional networks. Samad et al. [21] used five machine learning models, including ridge regression, support vector regression, and random forest, to study pollutant concentrations at monitoring sites. Zhang et al. [22] used a spatial transform network model to forecast PM2.5 concentrations in the next 6, 12, 24, and 48 h in Beijing and Taizhou.
In addition, many other scholars have studied and forecasted air quality ranks. Liu et al. [23] forecasted the air quality ranks for Bayannur based on an integrated model of stepwise regression, principal component analysis, and BP neural network (STEPDISC-PCA-BP). Ratković et al. [24] accurately predicted air quality levels in the next hour in a city in Montenegro by using a hybrid LSTM model. Zhao et al. [25] proposed a detailed examination model of air quality with a co-training semi-supervised learning approach.
This paper models and predicts the daily air quality data for the city of Xi'an, China, from 1 October 2022 to 30 September 2023 by using the Bootstrap-XGBoost algorithm and ordinal classification models. The paper is organised as follows: Section 2 preprocesses the data and draws the stacked plot of air quality percentages to study the seasonality of air pollution in Xi'an; Section 3 constructs the SVR, GBDT, XGBoost, RF, NN, and LSTM models and compares their predictions on the test set, proposes the Bootstrap-XGBoost algorithm by combining the best-performing XGBoost model with the Bootstrap method, and gives the predicted values and prediction intervals of the AQI for the next 15 days in Xi'an; Section 4 forecasts the AQI ranks for the next 15 days based on the ordinal logit and ordinal probit regression models; Section 5 proposes targeted recommendations and improvement measures based on the results of the study. The research framework of this paper is shown in Figure 1.

Data Sources and Preprocessing
This section gives a clear explanation of the data sources used in this study and their descriptive analysis.

Data Sources
The data in this paper come from the official statistics website of the China Air Quality Online Testing Platform (https://www.aqistudy.cn/historydata/ (accessed on 29 November 2023)), and daily air quality data for one year from 1 October 2022 to 30 September 2023 were collected for Xi'an, China.
The data used in this paper include the air quality index (AQI) and the concentrations of six pollutants: sulphur dioxide (SO2), nitrogen dioxide (NO2), fine particulate matter (PM2.5), carbon monoxide (CO), particulate matter (PM10), and ozone (O3). The basic information about these data and the related descriptions are shown in Table 1. In 1999, the United States Environmental Protection Agency (USEPA) established the AQI as a quantitative way of interpreting air quality [26]. The individual indexes of the six criteria pollutants are calculated by Equation (1), and the maximum value is taken as the AQI. The higher the AQI, the more severe the air pollution and the greater the risk to human health.
$$I_p = \frac{I_{Hi} - I_{Lo}}{BP_{Hi} - BP_{Lo}}\,(C_p - BP_{Lo}) + I_{Lo} \tag{1}$$
where $I_p$ is the index for pollutant $p$, $C_p$ is the truncated concentration of pollutant $p$, $BP_{Hi}$ is the concentration breakpoint that is greater than or equal to $C_p$, $BP_{Lo}$ is the concentration breakpoint that is less than or equal to $C_p$, $I_{Hi}$ is the AQI value corresponding to $BP_{Hi}$, and $I_{Lo}$ is the AQI value corresponding to $BP_{Lo}$.
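The interpolation in Equation (1) can be sketched in a few lines of Python. This is only an illustration: the breakpoint table below is an assumed set of 24 h PM2.5 values chosen for the example, and the names `PM25_BREAKPOINTS`, `sub_index`, and `aqi` are hypothetical rather than taken from the paper.

```python
from bisect import bisect_left

# Assumed, illustrative 24 h PM2.5 breakpoints: (concentration, index value).
# Replace with the table from the standard that applies to your data.
PM25_BREAKPOINTS = [(0, 0), (35, 50), (75, 100), (115, 150),
                    (150, 200), (250, 300), (350, 400), (500, 500)]


def sub_index(conc, breakpoints):
    """Equation (1): linear interpolation between the bracketing breakpoints."""
    concs = [bp for bp, _ in breakpoints]
    if not concs[0] <= conc <= concs[-1]:
        raise ValueError("concentration outside the breakpoint table")
    hi = max(1, bisect_left(concs, conc))  # position of the first breakpoint >= conc
    bp_lo, i_lo = breakpoints[hi - 1]
    bp_hi, i_hi = breakpoints[hi]
    return (i_hi - i_lo) / (bp_hi - bp_lo) * (conc - bp_lo) + i_lo


def aqi(sub_indices):
    """The AQI is the largest of the individual pollutant indexes."""
    return max(sub_indices)
```

With this assumed table, a concentration of 55 falls between the breakpoints 35 and 75, so `sub_index(55, PM25_BREAKPOINTS)` interpolates between the index values 50 and 100.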

Data Preprocessing and Seasonal Air Pollution Percentage Analysis
Data preprocessing is the first step in establishing a statistical learning model. This paper uses the interpolation method and the imputed package in R to fill in the missing values. It is stipulated that December to February is winter, March to May is spring, June to August is summer, and September to November is fall. With the help of Tableau software 2024.2.0 (20242.24.0613.1930), the day-by-day air pollution levels in Xi'an, China, were counted, and the proportions of each of the six pollution levels were calculated for each of the four seasons and analysed seasonally, as shown in Figure 2.
Figure 2 shows that the air quality in Xi'an in the past year was mainly "Good", accounting for 45.21% of the total data. In summer, the degree of air pollution was the smallest and the air quality was the best, mainly "Good" and "Excellent". The air quality in spring and fall was mainly "Excellent" and "Good", second only to summer, with an increase in the number of "Mild pollution" days and some "Moderate pollution" days, giving average air quality overall. With the arrival of cold air in winter, days with air quality of "Mild pollution" and above accounted for 16.44%, which is more than twice the number of days with air quality of "Excellent" and "Good". Especially in winter, Xi'an is characterised by severe haze and poor air quality. This is due to the city's geographical location and climate conditions. During the winter months, Xi'an experiences temperature inversions, which trap pollutants close to the ground, leading to a buildup of particulate matter and other air pollutants. Additionally, the increased heating demand in winter leads to greater emissions from coal-fired power plants and residential heating sources, further exacerbating the air quality issues. As a result, air quality in Xi'an is a concern during the fall and winter seasons.
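The seasonal shares described above were produced with Tableau; the same tabulation can be sketched in Python under the month-to-season rule stated in the text. The function name `seasonal_shares` and the `(date, level)` record layout are assumptions made for this example.

```python
from collections import Counter, defaultdict
from datetime import date

# Month-to-season rule as stipulated in the text.
SEASON_BY_MONTH = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "fall", 10: "fall", 11: "fall"}


def seasonal_shares(records):
    """records: iterable of (date, level). Returns {season: {level: share}}."""
    counts = defaultdict(Counter)
    for day, level in records:
        counts[SEASON_BY_MONTH[day.month]][level] += 1
    return {season: {level: n / sum(c.values()) for level, n in c.items()}
            for season, c in counts.items()}
```

Each inner dictionary sums to one, so the values can be plotted directly as a stacked percentage chart per season.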

Empirical Analysis of AQI Prediction
In this section, the selected dataset is empirically analysed by using the SVR, GBDT, XGBoost, RF, NN, and LSTM models. In this case, 80% of the dataset is used as the training set and 20% as the test set.

Analysis Based on SVR Model
SVR is the regression form of the support vector machine, a machine learning algorithm for classification and regression that utilises a kernel function to map the data into a high-dimensional space [27]. It searches for the optimal hyperplane in that space, thus realising classification or regression prediction.
Let a set of training data be $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ is the feature vector of the ith sample and $y_i \in \mathbb{R}$ is the true value of the ith sample. The objective function of the SVR model is expressed in terms of the weight vector $w$, the bias term $b$, the regularisation parameter $C$, and $\|w\|^2$, the squared $L_2$ norm of the weight vector.
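For orientation, the SVR objective is often written in the following $\varepsilon$-insensitive form. This is the textbook formulation, not one confirmed by the text above; the tolerance $\varepsilon$, the sample count $n$, and the linear predictor $w^{T}x_i + b$ are assumptions of this sketch.

```latex
\min_{w,\,b}\ \frac{1}{2}\|w\|^{2}
  + C \sum_{i=1}^{n} \max\bigl(0,\ \lvert y_i - (w^{T}x_i + b)\rvert - \varepsilon\bigr)
```

The first term keeps the weight vector small, while the second penalises only those predictions that miss $y_i$ by more than $\varepsilon$, with $C$ balancing the two.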

Analysis Based on GBDT Model
GBDT is an ensemble machine learning algorithm based on the Boosting strategy proposed by Friedman (2001) [28]. It is an iterative algorithm consisting of multiple decision trees. The results obtained from each decision tree are accumulated to give the final result, and it is one of the best-performing methods in statistical learning. The specific implementation process of GBDT is as follows.
1. Initialise the learner.
2. For each tree $m = 1, 2, \cdots, M$ and each sample $i = 1, 2, \cdots, N$, calculate the corresponding negative gradient, i.e., the residual
$$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$$
where $L$ is the loss function, $f(x_i)$ is the predicted value of the weak learner, and $y_i$ is the true value of the ith sample.
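The two steps above can be sketched in plain Python. This is a minimal illustration and not the paper's implementation: it assumes squared-error loss (so the negative gradient is simply the residual), a single numeric feature with at least two distinct values, and depth-one trees; `fit_stump` and `gbdt_fit` are hypothetical names.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, judged by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def gbdt_fit(xs, ys, n_trees=50, learning_rate=0.1):
    f0 = sum(ys) / len(ys)            # step 1: constant initial learner
    trees = []
    preds = [f0] * len(xs)
    for _ in range(n_trees):          # step 2: one tree per boosting round
        # For squared loss the negative gradient is the residual y - f(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    # The final prediction accumulates the scaled output of every tree.
    return lambda x: f0 + learning_rate * sum(t(x) for t in trees)
```

Each round fits a tree to what the current ensemble still gets wrong, which is why the accumulated output approaches the targets as rounds are added.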

Analysis Based on XGBoost Model
XGBoost is a supervised learning model that combines classifiers with lower accuracy into a model with higher accuracy, mainly to reduce the model error [29]. It is built from multiple CART trees and has a strong generalisation ability. The algorithm uses the tree as a base classifier, expands the loss function to the second order by Taylor's formula, and adds a regularisation term to the objective function, thus avoiding an overly complex model. The objective function is
$$\mathrm{Obj} = L(x) + E(x)$$
where $L(x)$ is the loss function, which measures how well the model fits the training dataset, and $E(x)$ is the regularisation term, which measures the complexity of the model. If the model generates $J$ trees, the base learner model is
$$\hat{y}_i = \sum_{j=1}^{J} f_j(x_i), \quad f_j \in F$$
where $f_j$ is one of the base classifiers in $F$, $J$ is the number of base classifiers, and $F$ is the set of all base classifiers. The objective function of the XGBoost model is then
$$\mathrm{Obj} = \sum_{i} L(y_i, \hat{y}_i) + \sum_{j=1}^{J} E(f_j)$$
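The second-order Taylor expansion mentioned above can be written out as follows. This is the form usually given in the XGBoost literature, shown here as a sketch; the round index $t$, the sample count $n$, and the symbols $g_i$ and $h_i$ are introduced only for illustration and do not appear in the text.

```latex
\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}
  \Bigl[ L\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i\, f_t(x_i)
         + \tfrac{1}{2}\, h_i\, f_t(x_i)^{2} \Bigr] + E(f_t),
\qquad
g_i = \frac{\partial L\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^{2} L\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^{2}}
```

Because the first term inside the bracket does not depend on the new tree $f_t$, only the gradient and curvature terms and the regularisation term matter when the tree for round $t$ is chosen.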

Analysis Based on RF Model
RF is an ensemble learning method proposed by Breiman [30] which is commonly used for classification, regression, and other machine learning tasks. The principle is to construct a large number of decision trees during training, each grown independently of the others; when a new sample enters the algorithm, each decision tree makes a judgment separately and identifies which category the sample should belong to. The sample is then assigned to a category according to the votes of the trees. The flow of the algorithm is shown in Figure 3. RF is a robust ensemble algorithm based on bagged decision trees with random split selection, which effectively corrects the overfitting problem of a single decision tree.
Atmosphere 2024, 15, x FOR PEER REVIEW

Analysis Based on NN Model
NN is a computational model inspired by the human nervous system. It consists of many interconnected neurons that transmit and process information through weights. A neural network usually consists of multiple layers, including input, hidden, and output layers. It is often used to solve complex pattern recognition and machine learning problems. The input layer receives raw data, the hidden layer processes and extracts features from the input data, and the output layer generates the final output. The connection weight of each neuron in the network determines how much the input affects the output. In Figure 4, $x_i$ ($i = 1, 2, 3$) is the value of the input layer, and $a_i^{(k)}$ denotes the activation value of the ith neuron (the output of that neuron) in the kth layer.
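A forward pass through such a network can be sketched as below. The sigmoid activation, the nested-list weight layout, and the function names are assumptions made for this illustration; the text does not state which activation or architecture was used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """layers: list of (weights, biases); weights[i][j] links input j to neuron i.

    Returns the activations of every layer, with the input layer first.
    """
    a = list(x)
    activations = [a]
    for weights, biases in layers:
        # Each neuron applies the activation to its weighted input plus bias.
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + b)
             for row, b in zip(weights, biases)]
        activations.append(a)
    return activations
```

Here `activations[k][i]` plays the role of the activation value of the ith neuron in the kth layer, with both indices counted from zero.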

Analysis Based on LSTM Model
Hochreiter and Schmidhuber proposed the LSTM model in the late 1990s [31]; it is a variant of the traditional RNN. Compared with the traditional neural network model, LSTM can deal with the long-term dependence problem of temporal data and, at the same time, avoid the vanishing gradient problem. It can effectively capture long-term dependencies in sequence data by introducing mechanisms such as memory units, input gates, forget gates, and output gates combined with the error back-propagation algorithm. The structural flowchart of the LSTM model is shown in Figure 5.

Forecast Results and Comparative Analysis
After preprocessing the data, the AQI values were predicted on the test set by using the above models, and the actual and predicted values are shown in Figure 6. It can be seen that the models have similar prediction trends for the AQI and that the predicted values are close to the true values, indicating that the models capture the relationships between the AQI and the concentrations of the six pollutants well.

Model Evaluation
In order to assess the effectiveness of different models in fitting the AQI, four evaluation indexes, namely, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and Goodness-of-Fit (R-squared) were selected to evaluate the results of the six model fits in this paper.

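$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$
$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
$$\mathrm{MAPE}(y, \hat{y}) = \frac{100\%}{n}\sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$$
$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $n$ is the number of samples, $y_i$ is the AQI observation of the ith sample, $\hat{y}_i$ is the AQI prediction of the ith sample, and $\bar{y}$ is the sample mean. Smaller MAE, RMSE, and MAPE values and a larger R-squared value mean better prediction. The fitting results are shown in Table 2. From the table, one can see that the MAE and MAPE values obtained with the XGBoost model are the smallest and that its R-squared value is the largest. The XGBoost model therefore predicts the AQI for Xi'an, China, with the smallest error and the best fit, so the AQI prediction for the next 15 days is based on this model.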
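The four indexes can be computed directly from the observed and predicted values. The following is a small self-contained sketch using the standard definitions; the function name `regression_metrics` and the returned dictionary keys are choices made for this example.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (in percent), and R-squared for two sequences."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    y_bar = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)      # total sum of squares
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        # MAPE is undefined when an observed value is zero.
        "MAPE": 100.0 / n * sum(abs(e / y) for e, y in zip(errors, y_true)),
        "R2": 1.0 - ss_res / ss_tot,
    }
```

Note that MAPE divides by the observed value, so it is only meaningful when no observation is zero, which holds for AQI data in practice.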

AQI Prediction Based on Bootstrap-XGBoost
The Bootstrap method is a resampling method in statistics that is used to give the standard error of prediction and the prediction interval in predictive analysis. In this paper, time-series data were considered, so the residual Bootstrap method was used, and the prediction in each cycle of the Bootstrap method adopted the XGBoost model. The specific algorithm flow is shown in Algorithm 1.

Algorithm 1: The Bootstrap-XGBoost algorithm
1: Input: Dataset $(x_t, y_t)$; $B$: Bootstrap sample size
2: Output: The $B$ Bootstrap prediction results
3: Fit a model to the preprocessed data: $y_t = f(x_t) + \varepsilon_t$
4: Predict on the preprocessed data: $\hat{y}_t = \hat{f}(x_t)$
5: Calculate the prediction residuals: $\hat{\varepsilon}_t = y_t - \hat{y}_t$
6: Residual Bootstrap step:
7: Set the number of resampling times $B$
Based on the proposed Bootstrap-XGBoost algorithm, the standard deviation and 95% prediction intervals of the predicted AQI for the next 15 days in Xi'an, China, were obtained by setting $B = 500$, as shown in Table 3.
Table 3 and Figure 7 show that the actual values generally fall within the 95% prediction intervals. The prediction is accurate for most of the days, which indicates that the prediction of the short-term future AQI using the Bootstrap-XGBoost method has a high degree of reliability and accuracy, whereas prediction over longer periods needs to take other influencing factors into account.
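A residual Bootstrap of this kind can be sketched as follows. The sketch makes several assumptions that are not in the text: a least-squares line (`fit_line`) stands in for the XGBoost fit so that the example needs only the standard library, the interval is taken from percentiles of the Bootstrap predictions, and one resampled residual is added to each prediction so that the spread reflects observation noise as well as refitting variability. The steps of Algorithm 1 after the resampling count is set are not reproduced in the text, so this is one plausible completion and not the authors' procedure.

```python
import random
from statistics import mean, stdev


def fit_line(xs, ys):
    """Least-squares line; a stand-in for the XGBoost fit (assumption)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def residual_bootstrap(xs, ys, x_new, fit=fit_line, B=500, alpha=0.05, seed=0):
    """Return (point prediction, bootstrap std, lower bound, upper bound)."""
    rng = random.Random(seed)
    model = fit(xs, ys)
    fitted = [model(x) for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    draws = []
    for _ in range(B):
        # Pseudo-sample: fitted values plus residuals resampled with replacement.
        y_star = [f + rng.choice(residuals) for f in fitted]
        # Refit on the pseudo-sample, predict, and add one resampled residual.
        draws.append(fit(xs, y_star)(x_new) + rng.choice(residuals))
    draws.sort()
    lower = draws[int(alpha / 2 * (B - 1))]
    upper = draws[int((1 - alpha / 2) * (B - 1))]
    return model(x_new), stdev(draws), lower, upper
```

Any function that takes training inputs and targets and returns a predictor can be passed as `fit`, so a gradient-boosted model can be substituted for the line without changing the resampling logic.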

AQI Rank Assessment
The ordinal logit and probit models are generalised linear models used to establish the relationship between ordinal categorical variables and independent variables and are commonly used in biomedicine, socioeconomics, and machine learning.

Ordinal Logit Model and Ordinal Probit Model
In this section, the response variable, the AQI rank, is classified into six categories, namely, "Excellent", "Good", "Mild pollution", "Moderate pollution", "Severe pollution", and "Serious pollution". Based on the following two models, the AQI ranks for Xi'an, China, are evaluated and then predicted for the period from 1 October 2023 to 15 October 2023.

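Assume that the responses $y_i \in \{1, 2, \cdots, C\}$ are ordinal variables and that the predictor variables are $X = (x_1, x_2, \cdots, x_k)^T$. The ordinal logit regression model links the cumulative probabilities $P(y_i \le c)$ to the predictors through the logistic distribution function, and the ordinal probit regression model uses the standard normal distribution function in its place.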
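In one common parameterisation, the two models can be written as below. The threshold symbols $\alpha_c$, the coefficient vector $\beta$, and in particular the minus sign in front of $\beta^{T}X$ are assumptions of this sketch; some references and software packages use a plus sign instead, which flips the sign of the estimated coefficients.

```latex
\text{Ordinal logit:}\quad
P(y_i \le c \mid X) = \frac{\exp(\alpha_c - \beta^{T}X)}{1 + \exp(\alpha_c - \beta^{T}X)},
\qquad
\text{Ordinal probit:}\quad
P(y_i \le c \mid X) = \Phi(\alpha_c - \beta^{T}X),
\qquad c = 1, \dots, C-1
```

Here $\alpha_1 < \alpha_2 < \cdots < \alpha_{C-1}$ are ordered thresholds and $\Phi$ is the standard normal distribution function; under this sign convention, a positive coefficient shifts probability towards higher categories.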

Model Estimation Results
The results of parameter estimation using the above two models are shown in Table 4. Based on p-values of less than 0.05, it can be concluded that PM2.5, PM10, SO2, and O3 concentrations have a significant effect on the AQI rank, which means that the higher the concentrations of these pollutants, the more serious the air pollution and the higher the AQI rank. According to Equation (10) and Table 4, the probability of each category of the AQI rank for Xi'an, China, under the ordinal logit regression model can be obtained from the differences between consecutive cumulative probabilities.
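Turning cumulative probabilities into per-category probabilities can be sketched as follows. The thresholds and coefficients in any call are placeholders supplied by the reader, not the estimates in Table 4, and the sign convention (threshold minus linear predictor) is an assumption of this example.

```python
import math


def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probs(x, beta, thresholds, cdf=logistic_cdf):
    """Probabilities of the len(thresholds) + 1 ordered categories.

    Uses P(y <= c) = cdf(threshold_c - beta . x); pass cdf=normal_cdf for probit.
    """
    eta = sum(b * v for b, v in zip(beta, x))
    cumulative = [cdf(a - eta) for a in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)   # P(y = c) = P(y <= c) - P(y <= c - 1)
        previous = c
    return probs
```

The predicted rank is then the category with the largest probability, for example `probs.index(max(probs))`.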

AQI Ranking Forecast
The accuracy of predicting AQI ranks on the test set based on the ordinal logit regression and ordinal probit regression models developed in Sections 5.1 and 5.2 is shown in Table 5. As can be seen from Table 5, the prediction accuracy of both the ordinal logit regression and ordinal probit regression models is above 86%, which is high, so the AQI ranks for Xi'an, China, for the period from 1 October to 15 October 2023 were predicted using these two models. Table 6 shows that the predictions for the next 15 days based on the two models are entirely consistent with the actual results and that the prediction probabilities are mostly high, indicating that the two models predict the AQI ranks well.

Conclusions and Suggestions
In this paper, the prediction of the AQI from PM2.5, PM10, SO2, NO2, O3, and CO is investigated by building SVR, GBDT, XGBoost, RF, NN, and LSTM models and comparing the prediction results on the test set. It is found that the XGBoost model has the best prediction effect. We therefore propose a robust Bootstrap-XGBoost algorithm to give the AQI prediction values and 95% prediction intervals for the next 15 days in Xi'an, China, and the results show that most of the predicted values are close to the true values and that the prediction intervals cover more than 70% of the actual values. In addition, we fit the air quality classes based on ordinal logit and ordinal probit regression models and forecast the AQI classes from 1 October to 15 October 2023 in Xi'an, China. The results show that the prediction accuracy of both ordinal models is 100%.
Based on the above conclusions, this paper proposes some applicable recommendations and targeted initiatives to improve air quality in Xi'an.
(1) As Xi'an is one of China's megacities, preventing and controlling particulate pollution there has been crucial in recent years. While continuing to maintain emission reductions, the prevention and control of gaseous pollutants (especially O 3 and SO 2) should be strengthened. The main goals for the present and future are the synergistic prevention and control of atmospheric particulate matter and O 3 and the introduction of related policies and standards on VOC management, together with measures such as centralised heating, optimised motor vehicle travel restrictions, and increased vegetation cover. In addition, it is necessary to raise public awareness of air pollution prevention and control.
(2) The air in Xi'an is dry, and in spring and winter the soil has a low water content, so windy weather can easily carry dust into the air. The relevant government departments should strengthen the monitoring of dust from construction sites and material stacking sites in urban areas and urge construction units to cover bare ground in a timely manner. At the same time, regular urban sanitation work (picking up garbage, wiping down fences, cleaning roads, and spraying on schedule) should be carried out to keep the city clean and tidy.
(3) As Xi'an is one of the cities with a developed modern industrial system, it is essential to maintain the synergistic development of the green economy and create a better living environment by improving the resource utilisation rate, optimising the energy structure and industrial layout, continuing the comprehensive treatment and in-depth emission reduction of pollutants from multiple sources (coal combustion, industry, transportation, and biomass combustion), and increasing the use of renewable and alternative resources.

Atmosphere 2024
Figure 1. The research framework of this paper.


Figure 2. Stacked graph of air quality shares by season.


Figure 3. The flow chart of the RF algorithm.

Figure 4. The illustration of NN.


Figure 6. The plot of predicted vs. actual AQI values on the test set.

the k pollutant concentration indicators affecting the AQI, $\beta$ is the regression coefficient, $\alpha_1 \le \cdots \le \alpha_j$ are the model intercepts, and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.

Table 1. Basic characteristics of the data.


Table 2. Comparison of model fitting results.
by resampling $\varepsilon_t$; the new sample is fitted via the XGBoost algorithm: $\{x_t, y_t^{(b)}\} \to \hat{y}_t^{(b)} = \hat{f}_b(x_t)$
13: Based on the series of $\hat{y}_t^{(b)}$, the prediction standard deviation and 95% prediction interval are calculated


Table 5. Prediction accuracy on test set.

Table 6. Predicted results of AQI ranks.