Air Quality—Meteorology Correlation Modeling Using Random Forest and Neural Network

Under the global warming trend, the diffusion of air pollutants has intensified, causing extremely serious environmental problems. To improve the prediction accuracy of the air quality–meteorology correlation model, this work focuses on the management strategy of the environmental ecosystem under Artificial Intelligence (AI) algorithms and explores the correlation between air quality and meteorology. Xi’an city is selected as an example. First, the theory of Random Forest (RF), Backpropagation Neural Network (BPNN), and Genetic Algorithm (GA) in AI is explained. Then, GA is used to optimize the weights and thresholds of the BPNN. Further, a fusion model of RF + BP + GA is proposed to predict the air quality and meteorology correlation. The proposed air quality–meteorology correlation model is applied to forest ecosystem management. Experimental analysis reveals that average temperature positively correlates with the Air Quality Index (AQI), while relative humidity and wind speed negatively correlate with AQI. Moreover, the prediction error of the proposed RF + BP + GA model for AQI is not more than 0.32, showing an excellent fit to the actual value. The AQI predicted by the meteorological correlation model using RF alone is slightly lower than the measured value, and the AQI predicted by the BP–GA model is slightly higher than the measured value. The prediction of the air quality–meteorology correlation model combining RF and BP–GA is the closest to the measured value. This shows that, among the three models, the fusion model of RF and BP–GA predicts AQI most accurately. This work provides a research reference regarding the AQI value of the correlation model of air quality and meteorology and provides data support for the analysis of air quality problems.


Introduction
In recent years, China's economy has developed rapidly and made remarkable achievements. The accompanying environmental problems, such as Air Pollution (AP), have also attracted public attention [1]. Research demonstrates that China's air quality varies greatly in spatial distribution, across seasons, and from year to year. From the perspective of spatial distribution, national air quality shows obvious spatial agglomeration and differentiation: pollution is heavier in the north and east, and air quality is better in the south and west. For example, air pollution in Beijing, Tianjin, Hebei, and surrounding areas is relatively high. The southern coastal areas, such as the Pearl River Delta, and western regions, like the Yunnan-Guizhou Plateau and the Qinghai-Tibet Plateau (QTP), mostly enjoy excellent air quality throughout the year. From the viewpoint of seasonal changes, the air quality is generally good in summer, followed by spring and autumn, and [...] promoting effect of temperature on forests in the Greater Khingan Mountains will disappear, and the total productivity will decrease as a result of drought and fire. Seasonal drought across regions increases forest mortality risk and limits forest productivity growth. Climate change has also given rise to extreme natural events, such as fires, pests and diseases, and hurricanes, and to an increased risk of forest degradation and mortality. Climate change is a major challenge worldwide. Thus, curbing climate change requires the joint efforts of all humanity. Forests are powerful carbon sinks. Vigorously carrying out afforestation can effectively mitigate climate change and enable people to participate in combating it.
Random Forest (RF) is mainly used for regression and classification. RF algorithms have been applied in various fields by many scholars in and out of China. In the field of image science, image processing research for a large number of scenes includes image recognition and classification, target detection, and other subdivision directions [12]. Of these, image processing segments large images into small images. Through pixelization, a single image is divided into multiple small-pixel images, thus turning the task into a classification problem. In genetics, the high dimensionality of massive gene data leads to a low recognition ability for disease genes when traditional data analysis methods are used [13]. Some scholars use RF algorithms to study gene recognition and protein interaction. In the financial field, the RF algorithm plays an important role in image recognition, anomaly detection, and text classification [14].
AP is closely related to the type and distribution of pollution sources, meteorological conditions, and topography. However, because the landform and pollution sources of a specific area are relatively stable within a certain time range, the geographical environment, economic environment, and meteorological conditions are the key determinants of the concentration distribution of urban air pollutants. Many scholars have studied the relationship between air pollutants and meteorological conditions. Lolli et al. [15] analyzed the relationship between main pollutants and meteorological conditions and found that the northwest wind was more conducive to the diffusion of pollutants in cities, and that precipitation, relative humidity, and wind speed had a significant impact on air quality. Ceylan et al. [16] used the Pearson Correlation Coefficient (PCC) and Spearman Correlation Coefficient (SCC) to analyze the correlation between air quality and environmental factors such as temperature and humidity. The results showed that environmental factors affected air quality. Zhou et al. [17] studied the impact of waste and industrial pollutants on developing nations' air quality and revealed that industrial waste was a primary cause of air pollution. Gan et al. [18] used correlation analysis to analyze the correlation between automobile waste emissions, industrial construction, industrial development, and urban green land occupation. They found that industrial waste contributed most significantly to air pollution.
In the research on Machine Learning (ML) and air quality, Shahriar et al. [19] predicted PM2.5 in the air with Decision Tree (DT) and Support Vector Machine (SVM) models. The prediction results were affected by the concentration of PM2.5 and by meteorological conditions such as precipitation and wind speed. The research showed that DT and SVM models could forecast the concentration of PM2.5 in the air. Gocheva-Ilieva et al. [20] predicted air quality with Random Forest (RF), SVM, and Classification and Regression Tree (CART) models. They compared the prediction performance of the three ML algorithms and found that the RF algorithm was the most accurate in predicting meteorological factors and air pollutants. Menéndez García et al. [21] designed an RF model to forecast air quality in Switzerland from meteorological factors and air pollutants. They showed that RF had excellent accuracy in air quality prediction. Liu et al. [22] used a Backpropagation Neural Network (BPNN) learning algorithm to predict air quality in Greece. The research showed that the BP-predicted value was very close to the measured air quality value. Li et al. [23] devised BPNN and Long Short-Term Memory (LSTM) models to forecast air quality in Poland and India. The results indicated that BPNN and LSTM models could accurately forecast regional air quality.
Ogeneovo et al. [24] applied an ID3 (Iterative Dichotomizer 3) DT algorithm to environmental monitoring data and achieved good results. Althuwaynee et al. [25], using the DT algorithm, selected two combined attributes in the test attribute similarity as the information-gain correction factor for evaluating air quality. Singh et al. [26] used the Support Vector Machine (SVM) to classify and predict major urban environmental data in China. Eldakhly et al. [27] used grey theory and SVM to predict PM2.5 concentration. Jiang et al. [28] carried out parameter optimization and model improvement on the SVM algorithm and improved the air quality early warning model based on a combination algorithm and Particle Swarm Optimization (PSO), making the evaluation results of the improved algorithm more accurate. Bai et al. [29] applied the rough set method to air quality evaluation and extracted its rules. The research results showed that rough set theory was an effective tool for knowledge reasoning and expert system establishment.
According to the above scholars' findings, air quality is correlated with meteorological conditions, air pollutants, and industrial development. Studies of the impact of industrial waste emissions or of environmental factors on air quality all involve studying how air quality changes. At the same time, RF and neural network models have been successfully applied to air quality prediction [30]. These results provide methodological guidance for exploring the air quality-meteorology correlation model. However, the prediction accuracy of a traditional single algorithm on nonlinear air quality data is not high, and the model's generalization ability is weak. High-dimensional data easily causes modeling failure, so the air quality-meteorology correlation cannot be fully analyzed, resulting in low model performance. Since both RF and NN models have successfully predicted air quality, ML algorithms will be integrated here to establish the air quality-meteorology correlation model. The integrated ML algorithm can process high-dimensional data without feature screening and offers high prediction accuracy, good generalization ability, and strong anti-interference ability. When missing values exist in the data, the prediction accuracy can still be maintained.
This work aims to obtain a high-accuracy air quality-meteorology correlation model while providing a survival guarantee for forest ecosystem species and implementation strategies for forest ecosystem management. First, this work discusses the theory of the RF and BPNN models and uses a Genetic Algorithm (GA) to optimize the weights and thresholds of the BPNN. Then, the effects of climate change on the forest ecosystem are illustrated. Finally, the impact of meteorological conditions, such as temperature, humidity, and wind level, on air quality is analyzed. Some suggestions for sustainable development management strategies for forest ecosystems under climate change are also put forward. The AQI results of the RF and Backpropagation-Genetic Algorithm (BP-GA) models are analyzed. The RF algorithm, BPNN model, and GA are combined: the GA optimizes the network weights and thresholds of the BPNN, the dimension of the input factors is reduced, the interference of irrelevant dimensions is reduced, and the urban air quality-meteorology correlation model is thereby optimized. The proposed model differs from previous fusion models for air quality prediction. On the one hand, it improves the prediction accuracy and provides a research reference for AQI prediction using the air quality-meteorology correlation model. On the other hand, it combines meteorology and air quality factors in the fusion model prediction, which provides data support for the analysis of air quality problems. The innovation lies in proposing the RF + BP + GA fusion model and establishing a feature selection method suitable for evaluating the air quality-meteorology correlation model. This research goes beyond air quality prediction with a traditional single algorithm. The combined prediction model of RF, neural network, and GA predicts air quality with higher accuracy. It can not only predict the AQI but also provide an implementation strategy basis for forest ecosystem management.
This work is divided into five parts. Section 1, the introduction, mainly describes the research background; reviews the research by domestic and foreign scholars on the relationship between the air quality model, meteorological conditions, and machine learning; and explains the necessity of this research. It also describes the process, significance, innovation, and framework of the research method. Section 2, regarding the research methods, mainly introduces the theory of the RF algorithm, BP algorithm, and GA. An air quality and meteorology correlation fusion model based on RF + BP + GA is proposed, and the role of forests in climate regulation and the source of the experimental data are explained. Section 3 analyzes the research results from the perspective of the changing trend of AQI and meteorological conditions and the interaction between the climate conditions correlation model and the forest ecosystem. Section 4 is the conclusion, including the research results, research significance, and future prospects.

Experimental Data
Pollutant emissions will not change greatly in the short term for a particular area. However, the ambient air quality will change considerably. The monitoring results of pollutant concentrations from the same emission source at the same place are not necessarily the same. Sometimes a very high concentration is measured and sometimes a very low one, with significant differences. The main reason for this variation is a change in weather conditions, which alters the ability of the atmosphere to transport, dilute, transform, and remove pollutants [31]. The migration and diffusion laws of pollutants vary with meteorological conditions. This section collects the air quality of Xi'an in Shaanxi Province and uses the real-time air quality detection data from Xi'an from 24 to 30 June 2022 as research data. The data from 18 June 2022 to 24 June 2022 are used to predict the AQI. The data comes from the Air Quality Implementation and Release System in Shaanxi Province.
AQIs can describe the air quality within the new ambient air quality standard. Common AQIs include the Air Pollution Index (API), Oak Ridge Air Quality Index (ORAQI), Extreme Value Index (EVI), and Pollutant Standard Index (PSI) [32,33]. Table 1 exhibits the applicable range of different evaluation indices. According to the evaluation scope and applicable range of the indices in Table 1, AQI is selected to evaluate the air quality-meteorology correlation model in Xi'an. AQI is dimensionless and quantitatively describes air quality. Air-quality evaluation mainly involves six pollutants: fine particles, inhalable particles, sulfur dioxide, nitrogen dioxide, ozone, and carbon monoxide [34]. Equation (1) gives the specific calculation:
$$\mathrm{IAQI}_P = \frac{\mathrm{IAQI}_{Hi} - \mathrm{IAQI}_{Lo}}{BP_{Hi} - BP_{Lo}}\,(C_P - BP_{Lo}) + \mathrm{IAQI}_{Lo} \quad (1)$$
In Equation (1), $\mathrm{IAQI}_P$ refers to the air quality subindex of pollutant item $P$. $C_P$ is the mass concentration of pollutant item $P$. $BP_{Hi}$ and $BP_{Lo}$ are the upper and lower concentration breakpoints that bracket $C_P$ in the breakpoint table of the corresponding pollutant. $\mathrm{IAQI}_{Hi}$ and $\mathrm{IAQI}_{Lo}$ are the air quality subindex values corresponding to $BP_{Hi}$ and $BP_{Lo}$, respectively. The maximum $\mathrm{IAQI}$ over the pollutants is then taken as the AQI. When AQI > 50, the pollutant with the largest $\mathrm{IAQI}$ is determined as the primary pollutant [35]. Figure 1 illustrates the changes in the AQI index value in Xi'an from 24 to 30 June 2022.
In Figure 1, Xi'an had the lowest AQI and the best air quality on 26 June 2022. Except for 26 June 2022, Xi'an had certain air quality issues that week. The AQI on 24 and 30 June was the highest, reaching 166 and 161, respectively. Therefore, the air pollution in Xi'an is still serious: most of the time, the air is polluted. The analysis indicates that this air quality is related to the continuous high temperature in Xi'an since June, which changes the proportion of fine particles, inhalable particles, sulfur dioxide, nitrogen dioxide, ozone, and carbon monoxide in the air.
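The subindex calculation and the AQI rule described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the usual piecewise-linear form IAQI_P = (IAQI_Hi - IAQI_Lo) / (BP_Hi - BP_Lo) * (C_P - BP_Lo) + IAQI_Lo. The breakpoint table is an illustrative assumption (values commonly cited for 24-hour PM2.5 in µg/m³; check them against the official standard before use), and the names `sub_index` and `aqi` are hypothetical.

```python
from bisect import bisect_left

# ASSUMPTION: illustrative breakpoints for 24-hour PM2.5 (µg/m³) and the
# sub-index values paired with them; verify against the official table.
PM25_BREAKPOINTS = [0, 35, 75, 115, 150, 250, 350, 500]
IAQI_LEVELS = [0, 50, 100, 150, 200, 300, 400, 500]


def sub_index(c_p, breakpoints=PM25_BREAKPOINTS, levels=IAQI_LEVELS):
    """Linearly interpolate the sub-index of one pollutant from its concentration."""
    if not breakpoints[0] <= c_p <= breakpoints[-1]:
        raise ValueError("concentration outside the breakpoint table")
    hi = max(1, bisect_left(breakpoints, c_p))  # index of BP_Hi
    lo = hi - 1                                 # index of BP_Lo
    span = breakpoints[hi] - breakpoints[lo]
    return (levels[hi] - levels[lo]) * (c_p - breakpoints[lo]) / span + levels[lo]


def aqi(sub_indices):
    """Return (AQI, primary pollutant); the primary pollutant is named only if AQI > 50."""
    primary = max(sub_indices, key=sub_indices.get)
    value = sub_indices[primary]
    return value, (primary if value > 50 else None)


if __name__ == "__main__":
    pm25 = sub_index(55)  # falls between the 35 and 75 breakpoints
    print(pm25, aqi({"PM2.5": pm25, "O3": 40.0}))
```

Each pollutant would need its own breakpoint list; only one is shown here to keep the sketch short.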

Random Forest (RF) Algorithm Theory
An RF is a forest constructed in a random way and consists of many DTs. The DTs in an RF are independent of each other. After the forest is trained, when new data is input, all the DTs in the RF make their predictions independently. For classification problems, the prediction is determined by counting the votes of the DTs; for regression problems, the average of all DT predictions is usually taken as the final RF prediction. The method builds on Bagging: on top of the random sampling of training samples used by Bagging, the RF also selects features at random [36].
In Figure 2, the RF builds the forest randomly. Multiple DTs exist independently in the forest. Each DT judges every new input of the RF, and the class selected most often is the forecast class of the sample. Sampling and complete splitting are the two crucial processes in building a DT. First, the RF uses two rounds of random sampling, over the rows and the columns of the input data. Row sampling is done with replacement, so the sample set may contain duplicates. Assuming there are $m$ input samples, $m$ samples are drawn. In this way, each DT does not see all samples during training, which helps avoid overfitting. Second, column sampling is performed: $n$ ($n \le N$) features are selected from the $N$ features. The DT is then built by completely splitting the sampled data until a leaf node cannot be split further or all samples in the leaf node belong to the same category. These two random sampling steps provide the randomness, so overfitting is unlikely even without pruning [37,38].
In DT-based classification, information gain measures how much an attribute reduces uncertainty. Equation (2) expresses the information quantity:
$$I(X = x_i) = -\log_2 p(x_i) \quad (2)$$
In Equation (2), $X$ represents the event set, $x_i$ ($i = 1, 2, \cdots, n$) is an event category of $X$, $I$ denotes the amount of information, and $p(x_i)$ indicates the probability of occurrence of $x_i$. The less probable $x_i$ is, the more information its occurrence carries. The expected value of the information is the entropy. Equations (3) and (4) calculate the information entropy:
$$H(X) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i) \quad (3)$$
$$H(Y|X) = \sum_{i=1}^{n} p(x_i)\,H(Y|X = x_i) \quad (4)$$
In Equations (3) and (4), $H(Y|X = x_i)$ represents the entropy of $Y$ under the condition $X = x_i$, and $H(X)$ is the information entropy of the random variable $X$. This work chooses C4.5 for the DT-based classification or regression problems and uses the Information Gain Ratio (IGR) to select attributes. The specific calculation reads:
$$Gain(Y, X) = H(Y) - H(Y|X) \quad (5)$$
$$SplitInformation(Y, X) = -\sum_{i=1}^{c} \frac{|Y_i|}{|Y|}\log_2\frac{|Y_i|}{|Y|} \quad (6)$$
$$GainRatio(Y, X) = \frac{Gain(Y, X)}{SplitInformation(Y, X)} \quad (7)$$
Here, $Gain(Y, X)$ is the gain measure and $SplitInformation(Y, X)$ represents the split information measure. $Y_1 \sim Y_c$ denote the sample subsets of $Y$ formed by the $c$ values of the attribute, and $X$ means the sample attribute. The classification results of the established RF are calculated with Equation (8). In Equation (8), $H(x)$ represents the final model of the RF. $W$ is the DT classification model, which can be the C4.5 or CART algorithm. $h_i(x)$ denotes the classification model of each DT, and $Y$ stands for the classification result of $h_i(x)$. The RF can process very high-dimensional data (with many features). After training, the RF can output the importance of features. Figure 3 displays the importance of sample features in RF training.
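The entropy and gain-ratio quantities discussed above can be sketched as follows. This is a minimal illustration, assuming the standard C4.5 definitions (entropy with base-2 logarithms, gain as the drop in label entropy after splitting on an attribute, and split information as the entropy of the attribute's own value distribution). The function names and the toy labels are hypothetical and are not taken from the paper's data.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(attribute, labels):
    """Entropy of the labels after partitioning the samples by attribute value."""
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def gain_ratio(attribute, labels):
    """Information gain divided by split information (0.0 if the attribute is constant)."""
    gain = entropy(labels) - conditional_entropy(attribute, labels)
    split_information = entropy(attribute)
    return gain / split_information if split_information > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: an air-quality class and two candidate attributes.
    labels = ["good", "good", "bad", "bad"]
    wind = ["high", "high", "low", "low"]  # separates the classes perfectly
    day = ["odd", "even", "odd", "even"]   # carries no information about the class
    print(gain_ratio(wind, labels), gain_ratio(day, labels))
```

In a C4.5-style tree, the attribute with the largest gain ratio would be chosen as the splitting node.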
Figure 3 has four sample features: A, B, C, and D. Suppose feature B is replaced by noise B1 and the model is rebuilt; Error 1 and Error 2 denote the classification error rates before and after the replacement. When the difference between Error 1 and Error 2 is small, the importance of feature B is low. When Error 2 is much larger than Error 1, feature B has a greater impact on the classification results. Table 2 lists the algorithm flow of the RF:

Start: Test sample $X_{test}$.
Sampling: The original data is sampled with the Bagging algorithm.
Data set: Select training subset $S_j$ as the dataset of the $j$-th DT.
Calculate all node attributes of information gain: Select all $m$ attributes of the node. Calculate the information gain index of the $m$ attributes.
DT splitting: Select the attribute with the largest information gain as the classification node. Obtain $N$ DTs.
Output: Output classification results.
End.
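The sampling and aggregation steps of the flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: tree construction itself is omitted, and the helper names (`bootstrap_rows`, `sample_features`, `majority_vote`, `mean_prediction`) are hypothetical.

```python
import random
from collections import Counter


def bootstrap_rows(m, rng):
    """Row sampling with replacement: draw m row indices from m samples."""
    return [rng.randrange(m) for _ in range(m)]


def sample_features(n_total, n_keep, rng):
    """Column sampling without replacement: keep n_keep of n_total feature indices."""
    return sorted(rng.sample(range(n_total), n_keep))


def majority_vote(predictions):
    """Classification: the class predicted by the most trees wins."""
    return Counter(predictions).most_common(1)[0][0]


def mean_prediction(predictions):
    """Regression: average the predictions of all trees."""
    return sum(predictions) / len(predictions)


if __name__ == "__main__":
    rng = random.Random(0)              # fixed seed so the sketch is repeatable
    rows = bootstrap_rows(10, rng)      # duplicates are possible by design
    cols = sample_features(6, 3, rng)   # each tree sees only a feature subset
    print(rows, cols)
    print(majority_vote(["polluted", "good", "polluted"]))
    print(mean_prediction([100.0, 110.0, 120.0]))
```

Each tree would be trained on its own `rows` and `cols`, and the per-tree outputs combined with one of the two aggregation helpers depending on the task.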

Theoretical Knowledge of the BP-GA NN
A BPNN is mainly composed of forward propagation and backpropagation. The network with n input nodes and m output nodes is regarded as a mapping from an n-dimensional Euclidean space to an m-dimensional one. Through the principle of least squares and gradient search, the weights and thresholds are continually learned and adjusted. If the output of the output layer differs significantly from the expected value, the error signal is sent back to each layer of the network, and the mean square error between the actual output and the expected output is minimized by modifying the weights and thresholds of the neurons in each layer. The BPNN is a Multilayer Feedforward Neural Network (MLFNN) with some characteristic properties: the neurons of each layer are completely interconnected with the neurons of the next layer, no same-layer neuron connections exist, and no cross-layer neuron connections exist [39]. Figure 4 presents the connection structure between BPNN neurons.
Each input sample is described with $d$ attributes, and the network outputs an $l$-dimensional real-valued vector. For the convenience of discussion, Figure 4 shows a multilayer feedforward network structure with $d$ input neurons, $l$ output neurons, and $q$ hidden-layer neurons. The threshold of the $j$-th neuron in the output layer is $\theta_j$, and $\gamma_h$ indicates the threshold of the $h$-th neuron in the hidden layer. The connection weight between the $i$-th neuron in the input layer and the $h$-th neuron in the hidden layer is $w_{ih}$. The connection weight between the $h$-th neuron in the hidden layer and the $j$-th neuron in the output layer is $w_{hj}$ [40]. The output of the $h$-th neuron in the hidden layer is marked as $y_h$. The input received by the $h$-th neuron of the hidden layer, $\alpha_h$, and the input received by the $j$-th neuron of the output layer, $\beta_j$, are expressed with Equations (9) and (10), where $x_i$ is the $i$-th input attribute:
$$\alpha_h = \sum_{i=1}^{d} w_{ih}\,x_i \quad (9)$$
$$\beta_j = \sum_{h=1}^{q} w_{hj}\,y_h \quad (10)$$
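The forward pass through such a network can be sketched as follows. This is a minimal illustration, assuming that each neuron receives the weighted sum of the previous layer's outputs, subtracts its threshold, and applies a sigmoid activation; the activation choice and the function names are assumptions, since the text does not specify them.

```python
from math import exp


def sigmoid(z):
    """Logistic activation (an assumed choice of activation function)."""
    return 1.0 / (1.0 + exp(-z))


def forward(x, w_in, gamma, w_out, theta):
    """One forward pass through a single-hidden-layer feedforward network.

    x      : d input attributes
    w_in   : d x q weights, w_in[i][h] links input i to hidden neuron h
    gamma  : q hidden-layer thresholds
    w_out  : q x l weights, w_out[h][j] links hidden neuron h to output j
    theta  : l output-layer thresholds
    """
    d, q, l = len(x), len(gamma), len(theta)
    # Input received by each hidden neuron: weighted sum of the inputs.
    alpha = [sum(w_in[i][h] * x[i] for i in range(d)) for h in range(q)]
    # Hidden outputs: activation of (received input - threshold).
    y = [sigmoid(alpha[h] - gamma[h]) for h in range(q)]
    # Input received by each output neuron: weighted sum of hidden outputs.
    beta = [sum(w_out[h][j] * y[h] for h in range(q)) for j in range(l)]
    out = [sigmoid(beta[j] - theta[j]) for j in range(l)]
    return alpha, y, beta, out


if __name__ == "__main__":
    # Hypothetical 2-2-1 network with hand-picked weights.
    alpha, y, beta, out = forward(
        x=[1.0, 2.0],
        w_in=[[0.5, -0.5], [0.25, 0.25]],
        gamma=[1.0, 0.0],
        w_out=[[1.0], [1.0]],
        theta=[1.0],
    )
    print(alpha, y, beta, out)
```

Backpropagation would then adjust `w_in`, `gamma`, `w_out`, and `theta` to reduce the squared error between `out` and the expected output.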

Next, to make the BPNN prediction more accurate, GA is used to optimize the BPNN. GA has good global search ability and can search the solution space quickly without falling prematurely into locally optimal solutions. GA is a computational model that simulates natural selection and the genetic mechanism of Darwin's theory of biological evolution. It finds the optimal solution by simulating the natural evolution process. Genetic operations imitate biological genetics. In a GA, after the initial population is formed by coding, the task of the genetic operations is to act on the individuals in the population according to their environmental fitness (fitness evaluation), thereby realizing the evolutionary process of survival of the fittest. From the perspective of optimization search, the genetic operations improve the solution to the problem generation after generation and approach the optimal solution. Figure 5 outlines the initialization operation, exchange operation, and mutation operation processes of GA.
In Figure 5, the initialization operation encodes the population chromosomes. The selection operation aims to make the solution group survive and evolve and to improve the group's convergence speed and search efficiency. The exchange operation generates new individuals, expands the solution search space, and improves the global search ability of the algorithm. The mutation operation is very important for optimization and evolution, since it helps avoid the trap of locally optimal solutions. Mutation is carried out bitwise: the content of a bit is changed. The mutation operation can be performed after the exchange operation: one of a pair of individuals is randomly selected and then mutated according to the mutation probability. All individuals in the population encode network weights and thresholds of the BPNN [41,42]. After the network structure is determined, a rough NN structure can be established to forecast air quality. The initial weights and thresholds of the BPNN are then obtained from the individuals in the population. The training parameters are determined, and the BPNN is trained to make forecasts [43]. The GA can be expressed with Equation (11):
$$SGA = (C, E, P_0, M, \varphi, \Gamma, \psi, T) \quad (11)$$

In Equation (11), $C$ represents the individual coding scheme in the population. $E$ is the individual fitness evaluation function. $P_0$ denotes the initial population. $M$ indicates the population size. $\varphi$ stands for the selection operator. $\Gamma$, $\psi$, and $T$ mean the crossover operator, the mutation operator, and the GA's termination condition, respectively. SGA stands for the Simple Genetic Algorithm. The fitness of each solution is evaluated according to the individual fitness value, and a new population is generated. The individual fitness value of the population is calculated with Equation (12).
In Equation (12), n represents the number of network output nodes.y i is the actual measured value by the NN node i. o i and k denote the actual output of node i and the coefficient.Here, k = 0.2, and n = 5.Equations ( 13) and ( 14) calculate the individual selection operation: According to Equations ( 13) and ( 14), p i is the probability that each individual in the population is selected.i denotes the individual in the population.F i represents the individual fitness value.M means the number of individuals in the population.Here, M = 30.Equations ( 15) and ( 16) express individual crossover operations: In Equations ( 15) and ( 16), a kj is the cross-calculation of chromosome a kj at position j. a lj denotes the cross-calculation of the first chromosome at position j.b represents a random number ∈ [0, 1].The individual mutation operation is demonstrated in Equations ( 17) and ( 18): 17) In Equations ( 17) and ( 18), a ij , a min , and a max are the gene, the minimum a ij , and maximum a ij , respectively.r represents a random number ∈[0, 1].r 2 , g, and G max mean the random number, iteration number, and the maximum number of evolutions, respectively.To sum up, the flow of the GA-optimized BPNN is shown in Figure 6.
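The fitness, selection, crossover, and mutation steps described above can be illustrated with a short Python sketch. The exact expressions below are assumptions based on forms commonly used for real-coded GA-optimized BP networks; they are not a reproduction of Equations (12) to (18). The function names are hypothetical, and only k = 0.2 and M = 30 are taken from the text.

```python
K = 0.2   # fitness coefficient k (value stated in the text)
M = 30    # population size M (value stated in the text)

def fitness(measured, outputs, k=K):
    """Assumed form: k times the summed absolute error over the output nodes."""
    return k * sum(abs(y - o) for y, o in zip(measured, outputs))

def selection_probabilities(fitness_values, k=K):
    """Assumed roulette-wheel form: a smaller error gives a larger probability."""
    # A tiny constant guards against division by zero for a perfect individual.
    inverse = [k / (f + 1e-12) for f in fitness_values]
    total = sum(inverse)
    return [v / total for v in inverse]

def crossover(chrom_k, chrom_l, j, b):
    """Assumed arithmetic crossover of two real-coded chromosomes at position j."""
    a_kj, a_lj = chrom_k[j], chrom_l[j]
    chrom_k[j] = a_kj * (1 - b) + a_lj * b
    chrom_l[j] = a_lj * (1 - b) + a_kj * b

def mutate(a_ij, a_min, a_max, r, r2, g, g_max):
    """Assumed bounded mutation whose step size shrinks as iteration g grows."""
    f_g = r2 * (1 - g / g_max) ** 2
    if r > 0.5:
        return a_ij + (a_max - a_ij) * f_g   # move towards the upper bound
    return a_ij + (a_min - a_ij) * f_g       # move towards the lower bound
```

In this sketch the mutation step always keeps the gene inside [a_min, a_max], which is a design choice made here for safety rather than something stated in the text.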

Construction of the Air Quality-Meteorology Correlation Fusion Model Based on RF + BP + GA
Through the above explanation of the basic principles of RF and the improvement of BP through the GA, the two are combined to implement an air quality fusion model. The air quality-meteorology correlation fusion model based on RF + BP + GA is illustrated in Figure 7.

The Role of Forests in Climate Regulation
The amount of carbon dioxide in forests varies between day and night, seasons, and weather conditions. Plants photosynthesize during the day, absorbing carbon dioxide from the air and releasing oxygen. At night, when photosynthesis stops and only respiration continues, they take oxygen from the air and release carbon dioxide. Therefore, the percentage of carbon dioxide in the forest at night differs from that during the day. The photosynthesis intensity of trees is closely related to temperature. When the temperature is too low, photosynthesis slows down. When the temperature is high, photosynthesis is fast. When it gets too hot, photosynthesis stops again. Therefore, carbon dioxide demand differs between seasons, weather conditions, high and low temperatures, and strong and weak photosynthesis. The carbon dioxide concentration in the forest also decreases as height within the forest increases, and it changes with the time of day, the season, and the weather.

Forests act as carbon sinks, removing pollutants from the atmosphere, and are versatile tools to combat air pollution and mitigate climate change. Forests absorb a third of the carbon dioxide released by fossil fuels worldwide every year. Forests and climate are interdependent. On the one hand, a forest, as a plant community, needs suitable environmental conditions. Light, heat, water, and other conditions directly affect the geographical distribution range and the spatial and temporal distribution pattern of various forest products and temperature changes. A dry or wet climate directly or indirectly affects the structure and function of the forest ecosystem. Therefore, if the climate changes, forest ecosystems will be affected. On the other hand, the forest itself can form a special microclimate. The forest changes the emissivity and thermal properties of the underlying surface. The forest climate is similar to the ocean's, with gentle temperature variations in relatively wet forests and nearby areas. In general, the reflectance of the forest is only half that of the soil. Solar radiation passes through the atmosphere, reaches the surface, and is absorbed by the forest layer, which occupies 30% of the land area; the energy is then transferred to the atmosphere through long-wave radiation, latent heat release, and sensible heat transfer. Forests can therefore be considered one of the heat reservoirs of the climate system. Forests partly affect precipitation, so forest destruction reduces the absorption of solar radiation and affects the water cycle. Large-scale forest changes may even affect global heat and water balances. As one of the components of the global climate system, forests stabilize the regional climate and thus play a role in stabilizing the global climate.

Forest carbon sinks are important for mitigating the effects of climate warming. However, under climate change, forests could easily become carbon sources rather than sinks. Natural disturbance mechanisms such as fire, pests, or drought can affect major forest functions, production, and stability. Applying the air quality-meteorology correlation model to forest ecosystem monitoring can provide data support for forest ecosystem management and help forest managers manage effectively.

Experimental Software Environment Settings
This section uses the Spark framework on the Hadoop big data platform and sets up three distributed frameworks. The operating system is Ubuntu 14.04 LTS. The crontab command in Ubuntu is used to schedule timed execution; in this experiment, the model execution task was set to run every 3 min. The JDK (Java Development Kit) version is JDK-7u80-Linux-x64, the Hadoop version is Hadoop-2.6, and the Spark version is 1.5.1.

Analysis of the Relationship between Air Quality and Meteorology
This section discusses the theory of the RF algorithm, expounds on the BPNN and related theories, and uses GA to optimize the prediction of the network weights and thresholds of the BPNN. We sample the air quality of Xi'an from 24 to 30 June 2022 as the research object to analyze the impact of meteorological factors, such as temperature, humidity, and wind, on air quality. The prediction results of the RF and BP-GA (Backpropagation-Genetic Algorithm) neural network algorithms in the air quality correlation model are analyzed. The changing trends of the AQI and meteorological conditions are shown in Figure 8.

In Figure 8a, the changing pattern of AQI is consistent with that of average temperature. The average temperature gradually decreased from 24 June 2022 to 25 June 2022, reached its lowest value on 26 June 2022, and gradually increased from 26 June 2022 to 30 June 2022. Therefore, there was a certain positive correlation between the average temperature and AQI. In Figure 8b, the relative humidity changed greatly from 25 June 2022 to 27 June 2022, averaging 52.33%. The relative humidity changed slightly for the rest of the time, averaging 31.75%. At the same time, the AQI value was relatively high when the relative humidity was low and relatively low when the relative humidity was high. It can be concluded that there is a certain negative correlation between relative humidity and AQI. In Figure 8c, the trend of wind level is relatively flat compared with AQI. From 24 June 2022 to 27 June 2022, the wind level changed the most, and the wind level on the other days remained at about level 2. When the wind level was high, the AQI value was relatively low. It can be inferred that there is a certain negative correlation between wind level and AQI.

The difference between meteorological conditions and AQI values from 24 to 30 June 2022 is given in Table 3. Table 3 shows that the AQI varied widely from 24 to 30 June 2022, with a range of 120. The minimum AQI is 46, and the maximum AQI is 166, which corresponds to moderate air pollution and indicates that air pollution occurred in this area. Moreover, the maximum and minimum values of the temperature, humidity, and wind factors in the region at the end of June differed considerably, and the air quality in the region at the end of June was affected by meteorological conditions.
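One way to put a number on the positive and negative correlations read off Figure 8 is the Pearson correlation coefficient. The sketch below is only an illustration: the text does not state which correlation measure was used, and the function name `pearson` is hypothetical. No measured values from Table 3 are reproduced here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Applied to the seven daily AQI values and the matching daily temperature, humidity, or wind-level series, a value near +1 would indicate a positive correlation and a value near -1 a negative one. With only seven samples, such coefficients should be read with caution.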

Analysis of the Prediction Results of the Air Quality-Meteorology Correlation Model Based on RF and BP-GA
The air quality-meteorology correlation model is constructed on the basis of RF and BP-GA. The AQI prediction results are shown in Figure 9.

In Figure 9a, most of the AQI values predicted by the RF model are close to but lower than the actual AQI values. In Figure 9b, the AQI values predicted by the BP-GA model are close to but higher than the actual AQI values. Table 4 details the differences in the predicted values of the different air quality-meteorology correlation models.

Table 4 indicates that the output interval of the AQI value predicted by the RF model is [42.3, 168.29], whereas the output interval of the AQI value predicted by the BP + GA model is [47.23, 164.32], which is closer to the actual AQI values. However, combined with the data shown in Figure 9, there is still a certain gap between the actual AQI values and the values predicted by the RF model and the BP + GA model. Therefore, this work fuses the RF and BP-GA models to jointly forecast AQI. The prediction results of the fusion model are shown in Figure 10.

Clearly, the AQI predicted by the RF + BP + GA fusion-based air quality-meteorology correlation model coincides with the actual measurement, and the trends of the predicted and actual values are the same. Thus, the fitting effect of the RF + BP + GA fusion-based air quality-meteorology correlation model is the best, achieving a near-complete fit and an accurate forecast of the AQI. The differences in the prediction values of the RF + BP + GA fusion-based air quality-meteorology correlation model are given in Table 5.

RF + BP + GA prediction training model    Value
The maximum value of AQI                  166
The maximum predicted value of AQI        166.32
The minimum value of AQI                  46
The minimum predicted value of AQI        46.23

Table 5 shows that the gap between the AQI values predicted by the RF + BP + GA fusion model and the measured AQI values lies in [0.23, 0.32], which is very small. In summary, the prediction results of the RF-based and BP-GA-based air quality-meteorology correlation models are slightly lower and slightly higher than the real values, respectively. The prediction results of the RF + BP + GA fusion-based air quality-meteorology correlation model are closest to the real values. Thus, fusing RF and BP-GA to build the air quality-meteorology correlation model predicts the AQI best.
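The [0.23, 0.32] interval follows directly from the four values reported in Table 5. The short check below recomputes it; the variable names are illustrative only.

```python
# Values reported in Table 5 of the text.
measured = {"max": 166.0, "min": 46.0}
predicted = {"max": 166.32, "min": 46.23}

# Absolute gap between predicted and measured AQI, rounded to two decimals.
gaps = {key: round(abs(predicted[key] - measured[key]), 2) for key in measured}
print(gaps)  # {'max': 0.32, 'min': 0.23}
```

Note that these gaps compare only the extreme values of the week, so they bound the error at the maximum and minimum AQI rather than at every sample.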

Discussion
The air quality-meteorology correlation model uses a fusion model of RF + BP + GA to predict air quality. An analysis of the air quality and meteorological conditions in Xi'an from 24 to 30 June 2022 shows a certain positive correlation between average temperature and AQI and a certain negative correlation between relative humidity and AQI. This result is consistent with the research results of Kais et al. [44], who used RF to evaluate and predict the air quality of 113 environmental protection cities in China from 2014 to 2016. Their survey results found a correlation between air quality levels and AQI values. Furthermore, the AQI value predicted by the RF + BP + GA model basically coincides with the measured AQI value. Among the air quality-meteorology correlation models, the RF + BP + GA model has the best fitting effect, basically achieves complete fitting, and can accurately predict the AQI.

Jiang et al. [45] established a fusion model of the extreme gradient boosting algorithm + BP + autoregressive moving average model to jointly predict the air quality in Changping District, Beijing. They found that the proposed model predicted air quality more accurately than a single extreme gradient boosting algorithm, BPNN, or autoregressive moving average model. Their results agree with the prediction effect of the fusion model in this work and show that a fusion model predicts air quality better than a single algorithm. When Qiao et al. [46] used BPNN and RF to predict the concentration of air pollutants, they found that the AQI accuracy of BP + RF prediction was about 87%. By comparison, the AQI values obtained here reflect our model's better prediction accuracy: the AQI value predicted by the RF + BP + GA model basically coincides with the measured AQI value, which supports the effectiveness of the proposed RF + BP + GA fusion-based air quality-meteorology correlation model.

Conclusions
Air quality data were collected in Xi'an from 24 to 30 June 2022. Following an analysis of the changing trends of air quality, AQI, and input variables, this work takes three meteorological factors (relative humidity, wind level, and average temperature) as the input variables of the air quality-meteorology correlation model and the AQI as the output variable. A BPNN model optimized with RF and GA is proposed to forecast the air quality in Xi'an. The relationship between temperature, humidity, wind level, and air quality is analyzed. The influence of climate change on the forest ecosystem is illustrated, and the interaction between the air quality-climate correlation model and the forest ecosystem is explored. The prediction results of the RF-based, BP-GA-based, and RF + BP + GA fusion-based air quality-meteorology correlation models are analyzed and compared. The results show a positive correlation between average temperature and AQI, a negative correlation between relative humidity and AQI, and a negative correlation between wind level and AQI. The AQI values predicted by the RF-based and BP-GA-based models are slightly lower and slightly higher than the actual AQI values, respectively. The fitting effect of the RF + BP + GA fusion-based air quality-meteorology correlation model is the best, and complete fitting is basically realized. The prediction error of the proposed RF + BP + GA model for AQI is not more than 0.32, which shows a good fit with the actual value. The fusion air quality-meteorology correlation model can accurately forecast the AQI. Inevitably, seasonal and interannual fluctuations in meteorological conditions affect air quality. However, even under short-term or long-term adverse conditions, it is imperative to focus on the long term and make steady progress following scientific and accurate pollution control principles. Forests and climate are interdependent. Forests act as carbon sinks, removing pollutants from the atmosphere and serving as a multifunctional tool to combat air pollution and mitigate climate change.

The findings provide research references for predicting the AQI using an air quality-meteorology correlation model and data support for analyzing air quality problems. The ecological relationship between climate and forest is expounded in detail, and the change and development of forest ecosystems under the air quality-climate correlation model are studied, which provides a reference for research on the interaction between climate and forest ecosystems. This work still has some shortcomings. Firstly, the sample size is too small. Secondly, although the BPNN is optimized, the optimization results of other machine-learning methods on the neural network are not compared. In addition, this work uses the air quality-meteorology correlation model of Xi'an in June 2022 for prediction analysis. Temperatures were high during this period, and the changes in other meteorological factors caused by these high temperatures are unknown. Future research should consider the impact of meteorological factors on air quality changes more comprehensively. The amount of training data can also be expanded, for example, to one full year of air quality and meteorological data for Xi'an in 2021. Various algorithms can be considered to optimize the neural network, and the optimization effects of other machine-learning algorithms on the neural network can be compared and analyzed.

Figure 1. Changes in AQI index value in Xi'an from 24 to 30 June 2022.

Figure 2. Basic principles of RF (A, B, C represent different results of different decision tree prediction classifications).

Figure 3. The importance of sample features in RF training.

Figure 3 has four sample features. Suppose one feature is replaced by noise, and modeling is carried out accordingly. Let Error 1 be the classification error rate before the replacement and Error 2 the error rate after it. When the difference between Error 1 and Error 2 is small, the importance of that feature is low. When Error 2 is much larger than Error 1, the feature has a greater impact on the classification results. Table 2 lists the algorithm flow of the RF.
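The comparison of Error 1 and Error 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the "noise" is taken to be a random permutation of the feature column, `predict` stands for any already trained classifier, and the function names are hypothetical. An actual RF implementation would typically evaluate this on out-of-bag samples for each tree.

```python
import random

def error_rate(predict, rows, labels):
    """Fraction of samples that the classifier gets wrong."""
    wrong = sum(1 for x, y in zip(rows, labels) if predict(x) != y)
    return wrong / len(rows)

def noise_importance(predict, rows, labels, feature, rng):
    """Error 2 minus Error 1 after one feature column is replaced by noise.

    rows is a list of lists; the noise is a shuffled copy of the column.
    """
    error_1 = error_rate(predict, rows, labels)
    column = [x[feature] for x in rows]
    rng.shuffle(column)
    noisy = [x[:feature] + [v] + x[feature + 1:] for x, v in zip(rows, column)]
    error_2 = error_rate(predict, noisy, labels)
    return error_2 - error_1
```

A feature that the classifier never uses yields an importance of exactly zero, while a feature that drives the decision yields a larger value whenever the shuffle changes its column.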

Figure 4 presents the connection structure between BPNN neurons. According to Figure 4, $W^{L}_{ij}$ represents the connection weight of the $i$th neuron and the $j$th neuron in the $L$-th layer. $L$ is the number of layers, and $i$ denotes the $i$th neuron in the $L$-th layer. All neurons in each layer and each neuron in the next layer have a weight. The weight of the next layer is $y_i$. The upper layer output is multiplied by each corresponding $W^{L}$, accumulated, and added to the threshold $b$ to obtain $z_i$. Then, the output of the neural network is processed with the excitation function.
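The weighted sum, threshold, and excitation function described for Figure 4 can be written out as a small Python sketch. The sigmoid excitation function and the names `sigmoid` and `layer_forward` are assumptions made for illustration; the text does not state which excitation function the BPNN uses.

```python
import math

def sigmoid(z):
    """Assumed excitation function; the text does not name the one used."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, thresholds, activation=sigmoid):
    """Forward pass of one layer.

    weights[j][i] connects upper-layer output i to neuron j of this layer,
    and thresholds[j] is the threshold b of neuron j.
    """
    outputs = []
    for w_row, b in zip(weights, thresholds):
        # Multiply each upper-layer output by its weight, accumulate, add b.
        z = sum(w * y for w, y in zip(w_row, inputs)) + b
        outputs.append(activation(z))
    return outputs
```

Calling `layer_forward` once per layer, with the previous layer's outputs as inputs, gives the forward pass of the whole network; the weights and thresholds passed in are the quantities that the GA encodes in each individual.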

Figure 4. The connection structure between BPNN neurons.

Figure 5. The initialization operation, exchange operation, and mutation operation processes of GA.

Figure 7. The air quality-meteorology correlation fusion model based on RF + BP + GA.

In Figure 7, the air quality-meteorology correlation fusion model based on RF + BP + GA first uses RF to extract air quality and meteorological features, mainly humidity and temperature. This step provides a training feature subset for the fusion model. Second, the BPNN training model optimized with the GA is pre-trained on the feature subset to reduce overfitting during model training and improve the training performance of the model. Finally, the BPNN training model optimized with the GA is used to predict the AQI value. The RF + BP + GA fusion model reduces the dimension of the input features, reduces the influence of irrelevant factors on the training results, and thereby optimizes the air quality-meteorology correlation fusion model.


Figure 8. The changing trend of AQI and meteorological conditions: (a) is the trend of temperature and AQI; (b) is the trend of relative humidity and AQI; and (c) is the trend of wind level and AQI.

Figure 9. The prediction results of the air quality-meteorology correlation model: (a) is the prediction results of the RF model; (b) is the prediction results of the BP-GA model.

Figure 10. Prediction results of the RF + BP + GA fusion-based air quality-meteorology correlation model.

Table 1. The applicable range of different evaluation indices.

Table 2. Steps of RF algorithm flow.

Table 3. The difference between meteorological conditions and AQI values from 24 to 30 June 2022.

Table 4. The differences in the predicted numerical values of the air quality-meteorology correlation model of different models.

Table 5. The differences in prediction values of the RF + BP + GA fusion-based air quality-meteorology correlation model.