Prediction of Ozone Hourly Concentrations Based on Machine Learning Technology
Abstract
1. Introduction
- (1) We proposed a feature construction algorithm that analyzes the interaction strength between ozone and both its own historical values and other atmospheric pollutants from the temporal and spatial perspectives, and built a feature set for ozone prediction based on this algorithm.
- (2) We proposed an ozone concentration prediction model, FC-LsOA-KELM, which combines the feature construction algorithm, the kernel extreme learning machine and the lioness optimization algorithm.
- (3) We evaluated the prediction performance of FC-LsOA-KELM using 2015–2019 air pollution data from 11 cities on the Fenwei Plain in China. The results showed that the proposed O3 prediction model obtains better prediction results and outperforms other prediction models.
2. Related Work
- (1) Prediction methods based on linear regression. Linear regression prediction methods study the linear causal relationship between the predictors (historical data of air pollutants) and the response variable (future ozone concentration). The most commonly used linear regression methods in O3 prediction are ARIMA [8,9] and multiple linear regression (MLR) [10,11,12,13]. Although these statistical methods have been widely used for near-surface O3 concentration prediction, they have many limitations. For example, MLR prediction requires a large number of predictors, and these predictors often suffer from multicollinearity [14]. In addition, the formation process of O3 is strongly nonlinear, and the ozone concentration also depends on many other factors, such as meteorological factors (temperature, relative humidity, etc.), atmospheric transport processes, and the concentrations of ozone precursor compounds (VOCs, NOx, etc.). As a result, the prediction accuracy of existing statistical models is often unsatisfactory, and the error is especially large when predicting extreme values [15].
- (2) Prediction methods based on artificial neural networks (ANN). The ANN is one of the most commonly used machine learning methods for ozone prediction. An ANN, which can be built with limited prior knowledge, is a nonlinear prediction model [15,16]. It can therefore capture the nonlinear relationship between meteorological and photochemical processes and the ozone concentration at a particular site, and thus predict the ozone concentration. By comparing the predictions of ANNs with those of MLR and ARIMA, scholars have found ANNs to be more effective than such statistical models [17,18,19,20].
- (3) Prediction methods based on support vector machines (SVM). The SVM is a machine learning technique that has been widely applied to regression and classification problems [21]. Like the ANN, it is commonly used for ozone prediction [22,23,24]. Studies have found that SVR achieves greater accuracy than statistical prediction methods such as ARIMA and MLR [25]. However, some researchers [26] have pointed out that ANN and SVM are not perfect and still have certain limitations in ozone prediction: both methods are prone to overfitting and local minima, resulting in poor prediction stability.
- (4) Prediction methods based on fuzzy set theory. In 1993, Song and Chissom proposed fuzzy time series (FTS) based on fuzzy set theory, and scholars subsequently tried to apply this theory to O3 prediction. Domanska and Wojtylak [27] proposed a prediction method based on fuzzy set theory which uses a fuzzy time series model to predict O3, CO, NO and other pollutants. Although the prediction results were satisfactory, the lack of uncertainty and instability analysis in that work cast doubt on the reliability of the method [28].
- (5) Prediction methods based on deterministic models. Deterministic models are based on mathematical equations describing chemical and physical processes in the atmosphere [29], and follow the principle of cause and effect [15]. When using a deterministic model for prediction, the O3 reaction equations must be established first, and then a large amount of ozone precursor and meteorological parameter data must be collected. In both tasks, the design of the reaction equations is key: if the equations are not designed properly, or the parameters are inappropriate, the accuracy of the prediction model is greatly reduced.
3. Method Design
3.1. Design of Feature Construction Algorithm
Algorithm 1. Feature Construction Algorithm
1  Inputs: the original multivariate time series V, the maximum time delay τ_max, the correlation coefficient threshold θ, the feature set F = ∅
2  Outputs: the feature set F
3  While i <= τ_max do          % i is the time series delay value
4      While j <= m do          % m is the number of variables in V
5          Calculate the correlation coefficient r between the target series and V_j delayed by i
6          if r >= θ
7              Append (r, i, j) to C    % C is used to record candidate feature information
8          end if
9      End while
10 End while
11 Sort C in descending order according to its first column    % sort by correlation coefficient
12 While C is not empty do
13     Generate a feature f based on the recorded values of i and j
14     F = F ∪ {f}
15 End while
16 Return F
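A minimal Python sketch of Algorithm 1 may make the loop structure concrete. It assumes the Pearson correlation coefficient as the interaction measure and treats the ozone series as the target column of V; the function and variable names are illustrative, not the paper's code.

```python
import numpy as np

def construct_features(V, target_col, max_delay, threshold):
    """Feature construction (FC) sketch: V is a (T, m) array holding m pollutant
    time series over T historical time points. Lagged series whose correlation
    with the target exceeds `threshold` become features, ordered by strength."""
    T, m = V.shape
    target = V[:, target_col]
    candidates = []                                    # C: (correlation, delay i, variable j)
    for i in range(1, max_delay + 1):                  # i: time series delay value
        for j in range(m):                             # j: variable index (incl. ozone itself)
            r = np.corrcoef(V[:-i, j], target[i:])[0, 1]
            if abs(r) >= threshold:
                candidates.append((abs(r), i, j))
    candidates.sort(key=lambda c: c[0], reverse=True)  # descending by correlation
    usable = T - max_delay                             # samples for which every lag exists
    F = [V[max_delay - i: max_delay - i + usable, j] for _, i, j in candidates]
    return (np.column_stack(F) if F else np.empty((usable, 0))), candidates
```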
3.2. Kernel Extreme Learning Machine (KELM)
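This section builds on the standard KELM formulation of Huang et al. [33]: the output weights are obtained in closed form as β = (Ω + I/C)⁻¹T, where Ω is the kernel matrix, I the identity matrix and C the regularization coefficient (see the Nomenclature). A minimal sketch with an RBF kernel follows; the C and gamma values are placeholders, since in FC-LsOA-KELM these are the quantities tuned by LsOA.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Pairwise squared Euclidean distances -> RBF (Gaussian) kernel matrix
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class KELM:
    """Kernel extreme learning machine for regression (minimal sketch)."""
    def __init__(self, C=1.0, gamma=0.1):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X = X
        omega = rbf_kernel(X, X, self.gamma)                         # kernel matrix Ω
        n = X.shape[0]
        self.beta = np.linalg.solve(omega + np.eye(n) / self.C, y)  # β = (Ω + I/C)^-1 y
        return self

    def predict(self, X_new):
        return rbf_kernel(X_new, self.X, self.gamma) @ self.beta
```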
3.3. Lioness Optimization Algorithm (LsOA)
- (1) Team hunting mechanism. This mechanism refers to the lionesses of the pride hunting cooperatively. When hunting in groups, the lionesses often form a team: one group (the "wings") surrounds the prey, while another group (the "center") moves relative to the positions of the "wings" and the prey. When the lionesses at the "wings" begin to charge towards the prey, the lionesses in the "center" role cautiously approach the target, using any available barriers as cover to stay hidden as much as possible. When a "center" lioness is close enough to its prey, it can suddenly pounce on the target and catch it. The mathematical model is as follows:
- (2) Elite hunting mechanism. In addition to team hunting, following the principle of survival of the fittest, the top lioness (the physically strongest lioness) sometimes hunts alone, that is, using the elite hunting mechanism. In this case, the rest of the population follows the elite agents, repeatedly probing the direction and position of the elite until the prey is finally captured. To avoid the risk of falling into a local optimum caused by relying on a single elite position, we draw on the principle of triangular stability and replace the common practice of taking the previous round's single best position as the elite with a joint decision based on the positions of the top three agents ranked by last-round fitness. Figure 2 shows the construction process of the elite matrix.
- (3) The march strategy. As the number of optimization iterations increases, so does the risk of the lions falling into a local optimum. To enable the lions to quickly explore a new area, this paper designed the march strategy: when its trigger condition is met, all dimensions of the predator are unified into a single value; otherwise, each dimension randomly selects a value to be updated with the following formula:
- (4) Phase-focused strategy. This strategy divides the lion hunting process into three phases: the early, middle and late iterations [36], with a different search mechanism for each phase (a schematic sketch of this schedule is given after Algorithm 2). At the beginning of the iteration, the prey is energetic and moves fast; the lions disperse randomly in the search area and use Brownian motion to find the prey, so this phase focuses on exploration. When the number of iterations reaches one-third of the maximum, the lions begin to narrow the encircling circle around the target prey; the chased prey uses Levy flight to escape, and the elite lions likewise adopt Levy flight to pursue it, so in this phase exploration is as important as exploitation. At the end of the iteration, the target prey tires and slows, no longer able to flee far, while the lions' encircling circle becomes smaller and smaller and the probability of capturing the prey increases greatly.
Algorithm 2. Lioness Optimization Algorithm
1  Initialize the search agent (prey) population X_i, i = 1, …, n
2  Assign the free parameters: … = 0.2; … = 0.5; … = 0.5; … = 0.9
3  While Iter < Max_iter
4      Calculate the fitness of each search agent
5      X_A = the best search agent
6      X_B = the second-best search agent
7      X_C = the third-best search agent
8      X_D = the fourth-best search agent
9      Update the coefficient vectors
10     If (…)    % Team hunting
11         For each search agent
12             Update … and … by Equations (12) and (13)
13             Use … and … to calculate …
14             Calculate …, … and … by Equation (15)
15             …
16             Construct the "center circle": …
17             If (…)
18                 Update the position of the current search agent by Equation (19)
19             else if (…)
20                 Update the position of the current search agent by Equation (20)
21             end if
22         End for
23     Else if (…)    % Elite hunting
24         Top_lioness_pos = X_A
25         Construct the elite matrix and accomplish memory saving
26         For each search agent
27             If Iter < Max_iter/3
28                 Update the position of the current search agent by Equation (21)
29             else if Max_iter/3 < Iter < 2·Max_iter/3
30                 For the first half of the population (i = 1, …, n/2)
31                     Update the position of the current search agent by Equation (22)
32                 End for
33                 For the other half of the population (i = n/2, …, n)
34                     Update the position of the current search agent by Equation (23)
35                 End for
36             else if Iter > 2·Max_iter/3
37                 Update the position of the current search agent by Equation (24)
38             end if
39         End for
40     End if
41     Update Top_lioness_pos if there is a better solution
42     Apply the FADs effect and update the position of the current search agent
43     Iter = Iter + 1
44 End while
45 Return Top_lioness_pos
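The phase-focused schedule in lines 27–38 of Algorithm 2 mirrors the three-phase design of the Marine Predators Algorithm [36]. Equations (21)–(24) are not reproduced here, so the following is only a schematic sketch: the Brownian and Levy steps stand in for the motion types named in Section 3.3, and taking the elite row as the mean of the top three agents is an assumption based on the "joint decision" described in mechanism (2).

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(size, beta=1.5, rng=np.random):
    # Mantegna's algorithm for Levy-flight step lengths
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

def elite_hunt_step(pos, top3, it, max_iter, rng=np.random):
    """One schematic elite-hunting move for agents at `pos` (n x d array).
    top3: positions of the three best agents from the last round (3 x d)."""
    elite = top3.mean(axis=0)                  # assumed joint decision of the top-3 agents
    if it < max_iter / 3:                      # early phase: exploration
        step = rng.normal(size=pos.shape)      # Brownian motion
    elif it < 2 * max_iter / 3:                # middle phase: exploration ~ exploitation
        step = levy_step(pos.shape, rng=rng)   # Levy flight, as used by prey and elites
    else:                                      # late phase: exploitation
        cf = (1 - it / max_iter) ** 2          # shrinking "encircling circle" factor
        step = cf * rng.normal(size=pos.shape)
    return pos + step * (elite - pos)          # move each agent relative to the elite
```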
3.4. Design of the Prediction Model
- (1) Randomly select 10% of the data from the training set (excluded from training) as the model validation set, which is used to test the performance of the KELM trained on the training set against data it has not seen. Its purpose is to test the generalization performance of the prediction model.
- (2) Instead of taking the training error on the model training set as the optimization objective, as common optimization algorithms do, this paper redesigned the fitness function:
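Since the exact equation is not shown above, the following is only a minimal sketch of a validation-aware fitness of this kind, reusing the KELM sketch from Section 3.2; the equal weighting of training and validation error is an illustrative assumption, not the paper's formula.

```python
import numpy as np

def fitness(params, X_train, y_train, X_val, y_val):
    """Validation-aware fitness for LsOA (sketch). `params` are the KELM
    hyperparameters being optimized; KELM is the sketch from Section 3.2."""
    C, gamma = params
    model = KELM(C=C, gamma=gamma).fit(X_train, y_train)
    rmse = lambda y, p: float(np.sqrt(np.mean((y - p) ** 2)))
    # Penalize agents that fit the training set but generalize poorly.
    return 0.5 * rmse(y_train, model.predict(X_train)) + \
           0.5 * rmse(y_val, model.predict(X_val))
```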
3.5. Comparison Methods
- (1) Multiple linear regression [37]. In regression analysis, regression with two or more independent variables under linear correlation conditions is called multiple linear regression (MLR). Predicting the dependent variable from an optimal combination of several independent variables is usually more effective than using only one independent variable for prediction or estimation (a consolidated sketch of the scikit-learn baselines is given after this list). The mathematical equation of MLR is as follows:
- (2) Gaussian process regression [38]. Gaussian process regression (GPR) is a nonparametric model that uses Gaussian process priors to perform regression analysis on data, which provides flexibility for modeling stochastic processes. Compared with models based on data parameters, GPR specifies a prior distribution over the function space, where the relationships between the data are encoded in the covariance function of a multivariate Gaussian distribution. The squared exponential function, one of the most commonly used covariance functions, is shown in Equation (28).
- (3) Back propagation neural network [39]. The back propagation neural network (BPNN) is a multi-layer feedforward network trained by error back-propagation. Its basic idea is gradient descent: gradient search is used to minimize the mean square error between the network's actual output and the expected output. It is the most widely used neural network.
- (4) Support vector regression [40]. The SVM is a class of generalized linear classifiers that performs binary classification of data in a supervised learning manner. Support vector regression (SVR) applies the support vector machine to regression problems. Its core idea is to find a hyperplane (or hypersurface) that minimizes the expected risk.
- (5) Kernel ridge regression [41]. Ridge regression [42] is a well-known technique from multiple linear regression that implements a regularized form of least-squares regression. Kernel ridge regression (KRR) introduces a kernel function on the basis of ridge regression, maps low-dimensional data into a high-dimensional space, and constructs a linear ridge regression model in that high-dimensional feature space to realize nonlinear regression [41]. KRR is widely used in pattern recognition, data mining and other fields.
- (6) Decision tree. The decision tree (DT) is a non-parametric supervised learning method used for classification and regression. The goal is to create a model that learns simple decision rules from data features to predict the value of a target variable.
- (7) Stochastic gradient descent regression [43]. Stochastic gradient descent (SGD) is a simple but highly efficient method which is mainly used for the discriminant learning of linear classifiers under convex loss functions, such as (linear) support vector machines and logistic regression. Stochastic gradient descent regression supports different loss functions and penalties to fit linear regression models.
- (8) GCN [44]. A graph convolutional network (GCN) is essentially a feature extractor, like a convolutional neural network (CNN), but it operates on graph data. It can be applied to data with richer topological structure, such as social networks, recommendation systems and transportation networks, which are characterized by irregular connections. The GCN provides an elegant way to extract features from graph data, so that these features can be used for node classification, graph classification and edge prediction, and embedded representations of the graph can be obtained as a by-product; it is therefore widely used.
- (9) GAT [45]. The graph attention network (GAT) aggregates neighbor nodes through the attention mechanism, realizing the adaptive allocation of different neighbor weights. This differs from the GCN, where the weights of different neighbors are fixed and all come from the normalized Laplacian matrix. The GAT greatly improves the expressive ability of graph neural network models.
- (10) LsOA-KELM. The LsOA-KELM model uses the LsOA to optimize the parameters of KELM. In contrast to the model proposed in this paper, LsOA-KELM processes the original data without feature construction. When using LsOA-KELM, the population size of LsOA was 30 and the number of iterations was set to 50. The kernel function of KELM was set to 'RBF', and the other parameters were kept at the default values of the original algorithm.
- (11) FC-LsOA-KELM--. The FC-LsOA-KELM-- model adds a feature construction step to LsOA-KELM: it reconstructs the original data and hands them to LsOA-KELM for training and prediction. The main difference between this model and the FC-LsOA-KELM model is the lack of a correction step for the predicted values. The parameter settings of this model were consistent with LsOA-KELM.
3.6. Evaluation Indicators
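For the two headline indicators used in the result tables, MAPE and RMSE (lower is better for both), a standard NumPy formulation is sketched below; the paper also reports a third indicator whose definition is not reproduced here.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error (%); assumes y_true contains no zeros."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def rmse(y_true, y_pred):
    """Root mean square error, in the units of the target variable."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```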
4. Research Results
4.1. Research Area
4.2. Research Data
4.3. Feature Construction
4.4. Result Analysis
4.5. Statistical Analysis
5. Conclusions
- (1) The selection and use of prediction features have a significant impact on the prediction performance of the model. When we used LsOA-KELM to train directly on air pollution data without feature selection or reconstruction, the resulting model's predictions of future O3 concentrations were not ideal: in the evaluation of MAPE, RMSE and the third indicator, LsOA-KELM was worse than BPNN, MLR and other methods. However, when LsOA-KELM was given the air pollution data reconstructed by FC, its prediction performance improved significantly.
- (2) The prediction feature set constructed by the feature construction method (FC) can not only mine the potential relationships between air pollutants, but also analyze the impact of historical pollutant levels on future ones. This enriches the sources of O3 prediction features and thereby helps the prediction model improve the accuracy of its O3 predictions.
- (3) Using historical data to revise the prediction results can reduce the outliers caused by insufficient training of the prediction model, thereby improving its prediction accuracy.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Nomenclature
FC
- record of candidate feature information
- threshold for correlation coefficient
- correlation coefficient
- number of subsequences of the time series
- number of historical time points
- multivariate time series
- the jth subsequence
- candidate features of the jth subsequence
- feature set
- a time delay
- maximum time delay

KELM
- bias of the ith hidden node
- regularization coefficient
- activation function
- output matrix of the hidden layer
- Moore–Penrose generalized inverse of matrix H
- identity matrix
- kernel function
- total number of samples
- input weight vector
- input vector
- expected output vector
- output matrix of the output layer
- output weight matrix
- output weight vector
- inverse of the regularization coefficient
- kernel matrix

LsOA
- coefficient vector
- coefficient vector
- coefficient vector
- adaptive parameter
- distance between the lioness and the prey
- hunting team's "center circle"
- elite matrix
- current iteration number
- constant
- maximum iteration number
- position vector of the prey
- constant
- random vectors between [0, 1]
- random number vector of Brownian motion
- uniform random vector in [0, 1]
- random number vector of Levy's flight
- current iteration number
- position vector of the elite lioness
- position vector of the lioness
- position vector of the lioness A with the best fitness
- position vector of the lioness B with the second highest fitness
- position vector of the lioness C with the third highest fitness
- position vector of the lioness D with the fourth highest fitness
- position vector adjusted by lioness A
- position vector adjusted by lioness B
- position vector adjusted by lioness C
- position vector adjusted by lioness D
- the mean of …
Appendix A
Methods | 2 | 3 | 4 | 5 | 6 | 7
---|---|---|---|---|---|---
q0.05 | 1.960 | 2.241 | 2.394 | 2.498 | 2.576 | 2.638
q0.10 | 1.645 | 1.960 | 2.128 | 2.241 | 2.326 | 2.394

Methods | 8 | 9 | 10 | 11 | 12 | 13
---|---|---|---|---|---|---
q0.05 | 2.690 | 2.724 | 2.774 | 3.219 | 3.268 | 3.313
q0.10 | 2.450 | 2.498 | 2.539 | 2.978 | 3.030 | 3.077
References
- Hemming, B.L.; Harris, A.; Davidson, C.; U.S. EPA. Air Quality Criteria for Lead (2006) Final Report; U.S. Environmental Protection Agency: Washington, DC, USA, 2006; EPA/600/R-05/144aF-bF.
- Khatibi, R.; Naghipour, L.; Ghorbani, M.A.; Smith, M.S.; Karimi, V.; Farhoudi, R.; Delafrouz, H.; Arvanaghi, H. Developing a predictive tropospheric ozone model for Tabriz. Atmos. Environ. 2013, 68, 286–294.
- Ordieres-Meré, J.; Ouarzazi, J.; Johra, B.E.; Gong, B. Predicting ground level ozone in Marrakesh by machine-learning techniques. J. Environ. Inform. 2020, 36, 93–106.
- Yang, L.; Xie, D.; Yuan, Z.; Huang, Z.; Wu, H.; Han, J.; Liu, L. Quantification of regional ozone pollution characteristics and its temporal evolution: Insights from the identification of the impacts of meteorological conditions and emissions. Atmosphere 2021, 12, 279.
- Bell, M.L.; Peng, R.D.; Dominici, F. The Exposure–Response Curve for Ozone and Risk of Mortality and the Adequacy of Current Ozone Regulations. Environ. Health Perspect. 2006, 114, 532–536.
- Mills, G.; Buse, A.; Gimeno, B.; Bermejo, V.; Holland, M.; Emberson, L.; Pleijel, H. A synthesis of AOT40-based response functions and critical levels of ozone for agricultural and horticultural crops. Atmos. Environ. 2007, 41, 2630–2643.
- Riga, M.; Stocker, M.; Ronkko, M.; Karatzas, K.; Kolehmainen, M. Atmospheric Environment and Quality of Life Information Extraction from Twitter with the Use of Self-Organizing Maps. J. Environ. Inform. 2015, 26, 27–40.
- Dueñas, C.; Fernández, M.C.; Cañete, S.; Carretero, J.; Liger, E. Stochastic model to forecast ground-level ozone concentration at urban and rural areas. Chemosphere 2005, 61, 1379–1389.
- Kumar, K.; Yadav, A.K.; Singh, M.P.; Hassan, H.; Jain, V.K. Forecasting Daily Maximum Surface Ozone Concentrations in Brunei Darussalam—An ARIMA Modeling Approach. J. Air Waste Manag. Assoc. 2004, 54, 809–814.
- Hubbard, M.C.; Cobourn, W.G. Development of a regression model to forecast ground-level ozone concentration in Louisville, KY. Atmos. Environ. 1998, 32, 2637–2647.
- Kovač-Andrić, E.; Sheta, A.; Faris, H.; Gajdošik, M.Š. Forecasting ozone concentrations in the east of Croatia using nonparametric neural network models. J. Earth Syst. Sci. 2016, 125, 997–1006.
- Allu, S.K.; Srinivasan, S.; Maddala, R.K.; Reddy, A.; Anupoju, G.R. Seasonal ground level ozone prediction using multiple linear regression (MLR) model. Model. Earth Syst. Environ. 2020, 6, 1981–1989.
- Iglesias-Gonzalez, S.; Huertas-Bolanos, M.E.; Hernandez-Paniagua, I.Y.; Mendoza, A. Explicit Modeling of Meteorological Explanatory Variables in Short-Term Forecasting of Maximum Ozone Concentrations via a Multiple Regression Time Series Framework. Atmosphere 2020, 11, 1304.
- Oufdou, H.; Bellanger, L.; Bergam, A.; Khomsi, K. Forecasting daily of surface ozone concentration in the Grand Casablanca region using parametric and nonparametric statistical models. Atmosphere 2021, 12, 666.
- Pawlak, I.; Jarosławski, J. Forecasting of Surface Ozone Concentration by Using Artificial Neural Networks in Rural and Urban Areas in Central Poland. Atmosphere 2019, 10, 52.
- Kumar, P.; Lai, S.H.; Wong, J.K.; Mohd, N.S.; Kamal, M.R.; Afan, H.A.; Ahmed, A.N.; Sherif, M.; Sefelnasr, A.; El-Shafie, A. Review of Nitrogen Compounds Prediction in Water Bodies Using Artificial Neural Networks and Other Models. Sustainability 2020, 12, 4359.
- Spellman, G. An application of artificial neural networks to the prediction of surface ozone concentrations in the United Kingdom. Appl. Geogr. 1999, 19, 123–136.
- Chaloulakou, A.; Saisana, M.; Spyrellis, N. Comparative assessment of neural networks and regression models for forecasting summertime ozone in Athens. Sci. Total Environ. 2003, 313, 1–13.
- Sousa, S.; Martins, F.G.; Alvim-Ferraz, M.; Pereira, M.C. Multiple linear regression and artificial neural networks based on principal components to predict ozone concentrations. Environ. Model. Softw. 2007, 22, 97–103.
- AlOmar, M.K.; Hameed, M.M.; AlSaadi, M.A. Multi hours ahead prediction of surface ozone gas concentration: Robust artificial intelligence approach. Atmos. Pollut. Res. 2020, 11, 1572–1587.
- Fares, S.; Alivernini, A.; Conte, A.; Maggi, F. Ozone and particle fluxes in a Mediterranean forest predicted by the AIRTREE model. Sci. Total Environ. 2019, 682, 494–504.
- Luna, A.S.; Paredes, M.; Oliveira, G.; Corrêa, S.M. Prediction of ozone concentration in tropospheric levels using artificial neural networks and support vector machine at Rio de Janeiro, Brazil. Atmos. Environ. 2014, 98, 98–104.
- Quej, V.H.; Almorox, J.; Arnaldo, J.A.; Saito, L. ANFIS, SVM and ANN soft-computing techniques to estimate daily global solar radiation in a warm sub-humid environment. J. Atmos. Sol.-Terr. Phys. 2017, 155, 62–70.
- Faleh, R.; Bedoui, S.; Kachouri, A. Ozone monitoring using support vector machine and K-nearest neighbors methods. J. Electr. Electron. Eng. 2017, 10, 49–52.
- Su, X.; An, J.; Zhang, Y.; Zhu, P.; Zhu, B. Prediction of ozone hourly concentrations by support vector machine and kernel extreme learning machine using wavelet transformation and partial least squares methods. Atmos. Pollut. Res. 2020, 11, 51–60.
- Lu, W.Z.; Wang, D. Learning machines: Rationale and application in ground-level ozone prediction. Appl. Soft Comput. J. 2014, 24, 135–141.
- Domanska, D.; Wojtylak, M. Application of fuzzy time series models for forecasting pollution concentrations. Expert Syst. Appl. 2012, 39, 7673–7679.
- Yafouz, A.; Najah, A.; Zaini, A.; El-Shafie, A. Ozone Concentration Forecasting Based on Artificial Intelligence Techniques: A Systematic Review. Water Air Soil Pollut. 2021, 232, 79.
- Vautard, R.; Beekmann, M.; Roux, J.; Gombert, D. Validation of a hybrid forecasting system for the ozone concentrations over the Paris area. Atmos. Environ. 2001, 35, 2449–2461.
- Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501.
- Huang, G.B.; Wang, D.H.; Lan, Y. Extreme Learning Machines: A Survey. Int. J. Mach. Learn. Cybern. 2011, 2, 107–122.
- Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: A new learning scheme of feedforward neural networks. In Proceedings of the IEEE International Joint Conference on Neural Networks, Budapest, Hungary, 25–29 July 2004.
- Huang, G.B.; Zhou, H.; Ding, X.; Zhang, R. Extreme Learning Machine for Regression and Multiclass Classification. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2012, 42, 513–529.
- Holland, J.H. Genetic algorithms. Sci. Am. 1992, 267, 66–72.
- Eberhart, R.; Kennedy, J. A new optimizer using particle swarm theory. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science, MHS'95, Nagoya, Japan, 4–6 October 1995; pp. 39–43.
- Faramarzi, A.; Heidarinejad, M.; Mirjalili, S.; Gandomi, A.H. Marine Predators Algorithm: A Nature-inspired Metaheuristic. Expert Syst. Appl. 2020, 152, 113377.
- Yuchi, W.; Gombojav, E.; Boldbaatar, B.; Galsuren, J.; Enkhmaa, S.; Beejin, B.; Naidan, G.; Ochir, C.; Legtseg, B.; Byambaa, T.; et al. Evaluation of random forest regression and multiple linear regression for predicting indoor fine particulate matter concentrations in a highly polluted city. Environ. Pollut. 2019, 245, 746–753.
- Cao, Q.D.; Miles, S.B.; Choe, Y. Infrastructure recovery curve estimation using Gaussian process regression on expert elicited data. Reliab. Eng. Syst. Saf. 2022, 217, 108054.
- Wang, L.; Zeng, Y.; Chen, T. Back propagation neural network with adaptive differential evolution algorithm for time series forecasting. Expert Syst. Appl. 2015, 42, 855–863.
- Brereton, R.G.; Lloyd, G.R. Support vector machines for classification and regression. Analyst 2010, 135, 230–267.
- Cawley, G.C.; Talbot, N.; Foxall, R.J.; Dorling, S.R.; Mandic, D.P. Heteroscedastic kernel ridge regression. Neurocomputing 2004, 57, 105–124.
- Banerjee, K.S.; Carr, R.N. Ridge regression-Biased estimation for non-orthogonal problems. Technometrics 1971, 12, 55–67.
- Ighalo, J.O.; Adeniyi, A.G.; Marques, G. Application of linear regression algorithm and stochastic gradient descent in a machine-learning environment for predicting biomass higher heating value. Biofuels Bioprod. Biorefining 2020, 14, 1286–1295.
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
- Zar, J.H. Biostatistical Analysis. Q. Rev. Biol. 2010, 18, 797–799.
- Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979, 6, 65–70.
Method | Key Parameters | Parameter Introduction (Accessed on 1 May 2022) | Advantages and Disadvantages
---|---|---|---
MLR | fit_intercept=True, normalize=‘False’ | https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression | advantages: simple modeling; easy explanation; fast running speed. disadvantages: does not fit nonlinear data very well
SVR | kernel=‘poly’, C=1.1, gamma=‘auto’, degree=3, epsilon=0.1, coef0=1.0 | https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR | advantages: robust to outliers; solves high-dimensional problems; excellent generalization ability. disadvantages: not suitable for large-scale data; sensitive to missing data
BPNN | hidden layer nodes=30 | | advantages: self-learning and adaptive ability; high-speed optimization; parallel processing capability. disadvantages: a large number of parameters; difficult to explain; risk of falling into a local optimum
GPR | kernel=DotProduct() + WhiteKernel(), random_state=0 | https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html#sklearn.gaussian_process.GaussianProcessRegressor | advantages: fits nonlinear data; predicted values are probabilistic; interpretability. disadvantages: choice of covariance function; nonparametric model; high complexity when the amount of data is large
KRR | alpha=1, kernel=‘linear’, gamma=None, degree=3, coef0=1 | https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html | advantages: kernel function, which is more flexible; fits nonlinear relationships well; data can be mapped to a high-dimensional space. disadvantages: high computational cost and large amount of computation
DT | criterion=‘squared_error’ | https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor | advantages: easy to understand and explain; easy to implement; insensitive to missing values. disadvantages: prone to overfitting
SGD | loss=‘squared_error’, penalty=‘l2’, alpha=0.0001, max_iter=1000, tol=0.001 | https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor | advantages: fast running speed. disadvantages: poor convergence performance; may reach a local minimum; accuracy is not high
GCN | hidden layer nodes=6 | | advantages: suitable for nodes and graphs of any topology. disadvantages: all neighbor nodes are assigned the same weight; completely dependent on the graph structure
GAT | hidden layer nodes=6 | | advantages: using the attention mechanism, different weights can be assigned to different neighbor nodes; not completely dependent on the graph structure. disadvantages: when the neighborhoods highly overlap, many redundant computations are involved
Region | Baoji | Jinzhong | Linfen | Luoyang | Lvliang | Sanmenxia | Tongchuan | Weinan | Xi’an | Xianyang | Yuncheng |
---|---|---|---|---|---|---|---|---|---|---|---|
Number of Features | 126 | 215 | 195 | 328 | 79 | 232 | 111 | 218 | 278 | 202 | 309 |
CITY | MLR | GPR | BPNN | SVR | KRR | DT | SGD | GCN | GAT | LsOA-KELM | FC-LsOA-KELM-- | FC-LsOA-KELM
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baoji | 18.43% | 20.17% | 15.01% | 36.08% | 18.08% | 21.42% | 27.51% | 46.33% | 35.05% | 18.52% | 10.99% | 10.96% |
Jinzhong | 29.28% | 46.11% | 23.71% | 36.53% | 36.53% | 30.27% | 34.71% | 50.20% | 48.77% | 29.78% | 20.48% | 20.09% |
Linfen | 38.40% | 73.25% | 40.18% | 140.49% | 40.37% | 34.52% | 59.60% | 52.49% | 79.49% | 96.91% | 23.79% | 22.00% |
Luoyang | 31.09% | 40.56% | 31.79% | 95.34% | 36.67% | 29.74% | 44.18% | 69.16% | 82.85% | 32.58% | 20.91% | 20.83% |
Lvliang | 72.50% | 97.01% | 61.00% | 297.32% | 76.58% | 38.91% | 85.12% | 57.07% | 75.03% | 76.66% | 34.53% | 31.71% |
Sanmenxia | 32.28% | 44.91% | 25.61% | 103.66% | 35.37% | 36.45% | 39.43% | 36.86% | 50.31% | 31.79% | 20.79% | 20.66% |
Tongchuan | 24.97% | 28.82% | 19.37% | 42.41% | 27.23% | 29.80% | 30.19% | 46.30% | 38.13% | 24.93% | 17.12% | 16.77% |
Weinan | 45.83% | 49.57% | 33.18% | 175.69% | 50.94% | 41.61% | 61.73% | 60.27% | 88.39% | 42.19% | 22.00% | 21.63% |
Xi’an | 32.58% | 41.42% | 21.21% | 58.56% | 32.33% | 25.75% | 57.70% | 54.91% | 83.77% | 20.82% | 17.04% | 16.33% |
Xianyang | 39.36% | 48.00% | 38.23% | 121.86% | 49.11% | 30.35% | 66.77% | 52.84% | 66.73% | 37.68% | 20.46% | 20.38% |
Yuncheng | 21.75% | 31.47% | 19.27% | 55.65% | 25.23% | 21.25% | 27.95% | 36.46% | 33.42% | 23.09% | 14.69% | 14.56% |
Average | 35.13% (5) | 47.39% (8) | 29.87% (3) | 105.78% (12) | 38.95% (6) | 30.92% (4) | 48.63% (9) | 51.17% (10) | 61.99% (11) | 39.54% (7) | 20.25% (2) | 19.63% (1) |
CITY | MLR | GPR | BPNN | SVR | KRR | DT | SGD | GCN | GAT | LsOA-KELM | FC-LsOA-KELM-- | FC-LsOA-KELM
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baoji | 6.0387 | 6.2340 | 4.9951 | 7.3118 | 5.9949 | 7.6410 | 7.1130 | 13.7306 | 11.0259 | 6.0634 | 3.8903 | 3.8899 |
Jinzhong | 6.7307 | 7.9415 | 6.4126 | 7.1433 | 7.1433 | 9.5298 | 7.5211 | 11.4633 | 10.4303 | 6.6280 | 5.4699 | 5.4377 |
Linfen | 7.6286 | 10.5295 | 7.4877 | 13.7167 | 7.9837 | 9.3066 | 8.8934 | 10.6429 | 12.3162 | 12.5136 | 5.8552 | 5.7987 |
Luoyang | 5.7315 | 6.2975 | 5.2057 | 9.2403 | 5.8818 | 7.6821 | 6.7448 | 11.8725 | 13.6997 | 5.8018 | 4.0487 | 4.0454 |
Lvliang | 4.3497 | 6.5708 | 5.8295 | 14.0666 | 5.6420 | 10.6836 | 4.6932 | 9.9392 | 8.6261 | 6.9900 | 6.3899 | 6.3164 |
Sanmenxia | 6.5610 | 7.6068 | 6.3408 | 10.4989 | 6.8107 | 9.3511 | 7.3729 | 10.3181 | 9.3824 | 6.4893 | 5.2770 | 5.2595 |
Tongchuan | 6.7897 | 7.1010 | 6.5860 | 8.2193 | 6.8189 | 9.8223 | 7.3982 | 14.8928 | 11.2797 | 6.7131 | 5.2050 | 5.1898 |
Weinan | 6.8271 | 7.3700 | 6.0154 | 12.9803 | 7.1069 | 8.1725 | 7.6357 | 9.6211 | 11.6963 | 6.4087 | 4.8395 | 4.8250 |
Xi’an | 5.0065 | 5.4275 | 4.1472 | 6.8488 | 4.9635 | 6.1154 | 6.6992 | 10.0322 | 11.8799 | 3.8726 | 2.9850 | 2.9415 |
Xianyang | 5.8522 | 6.2894 | 5.6366 | 11.0173 | 6.1698 | 7.3127 | 7.8243 | 9.1366 | 8.9809 | 5.8797 | 3.9272 | 3.9186 |
Yuncheng | 7.3365 | 8.4255 | 6.9595 | 10.1973 | 7.4415 | 8.4045 | 8.3762 | 14.1279 | 12.7037 | 7.3362 | 5.2392 | 5.2369 |
Average | 6.2593 (4) | 7.2540 (7) | 5.9651 (3) | 10.1128 (10) | 6.5415 (5) | 8.5474 (9) | 7.2975 (8) | 11.4343 (12) | 11.0928 (11) | 6.7906 (6) | 4.8297 (2) | 4.8054 (1) |
CITY | MLR | GPR | BPNN | SVR | KRR | DT | SGD | GCN | GAT | LsOA-KELM | FC-LsOA-KELM-- | FC-LsOA-KELM
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baoji | 0.1191 | 0.1269 | 0.0815 | 0.1746 | 0.1174 | 0.1907 | 0.1653 | 0.6158 | 0.3971 | 0.1201 | 0.0494 | 0.0494 |
Jinzhong | 0.1123 | 0.1563 | 0.1019 | 0.1265 | 0.1265 | 0.2251 | 0.1402 | 0.3258 | 0.2697 | 0.1089 | 0.0742 | 0.0733 |
Linfen | 0.1332 | 0.2538 | 0.1284 | 0.4308 | 0.1459 | 0.1983 | 0.1811 | 0.2593 | 0.3473 | 0.3585 | 0.0785 | 0.0770 |
Luoyang | 0.0814 | 0.0983 | 0.0672 | 0.2116 | 0.0857 | 0.1462 | 0.1127 | 0.3493 | 0.4651 | 0.0834 | 0.0406 | 0.0406 |
Lvliang | 0.0434 | 0.0991 | 0.0780 | 0.4543 | 0.0731 | 0.2621 | 0.0506 | 0.2268 | 0.1708 | 0.1122 | 0.0937 | 0.0916 |
Sanmenxia | 0.1292 | 0.1737 | 0.1207 | 0.3309 | 0.1392 | 0.2625 | 0.1632 | 0.3196 | 0.2642 | 0.1264 | 0.0836 | 0.0830 |
Tongchuan | 0.1152 | 0.1260 | 0.1084 | 0.1688 | 0.1162 | 0.2410 | 0.1367 | 0.5541 | 0.3178 | 0.1126 | 0.0677 | 0.0673 |
Weinan | 0.1380 | 0.1608 | 0.1071 | 0.4989 | 0.1496 | 0.1978 | 0.1726 | 0.2741 | 0.4051 | 0.1216 | 0.0693 | 0.0689 |
Xi’an | 0.1197 | 0.1407 | 0.0821 | 0.2240 | 0.1177 | 0.1786 | 0.2143 | 0.4807 | 0.6740 | 0.0716 | 0.0426 | 0.0413 |
Xianyang | 0.1173 | 0.1355 | 0.1089 | 0.4159 | 0.1304 | 0.1832 | 0.2098 | 0.2860 | 0.2764 | 0.1184 | 0.0528 | 0.0526 |
Yuncheng | 0.1076 | 0.1419 | 0.0968 | 0.2079 | 0.1107 | 0.1412 | 0.1403 | 0.3991 | 0.3227 | 0.1076 | 0.0549 | 0.0548 |
Average | 0.1106 (4) | 0.1467 (7) | 0.0983 (3) | 0.2949 (10) | 0.1193 (5) | 0.2024 (9) | 0.1533 (8) | 0.3719 (12) | 0.3555 (11) | 0.1310 (6) | 0.0643 (2) | 0.0636 (1) |
Threshold | Baoji | Jinzhong | Linfen | Luoyang | Lvliang | Sanmenxia | Tongchuan | Weinan | Xi’an | Xianyang | Yuncheng
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 617 | 2066 | 1553 | 3043 | 365 | 1354 | 589 | 972 | 1356 | 901 | 2737 |
0.6 | 126 | 215 | 195 | 328 | 79 | 232 | 111 | 218 | 278 | 202 | 309 |
0.7 | 18 | 23 | 29 | 38 | 6 | 27 | 18 | 31 | 57 | 28 | 34 |
NO. | Method | Friedman Mean Rank (MAPE) | Friedman Mean Rank (RMSE) | Friedman Mean Rank (Third Indicator)
---|---|---|---|---
1 | Multiple Linear Regression (MLR) | 5.18 | 4.45 | 4.45
2 | Gaussian Process Regression (GPR) | 8.45 | 7.55 | 7.55
3 | Back Propagation Neural Network (BPNN) | 3.82 | 3.18 | 3.18
4 | Support Vector Regression (SVR) | 11.41 | 10.41 | 10.41
5 | Kernel Ridge Regression (KRR) | 6.77 | 5.41 | 5.41
6 | Decision Tree (DT) | 4.91 | 9.00 | 9.00
7 | Stochastic Gradient Descent (SGD) | 9.09 | 7.27 | 7.27
8 | Graph Convolutional Network (GCN) | 9.36 | 11.00 | 11.00
9 | Graph Attention Network (GAT) | 10.27 | 10.73 | 10.73
10 | LsOA-KELM | 5.73 | 5.27 | 5.27
11 | FC-LsOA-KELM-- | 2.00 | 2.36 | 2.36
12 | FC-LsOA-KELM | 1.00 | 1.36 | 1.36
FC-LsOA-KELM vs. | Rank | z-Value | p-Value | α/i (0.05) | α/i (0.1)
---|---|---|---|---|---|
MLR | 5.18 | −2.934 | 0.00335 | 0.00455 | 0.00909 |
GPR | 8.45 | −2.934 | 0.00335 | 0.005 | 0.01 |
BPNN | 3.82 | −2.934 | 0.00335 | 0.00556 | 0.01111 |
SVR | 11.41 | −2.934 | 0.00335 | 0.00625 | 0.0125 |
KRR | 6.77 | −2.934 | 0.00335 | 0.00714 | 0.01429 |
DT | 4.91 | −2.934 | 0.00335 | 0.00833 | 0.01667 |
SGD | 9.09 | −2.934 | 0.00335 | 0.01 | 0.02 |
GCN | 9.36 | −2.934 | 0.00335 | 0.0125 | 0.025 |
GAT | 10.27 | −2.934 | 0.00335 | 0.01667 | 0.03333 |
LsOA-KELM | 5.73 | −2.934 | 0.00335 | 0.025 | 0.05 |
FC-LsOA-KELM-- | 2.00 | −2.934 | 0.00335 | 0.05 | 0.1 |
FC-LsOA-KELM vs. | Rank | z-Value | p-Value | α/i (0.05) | α/i (0.1)
---|---|---|---|---|---|
GPR | 7.55 | −2.934 | 0.00335 | 0.00455 | 0.00909 |
SVR | 10.41 | −2.934 | 0.00335 | 0.005 | 0.01 |
DT | 9.00 | −2.934 | 0.00335 | 0.00556 | 0.01111 |
GCN | 11.00 | −2.934 | 0.00335 | 0.00625 | 0.0125 |
GAT | 10.73 | −2.934 | 0.00335 | 0.00714 | 0.01429 |
LsOA-KELM | 5.27 | −2.934 | 0.00335 | 0.00833 | 0.01667 |
FC-LsOA-KELM-- | 2.36 | −2.934 | 0.00335 | 0.01 | 0.02 |
BPNN | 3.18 | −2.845 | 0.00444 | 0.0125 | 0.025 |
KRR | 5.41 | −2.845 | 0.00444 | 0.01667 | 0.03333 |
SGD | 7.27 | −2.845 | 0.00444 | 0.025 | 0.05 |
MLR | 4.45 | −2.934 | 0.02080 | 0.05 | 0.1 |
FC-LsOA-KELM vs. | Rank | z-Value | p-Value | α/i (0.05) | α/i (0.1)
---|---|---|---|---|---|
GPR | 7.55 | −2.934 | 0.00335 | 0.00455 | 0.00909 |
SVR | 10.41 | −2.934 | 0.00335 | 0.005 | 0.01 |
DT | 9.00 | −2.934 | 0.00335 | 0.00556 | 0.01111 |
GCN | 11.00 | −2.934 | 0.00335 | 0.00625 | 0.0125 |
GAT | 10.73 | −2.934 | 0.00335 | 0.00714 | 0.01429 |
LsOA-KELM | 5.27 | −2.934 | 0.00335 | 0.00833 | 0.01667 |
FC-LsOA-KELM-- | 2.36 | −2.934 | 0.00335 | 0.01 | 0.02 |
BPNN | 3.18 | −2.845 | 0.00444 | 0.0125 | 0.025 |
KRR | 5.41 | −2.845 | 0.00444 | 0.01667 | 0.03333 |
SGD | 7.27 | −2.845 | 0.00444 | 0.025 | 0.05 |
MLR | 4.45 | −2.490 | 0.01279 | 0.05 | 0.1 |