Predicting Benzene Concentration Using Machine Learning and Time Series Algorithms

Abstract: Benzene is a pollutant that is very harmful to human health, so models are needed to predict its concentration and its relationship with other air pollutants. The data collected by eight stations in Madrid (Spain) over nine years were analyzed using the following regression-based machine learning models: multivariate linear regression (MLR), multivariate adaptive regression splines (MARS), multilayer perceptron neural network (MLP), support vector machines (SVM), autoregressive integrated moving-average (ARIMA) and vector autoregressive moving-average (VARMA) models. Benzene concentration predictions were made from the concentrations of four environmental pollutants: nitrogen dioxide (NO2), nitrogen oxides (NOx), particulate matter (PM10) and toluene (C7H8), and the performance measures of the proposed models were compared. In general, regression-based machine learning models are more effective at prediction than time series models.


Introduction
Volatile organic compounds (VOCs) are compounds with very high volatility which are in a gaseous state under normal conditions. They include a variety of chemical species, among which are benzene, toluene, ethylbenzene and xylene [1], known collectively as BTEX. All of them are air pollutants, along with other compounds such as SO2, NO2, NOx, PM10, PM2.5 and CO [2].
Benzene is a hydrocarbon with a ring structure made up of six carbon atoms. It is one of the products that is most used as a raw material in industrial processes in the organic chemical industry. Its usual state is liquid, highly flammable, and colorless, although it does have a very characteristic odor and very high toxicity [3].
Benzene has been used in the manufacture of several chemical products such as styrene, phenols, nylon and synthetic fibers, maleic anhydride, pharmaceuticals, detergents, dyes and explosives. It has also been used in fuel, where it is added as a lead substitute, and as a chemical reagent and extraction agent for other chemical compounds [3]. Derivatives of benzene are used primarily as solvents and diluents in the manufacture of perfumes and as intermediates in the production of dyes, although their most important industrial use is as solvents and paint thinners [4,5]. Exposure to benzene, fine particles, sulfur dioxide and nitrogen dioxide is a significant environmental risk factor for mortality [23].
There are different limits established for exposure to benzene. The European Commission has established a concentration limit for benzene of 5 µg/m³ as an annual average in Directive 2008/50/EC, the Directive on ambient air quality [24].
Regarding occupational exposure limits [13], a TLV-TWA of 0.5 ppm (the threshold limit value considered as the time-weighted average for an 8 h day and a 40 h week) and a STEL of 2.5 ppm (short-term exposure limit for periods of 15 min that should not be exceeded at any time) have been established.
OSHA regulates benzene levels in the workplace. The maximum amount of benzene in the air must not exceed 1 ppm during an 8 h day, 40 h a week [5] (Directive 97/42/EC, amendment of 90/394/EEC).
Environmental pollution processes are complex and direct measurement is not always possible. In addition, it is difficult to carry out an analysis to discover the source, the propagation and distribution models and to make predictions over time. Therefore, since the reduction of such pollution is an objective of the European Union, it is necessary to find methods that make it possible to model its behavior and predict its evolution.
In the case of pollution concentration modeling, two methodologies are used to predict air quality: physical models and models based on statistical data. Physical models are based on chemical dispersion and transport models, in which the input variables are often related to parameters obtained from meteorological observations, such as air temperature, dew point, relative humidity, atmospheric pressure, wind speed and direction, and UV radiation; to parameters related to the terrain, such as vegetation and local topography; and to parameters related to emission sources that represent changes in traffic patterns, industrial activity and the distance to these sources. Statistical machine learning models are based on periodic parameters and do not need to model the underlying physical phenomena. Models based on statistical parameters achieve better predictions than models that use only meteorological variables as input [25].
The objective of this study was to predict the concentration of benzene from other pollutants, which were collected daily by several measurement stations of the Community of Madrid (Spain) during the period indicated in Table 1 for each station. This constitutes voluminous information that allows mathematical modeling and statistical learning to explain the dependency among the main pollutants in a geographic area based on the concentrations of other air pollutants at each station, as well as on the environmental pollution existing at that time at other stations. To make this prediction, time series forecasting techniques such as vector autoregressive moving-average (VARMA) and autoregressive integrated moving-average (ARIMA) models were applied and the results compared with those obtained with machine learning methodologies such as artificial neural networks (ANNs), support vector machines (SVMs), multivariate linear regression (MLR) and multivariate adaptive regression splines (MARS).
In the existing literature there are several success stories of applying these methodologies as tools for predicting the contamination and dispersion of air pollutants such as SO2, NO2, NOx, PM10, PM2.5, O3 or CO.
Many of these articles analyze different models, compare the results and determine which method makes the best predictions for the pollutants studied [40,41].
A previous study [28] reviewed 139 articles on pollutants examined with ANN models. The results indicate that the most analyzed pollutants are particulate matter (PM10, PM2.5), in 62% of the articles, followed by nitrogen oxides (NO, NO2 and NOx) in 36% and O3 in 31%; SO2 and CO are modeled in 13% and 16% of the articles, respectively. Almost a third of them include the study of several pollutants, but only one article [42] has studied an ANN model for predicting benzene.
Therefore, models for the prediction of benzene have hardly been developed, and even less so to study the relationship between benzene and the measurements of other pollutants obtained at the stations. The methods used here are data-oriented, so there is no need to make strong physical-chemical assumptions during the modeling process; they have the capacity to handle large amounts of data and can be applied to different tasks, such as studying the relationship between benzene and the remaining pollutants and predicting its concentration.

The Database
In this study, observed data from various pollutants were used: four predictor variables, nitrogen dioxide (NO2), nitrogen oxides (NOx), particulate matter with a diameter of less than 10 µm (PM10) and toluene (C7H8), were selected to build the benzene (C6H6) prediction models.
These pollutants were measured hourly, every day, by eight remote measurement stations in the Community of Madrid (Spain) during the periods indicated in Table 1. Figure 1 shows the geographical location of the stations used in this study on the map of Madrid. Before carrying out the training phase of the models, an imputation of the data missing due to breakdowns and maintenance of the stations was carried out through multiple imputation by chained equations [43].
The mean and standard deviation of the pollutants under study are shown in Table 2.
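As an illustration of the imputation step, the following Python sketch implements a much-simplified, single-chain version of imputation by chained equations; the study itself used the full MICE procedure [43], so the function name, the single-chain simplification and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Simplified single-chain sketch of imputation by chained equations:
    each variable with missing values is iteratively regressed (OLS) on the
    remaining variables and its missing entries replaced by fitted values."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: initialize every missing entry with its column mean.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    # Step 2: cycle over incomplete columns, refitting a regression each pass.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]                    # rows where column j is observed
            others = np.delete(X, j, axis=1)        # all remaining columns
            A = np.column_stack([np.ones(obs.sum()), others[obs]])
            beta, *_ = np.linalg.lstsq(A, X[obs, j], rcond=None)
            pred = np.column_stack([np.ones((~obs).sum()), others[~obs]]) @ beta
            X[~obs, j] = pred                       # replace only the missing entries
    return X
```

A full MICE implementation would add a stochastic component and pool several imputed datasets; this sketch only conveys the chained-regression idea.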

Multivariate Linear Regression
Linear regression with several independent variables is known as multivariate or multiple linear regression (MLR). It is a generalization of simple regression in which a dependent variable designated by Y is represented by a linear combination of k other independent variables, denoted by X1, X2, . . . , Xk.
The MLR model assumes a relationship of the type Y = β0 + β1X1 + β2X2 + . . . + βkXk + ε, where β0, β1, β2, . . . , βk are (k + 1) parameters called regression coefficients and ε is the stochastic error associated with the regression. The regression parameters β are estimated in order to determine the best hyperplane among all those of the form ŷ = β̂0 + β̂1x1 + β̂2x2 + . . . + β̂kxk. A set of n observations is available for each of the dependent and independent variables [44,45]. In order to make predictions about the behavior of the variable Y, it is first necessary to obtain the estimates β̂0, β̂1, β̂2, . . . , β̂k of the model parameters by the ordinary least squares method, whose objective is to find the hyperplane that minimizes the sum of squared errors (SSE) between the observed and predicted responses, SSE = Σⱼ₌₁ⁿ (yⱼ − ŷⱼ)², where yⱼ is the outcome and ŷⱼ is the model prediction. To minimize the SSE, it is differentiated with respect to the parameters β̂0, β̂1, β̂2, . . . , β̂k; equating the derivatives to zero gives a system of equations called the normal equations.
In matrix form, the MLR model involves finding the least-squares solution of the linear system Xβ = Y, where X is the n × (k + 1) input data matrix with each row an input vector (with a 1 in the first position), Xᵗ is its transpose, Y is the (n × 1) vector of outputs in the training set and β̂ is the ((k + 1) × 1) parameter vector. The residual sum of squares is now (Y − Xβ̂)ᵗ(Y − Xβ̂); to minimize the SSE, the derivative ∂SSE/∂β̂ = −2Xᵗ(Y − Xβ̂) is set equal to zero (assuming XᵗX has full rank and is, hence, positive definite), which yields β̂ = (XᵗX)⁻¹XᵗY. The fitted values at the training inputs are Ŷ = Xβ̂ = X(XᵗX)⁻¹XᵗY. Before modeling, stepwise regression [45] was used to choose a significant subset of independent variables for every station. This method selects the predictors by an automatic iterative forward procedure, dropping and including variables, and stops when the lowest Akaike information criterion (AIC) is reached. The models were built based on the adjusted R² when all variables were significant at the 5% level and were validated by the k-fold cross-validation method, which is explained later.
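The normal-equations estimation can be sketched numerically; the following Python/NumPy fragment solves β̂ = (XᵗX)⁻¹XᵗY on synthetic data (all coefficient values and dimensions are illustrative, not from the paper).

```python
import numpy as np

# Sketch: estimate MLR coefficients via the normal equations on synthetic data.
rng = np.random.default_rng(0)
n, k = 200, 4                                   # n observations, k predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # 1s column = intercept
beta_true = np.array([2.0, 1.0, -0.5, 0.3, 0.0])            # illustrative values
Y = X @ beta_true + 0.01 * rng.normal(size=n)               # small Gaussian error

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # normal equations (X^T X) b = X^T Y
Y_fit = X @ beta_hat                            # fitted values X(X^T X)^{-1} X^T Y
sse = np.sum((Y - Y_fit) ** 2)                  # the minimized SSE
```

In practice `np.linalg.solve` (or a QR decomposition) is preferred over explicitly inverting XᵗX for numerical stability.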

Multivariate Adaptive Regression Splines
Multivariate adaptive regression splines (MARS) is a nonparametric statistical method developed by Friedman [46]. It is a generalization of recursive partitioning regression (RPR) that can consider complex relationships between a set of k independent predictor variables, denoted by X1, X2, . . . , Xk, and a dependent variable designated by Y, and it makes no starting assumptions about the type of functional relationship between the input and output variables. The MARS model is defined as [46-48] ŷ = β0 + Σₘ₌₁ᴹ βₘ hₘ(X) + ε, where X is the vector of independent variables, β0 is the intercept parameter, the βₘ are the coefficients of the basis functions, M is the total number of basis functions in the model, hₘ(X) are the spline basis functions (which may involve interactions between variables) and ε is the fitting error.
To approximate the non-linear relationships between the independent variables X and the response variable Y, basis functions (BF) are used. They consist of a single spline function or the product of two or more spline functions for different predictors. Spline functions are piecewise linear functions, that is, truncated left-hand and right-hand functions, and take the form of hinge functions that are joined together smoothly at the knots.
Each spline basis function takes the form of the reflected pair max(0, X − t) and max(0, t − X), where t is a constant called a knot that specifies the boundary between regions; the pieces have continuity and join smoothly at the knot, which is adaptively selected from the data. The "+" sign refers to the positive part and indicates a value of zero for negative values of the argument. As stated in [49], MARS forms reflected pairs for each predictor variable Xj, j ∈ {1, . . . , k}, with knots at each observed value xᵢⱼ, i ∈ {1, . . . , n}, of that variable, where n is the sample size. The set of all possible pairs with the corresponding knots can be expressed as C = {max(0, Xj − t), max(0, t − Xj) : t ∈ {x₁ⱼ, x₂ⱼ, . . . , xₙⱼ}, j = 1, . . . , k}. An adaptive regression algorithm with a recursive partitioning strategy automatically selects the locations of the knots or breakpoints through a two-stage process: forward-stepwise selection and backward-stepwise elimination [50].
The first step, also called the construction phase, begins with the intercept and then sequentially adds to the model the term that best improves the fit; that is, the one producing the maximum reduction in the residual sum of squares. The search for the best combination of variable and knot is done iteratively. Given a current model with M basis functions, the pair added to the model has the form βₘ₊₁ hₘ(X) max(0, Xj − t) + βₘ₊₂ hₘ(X) max(0, t − Xj), where hₘ is a basis function already in the model. The β coefficients are estimated using the least-squares method [44].
This process continues until a predetermined maximum number of basis functions (Mmax) is reached or R² changes by less than a threshold [50]. A large number of BFs are added one after another, creating an overfitted model. Generally, the maximum number of BFs is two to four times the number of predictor variables [46].
The second step, also called the pruning phase, begins with the full model and simplifies it by eliminating terms by applying a backward procedure to avoid oversizing. MARS identifies the basis functions that contribute the least to the model and removes the least significant terms sequentially. The final model is chosen using the generalized cross-validation method (GCV) [47], which is an adjustment of the sum-of-squares of the residuals and penalizes the complexity of the models by the number of basis functions and the number of knots.
The GCV criterion is defined as GCV = (1/N) Σᵢ₌₁ᴺ (yᵢ − f̂(xᵢ))² / (1 − C(M)/N)², with C(M) = M + d(M − 1)/2, where M is the number of BFs, N is the number of data points, f̂(xᵢ) denotes the values predicted by MARS and d is a penalty for each basis function, which takes the value of two for an additive model and three for an interaction model [47]. The term (M − 1)/2 is the number of hinge-function knots [44]. Note that MARS is a trademark and is used in this document as an acronym for multivariate adaptive regression splines.
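As an illustration, the GCV criterion above can be computed for a toy model with a single reflected hinge pair (Python/NumPy sketch; the synthetic data, knot location and penalty value are illustrative assumptions, not taken from the paper).

```python
import numpy as np

def gcv(y, y_hat, M, d=3):
    """GCV criterion used in the MARS pruning phase:
    GCV = (1/N) * SSE / (1 - C(M)/N)^2, with effective number of
    parameters C(M) = M + d*(M - 1)/2 (d = 3 for an interaction model)."""
    N = len(y)
    sse = np.sum((y - y_hat) ** 2)
    c = M + d * (M - 1) / 2
    return (sse / N) / (1 - c / N) ** 2

# Tiny illustration: fit one reflected hinge pair max(0, x - t), max(0, t - x)
# by least squares and score the fit with GCV.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = np.where(x > 5, x - 5, 0.0) + 0.1 * rng.normal(size=100)  # kink at x = 5
t = 5.0                                                        # assumed knot
B = np.column_stack([np.ones(100), np.maximum(0, x - t), np.maximum(0, t - x)])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
score = gcv(y, B @ beta, M=3)
```

The pruning phase would compare such scores across candidate sub-models and keep the one with the lowest GCV.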
The MARS model was made with the parameters degree = 9 and thresh = 1 × 10 −8 . These parameters are explained in [50].

Artificial Neural Networks
The artificial neural network (ANN) is a nonlinear and nonparametric statistical method designed to simulate the data processing of brain neurons [51]. The ANN makes no prior assumption about the model-building process or the relationship between input and output variables. Several studies [52,53] have indicated that the multilayer perceptron (MLP) is the most suitable and widely used class of neural network for modeling atmospheric pollutants. The MLP is the most common neural network and is extensively covered in the literature [51,54,55].
The MLP network is divided into several layers: an input layer that has as many neurons as the model has independent variables, k, a single hidden intermediate layer of neurons, H, and an output layer with as many neurons as the model has dependent variables, S.
A previous study [56] showed that a single hidden layer with a finite number of units is generally sufficient to fit any piecewise continuous function. In the existing literature [57], several proposals have been made to estimate the number of neurons; here, the number of hidden units H was determined by trial and error, training various networks and estimating the corresponding errors on the test data set. If the hidden layer has too few neurons, large training errors occur due to insufficient fit or because the network cannot converge during training. In contrast, too many hidden-layer neurons lead to low training error but, due to overfitting, the network memorizes the dataset, causing high test error and poor generalization [58].
Input variables are mapped by functions, called activation functions, into the intermediate variables of the hidden layer, which are in turn mapped to the output variables. The MLP utilizes transformations of the form U = φ(XW⁽¹⁾) and Ŷ = ψ(UW⁽²⁾) [40], where X = (x1, x2, . . . , xk) represents the inputs, U = (u1, u2, . . . , uH) is the output of the hidden layer and, although in this case the output layer has one dependent variable ŷ, for generality Ŷ = (ŷ1, ŷ2, . . . , ŷS) is considered as the outputs of the network; W indicates the weight matrix between two layers. φ(·) and ψ(·) are the transfer functions of the hidden layer and the output layer, respectively. Both are logistic functions that map the output of each neuron into the interval [0, 1].
All inputs were normalized as zᵢ = (xᵢ − x̄)/σₓ, where x̄ is the mean and σₓ the standard deviation of the observation dataset for each variable.
Neural networks are fitted to data using learning algorithms through an iterative training process. In this case, a supervised learning algorithm was used, characterized by comparing a target output with the predicted output and adjusting the weights accordingly. All weights and biases were initialized with random values drawn from a standard normal distribution. The main steps of this iterative training procedure are as follows. First, forward propagation of a vector of input variables is performed through the whole MLP network, which ultimately calculates an output Ŷ for inputs X and the current weights.
The input layer is just an information-receiving layer that receives the vectors of the input variables and redistributes them to the neurons in the hidden layer. This layer does not perform any type of processing on the data.
In the hidden layer, each neuron multiplies all inputs by its weights and sums them, taking the bias into account; the result is transformed by the activation function φ. The output of the j-th of the H hidden neurons is uⱼ = φ(w₀ⱼ + Σᵢ₌₁ᵏ wᵢⱼ xᵢ), where wᵢⱼ is the weight connecting the i-th input with the j-th hidden node, w₀ⱼ is a limit value known as the threshold or bias, i = 1, 2, . . . , k and j = 1, 2, . . . , H. The activation function φ used in this paper is a non-linear, differentiable, non-decreasing bounded function, namely the logistic function φ(z) = 1/(1 + e⁻ᶻ). In the output layer, all the outputs of the hidden layer are likewise multiplied by weights and summed, taking the bias into account, and the result is transformed by the activation function ψ. The output of the m-th output neuron is ŷₘ = ψ(w₀ₘ + Σⱼ₌₁ᴴ wⱼₘ uⱼ), where wⱼₘ is the weight connecting the j-th hidden output with the m-th output node, w₀ₘ is the bias, m = 1, 2, . . . , S and j = 1, 2, . . . , H. The activation function used is again the logistic function.
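The two layer transformations above can be sketched as a single forward pass (Python/NumPy; the weights are random illustrative values, and the logistic function is used at both layers as in the description above, although the final regression models in the paper used a linear output).

```python
import numpy as np

def logistic(z):
    """Logistic activation f(z) = 1/(1 + e^{-z}), mapping into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """One forward pass of the MLP described above:
    hidden layer u_j = phi(w_0j + sum_i w_ij x_i),
    output layer y_m = psi(w_0m + sum_j w_jm u_j).
    Shapes: x (k,), W1 (k, H), b1 (H,), W2 (H, S), b2 (S,)."""
    u = logistic(x @ W1 + b1)       # hidden-layer outputs
    y = logistic(u @ W2 + b2)       # output-layer outputs, each in (0, 1)
    return y

# Illustrative dimensions matching the paper: k = 4 inputs, H = 8 hidden, S = 1 output.
rng = np.random.default_rng(2)
k, H, S = 4, 8, 1
W1, b1 = rng.standard_normal((k, H)), rng.standard_normal(H)  # random init
W2, b2 = rng.standard_normal((H, S)), rng.standard_normal(S)
y = mlp_forward(rng.standard_normal(k), W1, b1, W2, b2)
```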
During the training process, the predicted output ŷ will differ from the observed output y. An error function E = (1/2) Σₜ₌₁ⁿ Σₘ₌₁ˢ (ŷₜₘ − yₜₘ)² is calculated as the sum of squared errors (SSE), where t = 1, . . . , n indexes the observations (input-output pairs) and m = 1, . . . , S the output nodes.
During the backward phase, the error is propagated backward through the network, but the neurons in the hidden layer only receive a part of the total error signal, which depends on the relative contribution that each neuron has made to the feed forward output. This whole forward and backward process is repeated for several iterations and stops when a given threshold is reached by all absolute partial derivatives (∂E/∂w) of the error function with respect to the weights [59].
In this paper the Rprop algorithm was used, a resilient backpropagation with weight backtracking [59-61], which modifies each weight by an adaptive step size to find a local minimum of the error function, so that when ∂E/∂w < 0 the weight is increased and when ∂E/∂w > 0 the weight is reduced.
The main difference between Rprop and the backpropagation algorithm is that only the sign of the gradient is used to update the weights. In this way convergence is accelerated.
The update at iteration t of the gradient descent is Δwᵢⱼ(t) = −sign(∂E/∂wᵢⱼ(t)) · ηᵢⱼ(t), where the wᵢⱼ are the weights and ηᵢⱼ is the step size (learning rate), which is increased if the corresponding partial derivative keeps its sign and decreased if it changes sign [60]: ηᵢⱼ(t) = α · ηᵢⱼ(t − 1) if (∂E/∂wᵢⱼ(t − 1)) · (∂E/∂wᵢⱼ(t)) > 0 and ηᵢⱼ(t) = β · ηᵢⱼ(t − 1) if that product is negative, where α > 1 > β, usually α = 1.2, β = 0.5.
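The sign-based update can be sketched for a single weight as follows (Python; the step-size bounds are conventional Rprop defaults and are assumptions, not values from the paper).

```python
def rprop_step(grad, prev_grad, step, alpha=1.2, beta=0.5,
               step_min=1e-6, step_max=50.0):
    """One Rprop update for a single weight (sketch).
    Only the SIGN of the gradient is used: the step size grows by alpha
    while the sign of dE/dw is unchanged and shrinks by beta when it flips.
    Returns (weight change, new step size)."""
    if grad * prev_grad > 0:
        step = min(step * alpha, step_max)   # same sign: accelerate
    elif grad * prev_grad < 0:
        step = max(step * beta, step_min)    # sign flip: overshoot, slow down
    if grad > 0:
        delta = -step    # dE/dw > 0: reduce the weight
    elif grad < 0:
        delta = +step    # dE/dw < 0: increase the weight
    else:
        delta = 0.0
    return delta, step
```

The full algorithm with weight backtracking additionally undoes the previous update after a sign change; that refinement is omitted here for brevity.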
In this study, the neuralnet package [59] for R was used. The number of neurons was 8 for all stations. To determine this parameter, a randomly chosen station was used and the value was determined by trial and error with k-fold cross-validation (k = 5). Some studies, such as [53], suggest that the number of neurons H could be the number of input variables plus the number of output variables, with a maximum of twice the number of input neurons in the hidden layer, although [53,58] indicate that there is no rule of thumb to determine this parameter, so an iterative approach is required.
The criterion used to stop the algorithm was a threshold of 1 for the partial derivatives of the error function, with a maximum of 10⁹ steps. A detailed explanation of the parameters of the computer program is given in reference [59]. The logistic function was chosen as the activation function instead of other commonly used functions such as the hyperbolic tangent [44,62] because the dependent variable takes positive values, given that it is a measure of benzene concentration. The input variables were normalized as indicated. As this is a regression-based model, the output activation was fixed as linear.

Support Vector Machines
The support vector machine (SVM) is a nonparametric machine learning method developed by Vapnik [63] for both classification and regression. In this paper, the method was used for regression, that is, support vector regression (SVR). There are two basic versions of SVM regression, epsilon-SVR and nu-SVR, denoted by ε-SVR and ν-SVR, whose differences are discussed later.
The SVR task consists of transforming the training dataset T = {(x1, y1), . . . , (xn, yn)}, with yᵢ ∈ R and xᵢ ∈ Rᵏ, where n is the number of samples and k the dimension of the input data, into a higher-dimensional feature space F through a non-linear mapping Φ(x), where a linear regression can be performed.
The SVR model is defined as f(x) = w · Φ(x) + b, where b is the intercept of the model indicating the bias and w · Φ(x) is the dot (scalar) product of the weight vector w ∈ F and the mapped input.
The error function is defined in [63] by an ε-insensitive loss function, which defines a tube such that if the predicted value lies within the tube the loss is zero: Lε(y, f(x)) = 0 if |y − f(x)| ≤ ε, and |y − f(x)| − ε otherwise. To handle points that cannot satisfy these constraints, slack variables ξᵢ⁺, ξᵢ⁻ are introduced, depending on the position relative to the tube: above (ξᵢ⁺) or below (ξᵢ⁻). Therefore, as indicated in [63,64], the problem is stated as follows: minimize C Σᵢ₌₁ⁿ (ξᵢ⁺ + ξᵢ⁻) + (1/2)‖w‖², subject to yᵢ − w · Φ(xᵢ) − b ≤ ε + ξᵢ⁺, w · Φ(xᵢ) + b − yᵢ ≤ ε + ξᵢ⁻ and ξᵢ⁺, ξᵢ⁻ ≥ 0, where C > 0 is the cost parameter, whose function is to control the trade-off between the complexity of the model and the maximum tolerated deviation beyond ε. If the cost parameter is large, the model becomes more flexible, since the errors measured by the slack variables weigh more heavily. On the other hand, if C is small, the model will be tighter and less likely to overfit, since the effect of the norm of the weight vector is greater, which leads to better generalization [65]. The problem can be solved more simply in its dual formulation, which also makes it possible to extend the problem to nonlinear functions. Therefore, a standard dualization method using Lagrange multipliers is applied to the optimization problem, as described in [64,66], together with the Karush-Kuhn-Tucker (KKT) optimality conditions of the primal problem [67]. The kernel function K(xᵢ, xⱼ) returns the scalar product of a pair of points in a high-dimensional space without explicitly mapping the data [63].
The Lagrangian L is formed with multipliers αᵢ⁺ > 0, αᵢ⁻ > 0, ηᵢ⁺ > 0, ηᵢ⁻ > 0 for all i considered. A saddle point is found by setting the partial derivatives with respect to b, w, ξᵢ⁺ and ξᵢ⁻ to zero [63]. The resulting dual optimization problem is: maximize −(1/2) Σᵢ,ⱼ (αᵢ⁺ − αᵢ⁻)(αⱼ⁺ − αⱼ⁻) K(xᵢ, xⱼ) − ε Σᵢ (αᵢ⁺ + αᵢ⁻) + Σᵢ yᵢ (αᵢ⁺ − αᵢ⁻), subject to Σᵢ (αᵢ⁺ − αᵢ⁻) = 0 and αᵢ⁺, αᵢ⁻ ∈ [0, C] for all i, where K(xᵢ, xⱼ) is a kernel function satisfying the Mercer condition explained in [68], which can be written as K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ). The radial basis function (RBF) kernel is used in this paper, that is, K(xᵢ, xⱼ) = e^(−λ‖xᵢ − xⱼ‖²), where λ is a parameter that regulates the behavior of the kernel.
Finally, after solving the dual problem, the prediction function f(x) can be formulated in terms of the Lagrange multipliers and the kernel function as f(x) = Σᵢ₌₁ⁿ (αᵢ⁺ − αᵢ⁻) K(xᵢ, x) + b. ε-SVR uses the parameters C > 0 and ε > 0 to apply an optimization penalty to points that are not predicted correctly. As it is difficult to select an appropriate ε, a new algorithm, ν-SVR, was introduced in [69] that adjusts the ε parameter, which defines the tolerance margin, automatically. It introduces a new parameter ν ∈ (0, 1] that makes it possible to control the number of support vectors and training errors, establishing the proportion of support vectors that remain in the solution with respect to the total number of samples in the dataset. The ε parameter enters the optimization problem and is estimated automatically; the resulting formulation can be solved in a similar way to ε-SVR. In [70] the solution of ν-SVR is explained in detail, and in [66] the relationship between ε-SVR and ν-SVR is discussed.
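Two of the ingredients above, the ε-insensitive loss and the RBF kernel, can be sketched directly (Python; the parameter values ε = 0.1 and λ = 0.5 are illustrative assumptions).

```python
import math

def eps_insensitive_loss(y, f, eps=0.1):
    """Vapnik's eps-insensitive loss: zero inside the tube |y - f| <= eps,
    linear (|y - f| - eps) outside it."""
    return max(0.0, abs(y - f) - eps)

def rbf_kernel(x1, x2, lam=0.5):
    """RBF kernel K(x_i, x_j) = exp(-lambda * ||x_i - x_j||^2),
    computed without explicitly mapping the points into feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-lam * sq_dist)
```

A prediction within the ε tube incurs no loss at all, which is what makes the SVR solution sparse: only points on or outside the tube become support vectors.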
A grid search [67] was carried out to determine the parameters at one randomly chosen station, and these values were used at all stations. The grid included, among others, the tolerance values 0.01, 0.05, 0.1, . . . , where gamma is the kernel parameter that defines the influence of a single training point and the tolerance is the termination criterion.

Autoregressive Integrated Moving Average
Autoregressive integrated moving average (ARIMA) is a parametric method of univariate analysis of time series that have a stochastic nature, and whose methodology was described by Box and Jenkins [71][72][73][74][75].
The variable Yₜ, t = 1, 2, . . . , n, where n is the total number of observations, depends only on its own past and a set of random shocks, and on no other independent variables. It is, therefore, a matter of making forecasts of the future values of the variable using as information only that contained in the past values of the time series itself.
In the ARIMA(p, d, q) model, p represents the order of the autoregressive process (AR), d is the number of differences necessary for the process to be stationary (I) and q represents the order of the moving average process (MA).
This methodology is fundamentally based on two principles [71]: (a) Selection of the model iteratively through four steps: identification to determine the order p, d, q of the model, estimation of the parameters, validation to verify that the model fits the data and prediction.
(b) Concise parameterization or parsimony for the representation of the model with the minimum number of possible parameters.
B is defined as the lag operator, BYₜ = Yₜ₋₁, the result of delaying an observation by one period; in general, BᵏYₜ = Yₜ₋ₖ, where k is the number of lags. ∇ is defined as the difference operator, ∇Yₜ = Yₜ − Yₜ₋₁ = (1 − B)Yₜ. The autoregressive or AR(p) model is based on the fact that the value of the series at a given moment t is a linear combination of its past values up to a maximum lag p plus a random error, that is, Yₜ = β₀ + β₁Yₜ₋₁ + β₂Yₜ₋₂ + . . . + βₚYₜ₋ₚ + εₜ, where Yₜ₋₁, Yₜ₋₂, . . . , Yₜ₋ₚ are the past values of the series, β₀, β₁, β₂, . . . , βₚ are constants, with βₚ different from zero, that have to be estimated by regression, and εₜ is a Gaussian white noise error variable.
When a time series is non-stationary, the integrated process I(d) makes it near-stationary by differencing, where d is the number of differentiations necessary; using the lag and difference operators, ∇ᵈYₜ = (1 − B)ᵈYₜ. Hence, an ARIMA(p, d, q) model can be written as βₚ(B)∇ᵈYₜ = α_q(B)εₜ + γ, where γ is a constant, as indicated in [76]. A stationary model is an ARMA(p, q) process, that is, an ARIMA(p, 0, q).
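The difference operator ∇ᵈ used by the I(d) part of the model can be sketched as follows (Python; the example series is illustrative).

```python
def difference(y, d=1):
    """Apply the difference operator d times:
    one pass replaces y_t by y_t - y_{t-1}, shortening the series by one."""
    for _ in range(d):
        y = [y[t] - y[t - 1] for t in range(1, len(y))]
    return y

# A series with a linear trend (non-stationary mean) becomes a constant
# series, hence stationary, after a single difference.
trend = [2 * t + 5 for t in range(10)]
```

A quadratic trend would analogously require d = 2 differences before the series could be modeled as a stationary ARMA process.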
To fit the best ARIMA model for each station, the auto.arima function [77] of the forecast package for R was used. The values of the parameters p, d, q are indicated in Table 7.

Vector Autoregressive Moving Average
Vector autoregressive moving-average (VARMA) models are a parametric method for the multivariate analysis of time series of a stochastic nature. They are a generalization of the univariate ARMA models, with the difference that instead of a single variable there are several; that is, they study the relationships among several time series without distinguishing between exogenous and endogenous variables. A detailed explanation of the model can be found in [71,78].
In the VARMA(p, q) model, p represents the order of the vector autoregressive process (VAR) and q the order of the vector moving average process (VMA).
This methodology is fundamentally based on two principles [71]: (a) Selection of the model iteratively. First, determination of the p, q order of the model, estimation of the parameters, validation that the model fits the data and prediction.
(b) Representation of the model with the minimum possible number of parameters. B is defined as the lag operator in the same way as in the previous section, BᵏYₜ = Yₜ₋ₖ, where k is the number of lags [73].
Vector autoregressive or VAR(p) is a model in which the value of the series at a given time t is a linear combination of the past values of the variable and of the other variables up to a maximum lag p plus a random error vector, that is, Yₜ = β₀ + β₁Yₜ₋₁ + β₂Yₜ₋₂ + . . . + βₚYₜ₋ₚ + εₜ, where Yₜ is the vector of k variables to be predicted at time t, β₀ is a k-dimensional vector of constants, the βᵢ are k × k matrices for i > 0 with βₚ not a null matrix, and εₜ are multivariate white noise vectors with a positive definite covariance matrix Σε and zero mean.
Another representation with B operator can be given as β(B)Y t = β 0 + ε t , where β(B) = I − β 1 B − β 2 B 2 − · · · − β p B p is a degree p polynomial of matrices and I is k × k identity matrix [71].
The vector moving average or VMA(q) model represents the time series as a relationship between the present and the q past values of white noise, that is, Yₜ = µ + εₜ − α₁εₜ₋₁ − α₂εₜ₋₂ − . . . − α_qεₜ₋_q, where the αᵢ are k × k matrices with α_q not a null matrix, µ is a constant vector containing the mean of the process and εₜ are white noise series. Using the lag operator B, the model becomes Yₜ = µ + α(B)εₜ, where α(B) = I − α₁B − α₂B² − · · · − α_qB^q is a matrix polynomial of order q and I is the k × k identity matrix [78].
The vector autoregressive moving-average model VARMA(p, q) can then be written as β(B)Yₜ = β₀ + α(B)εₜ. The VARMA parameters p and q were determined by trial and error, choosing those giving the smallest RMSE. These parameters are shown in Table 8.
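As an illustration of how such a model produces forecasts, the following sketch computes a one-step-ahead point forecast for the purely autoregressive case VAR(1), the moving-average part being omitted for brevity (Python/NumPy; all coefficient values are invented for illustration).

```python
import numpy as np

# One-step VAR(1) forecast sketch: Y_t = beta0 + B1 @ Y_{t-1} + eps_t,
# so the point forecast for Y_{t+1} given Y_t is beta0 + B1 @ Y_t
# (the white-noise term has zero mean). Values below are illustrative only.
beta0 = np.array([0.1, -0.2])          # k-dimensional constant vector (k = 2)
B1 = np.array([[0.5, 0.1],             # k x k coefficient matrix beta_1
               [0.0, 0.4]])
y_t = np.array([1.0, 2.0])             # last observed vector of the two series

y_forecast = beta0 + B1 @ y_t          # = [0.8, 0.6]
```

The off-diagonal entries of B1 are what distinguish the vector model from k separate AR models: each series' forecast can draw on the past of the others.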

Performance Measurements
In order to compare the performance measures of the machine learning models, the same parameters were established for all the stations in each model, as indicated in the corresponding sections.
The accuracy of the different models developed in this study is evaluated by the root mean squared error (RMSE) [79,80], the mean absolute error (MAE) [80,81] and the bias [80], according to the following equations: RMSE = √((1/n) Σₜ₌₁ⁿ (yₜ − ŷₜ)²), MAE = (1/n) Σₜ₌₁ⁿ |yₜ − ŷₜ| and bias = (1/n) Σₜ₌₁ⁿ (yₜ − ŷₜ). As a method to estimate the prediction error of the machine learning algorithms, k-fold cross-validation was used. This method uses part of the data to fit the model and another part to test it [44]; the dataset was divided into k = 5 equal parts.
To validate the time series models, the data were divided into two sets: the first 80% of the observations were used to fit each algorithm, and predictions were made for the final 20%, which are the values used to calculate the RMSE, MAE and bias errors.

Results and Discussion
In this section, the performance of the forecasts produced with the MLR, MARS, MLPNN, SVM, ARIMA and VARMA models is presented. Table 3 shows the RMSE, MAE and bias values for the MLR models at all stations, Table 4 shows the same information for the MARS models, and Tables 5-8 do the same for the SVM, MLPNN (a single hidden layer with eight neurons), ARIMA and VARMA models, respectively. In the case of SVM, both ε-SVR and ν-SVR regressions were tested. Figure 2 compares the RMSE values for all models and stations; Figures 3 and 4 do the same for MAE and bias, respectively.

Station 23 does not show such a differentiated error pattern between machine learning models and time series models, and the error values are more clustered. The lowest RMSE, 1.095, was reached by both SVR models, and the highest, 1.360, by MLP. Something similar occurs with the MAE: the lowest error, 0.498, was achieved by ν-SVR and the highest, 0.621, by MLP. The lowest bias error, 10⁻⁴, was reached by MLR and the highest, 0.679, by VARMA, with positive values for MLR and MARS and negative values for the rest of the models.
At station 25, the RMSE error ranges from 1.015 for the MARS model to 1.397 for the VARMA model. Time series models have slightly higher RMSE than machine learning models. Regarding MAE, the time series models have approximately twice the value of the other models, the lowest value being 0.458 for ν-SVR and the highest for VARMA, with 1.123. The same goes for the bias error. All bias errors in machine learning models are negative and close to zero with the lowest value for MLR. On the other hand, the time series models present positive values; that is, the observed value is greater than that predicted by these models, with the highest bias value for VARMA, with 0.579.
At station 26, the ARIMA model is the one that differs most from the rest, which show similar behavior for the three errors. The lowest RMSE, 0.24, was reached by the MARS and SVR models, followed by MLP and MLR with 0.25 and VARMA with 0.27; the ARIMA model has an RMSE of 0.36. The smallest MAE, 0.097, was achieved by SVR; the rest of the models range from 0.104 for MARS to 0.164 for VARMA, and ARIMA reaches 0.319. The smallest bias error, a positive value of 5 × 10⁻⁶, is for MLR; in the rest of the models the values are negative, except for ARIMA, which reached 0.243.

The behavior of the RMSE at station 35 was very similar for all models, with the lowest value, 0.880, for ε-SVR and the highest, 1.088, for ARIMA. The machine learning models have a similar MAE, the smallest being 0.416 for ε-SVR, while the time series models have a slightly larger MAE, up to 0.643 for ARIMA. The bias error is negative for all models except VARMA and very close to zero; the lowest value, on the order of −10 × 10⁻⁴, corresponds to MLR. The SVR models show a bias error of −0.02, while that of the ARIMA model is −0.318.
At stations 15 and 23, the MLP model has a very high RMSE relative to the other machine learning models, in contrast to the homogeneous RMSE performance at the rest of the stations. Station 35 has a very homogeneous RMSE across all models. At stations 06 and 22, the machine learning models also have a homogeneous RMSE, but there is a clear difference with respect to the time series models, whose RMSE is higher. Station 15 has the highest RMSE and MAE among the machine learning models, but this is not the case for the time series models. Stations 26 and 27 have clearly lower RMSE and MAE values for all models, including the time series models. The worst bias error for the ARIMA model is found at station 22; this model, together with VARMA, generally has the highest bias error at all stations. The SVM models, ε-SVR and ν-SVR, perform in a very similar way.

Conclusions
This paper studies the relationship between four predictors and benzene in order to establish predictions at eight stations in the community of Madrid, Spain, using seven mathematical models: MLR, MARS, ε-SVR, ν-SVR, MLP, ARIMA and VARMA.
Stations 06, 15, 22 and 35 are located in the city center and show a higher average benzene concentration than the rest of the stations, which are located to the east, on the outskirts of the city or near the airport.
The models were evaluated using the RMSE, MAE and bias. The machine learning models were validated using k-fold cross-validation with k = 5, and the time series models were validated on the final 20% of the observations.
The results showed that, in general, machine learning models are more effective at predicting than time series models, although this does not hold at every station: at station 15, the lowest RMSE is obtained with the VARMA model. The highest error values occur for the time series models, except at station 15, where they were obtained for the MLP model. MLR, MARS, SVR and MLP follow similar behavior patterns at all stations, and stations 26 and 27 show the lowest errors.
The time series models, ARIMA and VARMA, present greater variations in the values obtained for the stations, not following the same pattern as the machine learning models.