A Methodology to Increase the Accuracy of Particulate Matter Predictors Based on Time Decomposition

Particulate matter (PM) is one of the most harmful air pollutants to human health studied worldwide. In this scenario, it is of paramount importance to monitor and predict PM concentration. Artificial neural networks (ANN) are commonly used to forecast air pollution levels due to their accuracy. The use of partitioning in prediction problems is well known, because the decomposition of a time series reveals the latent components of the original series. It is a matter of extracting the "deterministic" component, which is easy to predict, from the random components. However, there is no evidence of its use in air pollution forecasting. In this work, we introduce a different approach consisting of the decomposition of the time series into contiguous monthly partitions, aiming to develop specialized predictors, since air pollutant concentration has seasonal behavior. The goal is to reach prediction accuracy higher than that obtained by using the entire series. Experiments were performed for seven time series of daily particulate matter concentrations (PM2.5 and PM10, particles with diameter less than 2.5 and 10 micrometers, respectively) in Finland and Brazil, using four ANNs: multilayer perceptron, radial basis function, extreme learning machines, and echo state networks. The experimental results, using three evaluation measures, showed that the proposed methodology increased all models' prediction capability, leading to higher accuracy compared to the traditional approach, even for extremely high air pollution events. Our study makes an important contribution to air quality prediction studies. It can help governments take measures aiming at air pollution reduction and prepare hospitals during extreme air pollution events, which relates to the following United Nations sustainable development goals: SDG 3 (good health and well-being) and SDG 11 (sustainable cities and communities).

In the first step of Figure 1, a seasonality test is performed to determine the series' seasonal period. In this work, we used Friedman's test and the Kruskal-Wallis test [42]. For example, suppose we assume that the PM time series presents similar patterns in a monthly fashion based on the statistical tests. In that case, a forecasting model is employed to perform daily predictions in a specific month. The objective is to build models specialized for each month, since similar patterns may be present in the same month. These models are selected based on the current prediction month.
A statistical test is adopted using the coefficient of variation (CV) analysis, which should answer the question: "Is the mean CV calculated from the partitions smaller than the CV of the entire series?" We adopted this step given the relationship between predictability and CV demonstrated by [43][44][45]. The purpose is to find time series partitions with lower CV values than the entire time series. Figure 2 depicts the framework of the testing phase. The first part consists of determining the partition used for the prediction. For partition j, the system uses the Fj model, which was previously designed (in the training phase) precisely to predict the j-th subseries (in the j-th partition). The number of subseries is defined based on the aforementioned seasonality tests and the mean CV values, for which the value n = 12 achieved the best results. Furthermore, n = 12 also matches the seasonal component of the series. Thus, if the prediction is performed in March, the F3 model is used; if it is performed in December, the F12 model is used.
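For illustration, the CV comparison described above can be sketched in a few lines. This is a minimal NumPy sketch under our own naming, assuming the series is stored as a 1-D array with a parallel array giving the month (1 to 12) of each sample; it is not the authors' implementation.

```python
import numpy as np

def coefficient_of_variation(x):
    """CV = standard deviation of the samples divided by their mean."""
    return np.std(x) / np.mean(x)

def cv_partition_test(series, months):
    """Compare the CV of the entire series with the mean CV of the
    monthly partitions; 'months' gives the month (1..12) of each sample."""
    whole_cv = coefficient_of_variation(series)
    partition_cvs = [coefficient_of_variation(series[months == m])
                     for m in np.unique(months)]
    return whole_cv, float(np.mean(partition_cvs))
```

If the mean partition CV is lower than the whole-series CV, the monthly decomposition is expected to favor predictability.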
Different scenarios have used this monthly approach. Ballini et al. [46] used fuzzy inference systems for identifying two hydrological processes and for generating synthetic monthly inflow sequences. Luna and Ballini [37] addressed monthly electrical energy demand forecasting for an energy enterprise in Brazil's Southeastern region, while Siqueira et al. [27,28] proposed a similar methodology for forecasting seasonal streamflow series related to hydroelectric plants. In such studies, the lags (inputs) of the models were the previous months' data. For example, to predict May 2010, the entries were the April, March, and February samples from 2010. However, there are essential differences between the method introduced in this work and the approaches mentioned above:

1. A seasonality test is conducted to find a suitable seasonal period for the series;
2. The calculation of the CV is employed to determine which partitions present similar patterns;
3. MLP, RBF, ESN, and ELM are used to assess the proposed methodology.
It is remarkable that, in the traditional approach, a forecasting model needs to learn the statistical fluctuations of the nonlinear mapping for all months of the year. In the proposed approach, despite having less data for adjustment, each model only needs to learn the statistical fluctuations of the month under evaluation.

Partition Creation and Calculation of the Coefficient of Variation
As each partition contains data specific to a particular month, it is expected that similar temporal patterns are clustered. Consequently, one expects a favorable scenario to present lower CV values in the partitions than in the entire series. Since low/high values of the CV point to low/high data fluctuations, a correspondingly high/low predictability is expected. Some works in the literature, e.g., [43][44][45][47][48][49], used this measure in different forecasting applications.

Kahn [44] addressed the CV as a measure of predictability in the sales support and operations planning process. Fatichi et al. [49] used it to investigate the inter-annual variability of precipitation on a global scale. Bahra [48] used a market's implied risk-neutral density (RND) function measure based on the CV as a volatility forecasting tool.

Multilayer Perceptron
The MLP is a feedforward artificial neural network (ANN) with multiple layers of artificial neurons (nodes) [27]. The artificial neurons of layer i are fully connected to the nodes of the subsequent layer (i + 1).
This architecture can have three or more layers [57]:
1. Input layer: transmits the input signal to the hidden layers;
2. Hidden (intermediate) layers: these sets of neurons perform a nonlinear transformation, mapping the input signal to another space;
3. Output layer: receives the signal from the hidden layers and combines it to form the network's output, often through a linear combination.
In this work, we addressed an MLP with just one hidden layer. Let u be the vector containing the inputs, b the bias vector, and w^i_{kn} the intermediate layer weights. The variable n = 1, ..., N is the index of each input, k = 1, ..., K is related to each neuron, and w^o_{1k} are the weights of the output neuron. The output response of the network is given by Equation (1):

y = f_s\left( \sum_{k=1}^{K} w^o_{1k} \, f\left( \sum_{n=1}^{N} w^i_{kn} u_n + b_k \right) \right),  (1)

in which f is the activation function of the hidden neurons and f_s is a linear activation function of the output layer. The training process of an ANN is defined as the steps to find its best set of weights to solve the task. In an MLP, the most usual way to adjust the network is an iterative process, resorting to the solution of an unconstrained nonlinear optimization problem. The MLP architecture used in this work has three layers, and the training process is based on the modified scaled conjugate gradient, a second-order method [27].
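The forward mapping of a single-hidden-layer MLP can be sketched as follows. This is an illustrative sketch, not the trained networks of the paper; it assumes a tanh hidden activation (a common choice, consistent with the saturation discussed later) and a scalar linear output.

```python
import numpy as np

def mlp_forward(u, W_hid, b, w_out, b_out):
    """Single-hidden-layer MLP response: tanh hidden units (f) and a
    linear output combination (f_s). W_hid is K x N, w_out has length K."""
    hidden = np.tanh(W_hid @ u + b)        # nonlinear hidden mapping
    return float(w_out @ hidden + b_out)   # linear output layer
```

Training would iteratively adjust W_hid, b, w_out, and b_out; the paper uses a modified scaled conjugate gradient for this step.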

Radial Basis Function Network
The radial basis function network (RBF) is also a widely known ANN framework, with two neuron layers: a hidden and an output layer. The first performs an input-output mapping based on kernel (activation) functions of radial basis. In the hidden layer, each neuron presents two free parameters: a center and a dispersion. The most used function is the Gaussian, given by Equation (2):

\phi(\mathbf{u}) = \exp\left( -\frac{\|\mathbf{u} - \mathbf{c}\|^2}{2\sigma^2} \right),  (2)

where c is the center of the Gaussian function and σ² is its variance. The second layer combines the previous layer's outputs, in many cases using a linear approach [57]. The training of an RBF is performed in two stages. Initially, the weights of the hidden layer are adjusted by calculating their centers and dispersions. This task is performed using unsupervised clustering methods, like K-Medoids. The second stage tunes the weights of the output layer. One can use a linear procedure to combine the outputs of the hidden layer; in this work, we address the Moore-Penrose pseudoinverse [58]. This operation ensures the minimum mean squared error (MSE) at the output. This solution is presented in Equation (3):

\mathbf{W}_{out} = \mathbf{X}_{hid}^{\dagger} \, \mathbf{d},  (3)

where X_hid is the matrix containing the outputs of the hidden layer for the training set, X_hid^† its Moore-Penrose pseudoinverse, W_out the weights of the output layer, and d the desired output.
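The two building blocks above, the Gaussian kernel and the pseudoinverse readout, can be sketched directly. This is a simplified sketch (our function names); `np.linalg.pinv` computes the Moore-Penrose pseudoinverse, which yields the least-squares output weights.

```python
import numpy as np

def gaussian_kernel(u, center, sigma2):
    """Radial basis response of one hidden neuron, in the style of
    Equation (2): exp(-||u - c||^2 / (2 * sigma^2))."""
    return float(np.exp(-np.sum((u - center) ** 2) / (2.0 * sigma2)))

def train_output_layer(X_hid, d):
    """Output weights via the Moore-Penrose pseudoinverse, in the style
    of Equation (3); minimizes the MSE at the output."""
    return np.linalg.pinv(X_hid) @ d
```

The same `train_output_layer` step reappears in the ELM and ESN training described below, since all three tune only a linear readout.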

Extreme Learning Machines
Extreme learning machines (ELM) [55] are single-hidden-layer feedforward neural networks with high similarity to the traditional MLP. The main difference between them is the training, since in this case the hidden layer is not adjusted during the training process [28]; just the output layer is tuned [27]. Therefore, its training process requires less computational effort than the MLP's. Like the MLP, the ELM is a universal approximator.
Consider u the vector of inputs and b the bias. The output of the hidden layer is given by Equation (4):

\mathbf{x}_{hid} = f_h(\mathbf{W}_h \mathbf{u} + \mathbf{b}),  (4)

where W_h is the matrix containing the hidden layer weights and f_h(·) the activation function of these neurons. The output of the network is given by Equation (5):

y = \mathbf{W}_{out} \, \mathbf{x}_{hid},  (5)

in which W_out is the matrix with the weights of the output layer. The training process is summarized in finding the weights of the output layer, while W_h is randomly chosen. The most usual way to solve this task is applying the same Moore-Penrose pseudoinverse operator depicted in the RBF case.
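The whole ELM training procedure fits in a few lines, since only the readout is solved. A minimal sketch, assuming Gaussian-initialized hidden weights and tanh activations (our choices for illustration, not necessarily the paper's exact settings):

```python
import numpy as np

def train_elm(U, d, n_hidden, seed=0):
    """ELM sketch: random, fixed hidden layer (Equation (4) style) and a
    least-squares output layer via the pseudoinverse (Equation (5) style)."""
    rng = np.random.default_rng(seed)
    W_h = rng.normal(size=(n_hidden, U.shape[1]))  # never trained
    b = rng.normal(size=n_hidden)
    X_hid = np.tanh(U @ W_h.T + b)                 # hidden responses
    W_out = np.linalg.pinv(X_hid) @ d              # only layer trained
    return W_h, b, W_out

def elm_predict(U, W_h, b, W_out):
    return np.tanh(U @ W_h.T + b) @ W_out
```

Because the only "training" is one pseudoinverse, fitting is orders of magnitude faster than the MLP's iterative optimization.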

Echo State Networks
Echo state networks (ESN) are recurrent ANNs, since they present feedback loops of information in the hidden layer, here named the dynamic reservoir [56]. This characteristic can be an advantage, especially in problems with temporal dependence between the samples. As with the ELM, the training process of an ESN adjusts only the output layer.
Jaeger [56] proved the possibility of setting the weights of the neurons in the reservoir in advance and keeping them unchanged during training, under specific conditions [27,28].
The output of the reservoir x_{n+1} is given by Equation (6):

\mathbf{x}_{n+1} = f(\mathbf{W}^{in} \mathbf{u}_{n+1} + \mathbf{W} \mathbf{x}_n),  (6)

where u_{n+1} is the input signal, W^in the matrix containing the weights of the input layer, W the reservoir matrix holding the feedback connections' synaptic weights, and f(·) the respective activation functions. The output y of the network is as in Equation (7):

\mathbf{y}_{n+1} = \mathbf{W}_{out} \, \mathbf{x}_{n+1},  (7)

where W_out contains the weights of the output layer. There are some manners to generate the weights of the reservoir. In this work, we address the proposal from Jaeger [56], a sparse reservoir presented in Equation (8):

w = \begin{cases} 0.4, & \text{with probability } 0.025 \\ -0.4, & \text{with probability } 0.025 \\ 0, & \text{with probability } 0.95 \end{cases}  (8)

Once again, we use the Moore-Penrose pseudoinverse operator to train the network [59][60][61].
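The reservoir generation and state update can be sketched as below. This is an illustrative sketch with tanh reservoir activations (our assumption); training would then apply the pseudoinverse readout, exactly as in the RBF and ELM cases.

```python
import numpy as np

def sparse_reservoir(size, seed=0):
    """Reservoir weights in Jaeger's fashion (Equation (8) style): a sparse
    matrix with entries 0.4, -0.4 or 0 (probabilities 0.025, 0.025, 0.95)."""
    rng = np.random.default_rng(seed)
    return rng.choice([0.4, -0.4, 0.0], size=(size, size),
                      p=[0.025, 0.025, 0.95])

def esn_states(inputs, W_in, W):
    """Iterate x_{n+1} = f(W_in u_{n+1} + W x_n) with f = tanh,
    collecting one reservoir state per input sample (Equation (6) style)."""
    x = np.zeros(W.shape[0])
    states = []
    for u in inputs:
        x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)
        states.append(x.copy())
    return np.array(states)
```

The collected states form the matrix whose pseudoinverse, multiplied by the desired outputs, yields W_out.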

Case Studies
Aiming to analyze whether the proposed methodology brings gains to any studied location, we chose databases from two countries with distinct climates and emission sources (Brazil and Finland). The case studies consist of predicting PM10 and PM2.5 for four different areas with distinct characteristics: Helsinki (Kallio station, PM10 and PM2.5, and Vallila station, PM10), São Paulo city (Tietê station, PM10 and PM2.5), Campinas, and Ipojuca (both PM10). São Paulo is the leading business, commercial, and financial center of South America, with a Human Development Index of 0.805 [59,63]. Campinas city is the third most populous municipality in São Paulo state, Brazil, occupying 797.6 km², with an urban area of 238.2 km², 1,194,094 inhabitants, and a demographic density of 1359.60 inhabitants per km² [62]. Campinas and São Paulo have similar climates, with hot, rainy, long summers and shorter winters with mild temperatures. The temperature ranges from 13 °C to 29 °C, rarely reaching values below 9 °C and above 33 °C [64].
Ipojuca city, Pernambuco state, Brazil, is located in the Brazilian Northeast, in the metropolitan region of the Pernambuco state capital, Recife. It is a coastal city with 96,204 inhabitants and a demographic density of 152.98 inhabitants per km² [62]. With a hot and windy climate, the temperature ranges from 22 °C to 32 °C, rarely reaching values below 21 °C and above 34 °C [64].
Helsinki is the capital and most populous city of Finland, with 655,276 inhabitants and an urban-area population of 1,268,296 [62]. It is Finland's most important center for politics, education, finance, culture, and research. Helsinki's weather differs significantly from that of the other studied cities, with temperature ranging from −8 °C to 21 °C, rarely reaching values below −19 °C and above 26 °C [64].
The demographic, business, and meteorological characteristics of each studied city are distinct and help ensure the proposed methodology is relevant to any location. Additionally, they present varied emission sources. Figures 4 and 5 show each station's location in Brazil and Finland, respectively. São Paulo and Campinas are mainly affected by vehicular emissions. The São Paulo station is near a ring road, being influenced mainly by heavy-duty emissions. In contrast, the Campinas station is located near one of the city's main avenues (urban area), with predominantly light-duty emissions. The Ipojuca station is near a petrol refinery (industrial area) and the port of Suape, a different emission pattern from Campinas and São Paulo. Beyond the demographic, business, and meteorological differences between Helsinki and Brazil, the Kallio station is an urban background station, while Vallila is in the city downtown, with high traffic [13,16,21]. We highlight that the data from Finland were addressed in important investigations about air pollution forecasting [13,16,21].

Table 1 shows the number of samples and the studied period for each station considered in this work, along with the data source for each one. It is relevant to highlight the time range for the Ipojuca station, which leaves some months (May and June) with data for only one year. The time range varies across stations, so as to comprise different databases and analyze the robustness of the proposed approach. The Kallio and Vallila time series were used, despite their older time range, because remarkable works in the literature [10,14,65] used them.
The satellite map is from Google Maps (Map data ©2020 Google; https://www.google.com/maps/place/Finland/); the satellite image is from Google Earth Pro (Map data ©2020 Google; https://www.google.com/maps/@60.1834096,24.8975655,12.83z).

Computational Simulations
Before applying the methodology, the database is divided into three sets:
1. Test set, for performance evaluation: composed of the last 10 samples of each month (for example, in January, we used the data from the 22nd to the 31st);
2. Validation set, to avoid excessive adjustment (only for the MLP): formed by the last 10 samples of each month's remainder, excluding the test set (in January, we used the data from the 12th to the 21st);
3. Training set, to adjust the free parameters of the neural models: the remaining samples.
We standardized the test set to 10 samples for all months, since for the Ipojuca series there are two cases in which just one month of daily data (April and May 2016) is available. In this sense, we have 30 samples to adjust and evaluate the predictors' performances for the decomposed series.
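The split described above can be sketched for a single month's daily samples (a minimal sketch; the function name is ours):

```python
import numpy as np

def split_month(values):
    """Split one month's daily samples as described above: the last 10
    days form the test set, the 10 before them the validation set, and
    the rest the training set. Returns (train, validation, test)."""
    return values[:-20], values[-20:-10], values[-10:]
```

For a 31-day month this yields an 11-sample training slice, days 12-21 for validation, and days 22-31 for testing, matching the January example.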
For all experiments, we normalized the dataset to the interval [−1, +1]. This is mandatory for the neural models due to the saturation of the hyperbolic tangent activation functions [57].
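A standard min-max scaling achieves this normalization. A small sketch (our naming); in practice the minimum and maximum would be taken from the training set and reused for validation and test data:

```python
import numpy as np

def to_unit_range(x, lo=None, hi=None):
    """Min-max scaling to [-1, +1]. Passing lo/hi computed on the
    training set keeps validation/test data on the same scale."""
    lo = np.min(x) if lo is None else lo
    hi = np.max(x) if hi is None else hi
    return 2.0 * (x - lo) / (hi - lo) - 1.0
```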

The models were evaluated using the following measures:

• Mean squared error (MSE): MSE = \frac{1}{N_s} \sum_{t=1}^{N_s} (x_t - \hat{x}_t)^2, where x_t is the observed sample at time t, \hat{x}_t is the prediction, and N_s the number of predicted samples;
• Mean absolute error (MAE): MAE = \frac{1}{N_s} \sum_{t=1}^{N_s} |x_t - \hat{x}_t|;
• Mean absolute percentage error (MAPE): MAPE = \frac{100}{N_s} \sum_{t=1}^{N_s} \left| \frac{x_t - \hat{x}_t}{x_t} \right|;
• Root mean squared error (RMSE): RMSE = \sqrt{MSE};
• Index of agreement (IA): IA = 1 - \frac{\sum_{t=1}^{N_s} (x_t - \hat{x}_t)^2}{\sum_{t=1}^{N_s} (|\hat{x}_t - \bar{x}| + |x_t - \bar{x}|)^2}, in which \bar{x} is the average of the observed series.
For the MSE, MAE, MAPE, and RMSE measures, the lower their values, the more accurate the model. The IA varies in the range [0, 1], and the higher its value, the better the accuracy.
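The five measures can be computed together in one pass. A minimal NumPy sketch (our naming), following the standard definitions of these metrics:

```python
import numpy as np

def evaluate(x, x_hat):
    """MSE, MAE, MAPE, RMSE and IA for observed x and predicted x_hat."""
    e = x - x_hat
    mse = np.mean(e ** 2)
    mae = np.mean(np.abs(e))
    mape = 100.0 * np.mean(np.abs(e / x))   # undefined if any x == 0
    rmse = np.sqrt(mse)
    xbar = np.mean(x)
    ia = 1.0 - np.sum(e ** 2) / np.sum((np.abs(x_hat - xbar)
                                        + np.abs(x - xbar)) ** 2)
    return mse, mae, mape, rmse, ia
```

A perfect prediction yields zero for the four error measures and IA = 1.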
We performed 30 simulations for each PM series. The best configuration was selected based on the lowest MSE value in the validation set. For all experiments, the models were evaluated on one-step-ahead forecasting over the testing set. We highlight that the number of inputs, corresponding to time lags of the time series [71], is established via the wrapper method [72], considering the MSE as the evaluation function [73]. We considered up to 10 lags. Table 3 shows the coefficient of variation (CV) for all series under different scenarios: without partition (whole series), with an annual partition (separated by year), and with the monthly partition. The values in bold indicate the months in which the CV is smaller than the CV of the series without partition or with the annual partition. Note that the Campinas series comprises just two years.
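The wrapper-based lag selection mentioned above can be illustrated with a greedy forward search. This is a simplified sketch under our own assumptions: it wraps a plain linear least-squares predictor for speed, whereas the paper wraps the neural models themselves; the function names are ours.

```python
import numpy as np

def lagged_design(series, lags):
    """Inputs/targets built from the chosen time lags of a 1-D series."""
    start = max(lags)
    X = np.column_stack([series[start - l:series.size - l] for l in lags])
    return X, series[start:]

def validation_mse(series_tr, series_val, lags):
    """Fit a linear predictor on the chosen lags (least squares, with an
    intercept) and return its one-step-ahead MSE on the validation series."""
    X, y = lagged_design(series_tr, lags)
    w = np.linalg.pinv(np.column_stack([X, np.ones(y.size)])) @ y
    Xv, yv = lagged_design(series_val, lags)
    pred = np.column_stack([Xv, np.ones(yv.size)]) @ w
    return float(np.mean((yv - pred) ** 2))

def wrapper_select_lags(series_tr, series_val, max_lag=10):
    """Greedy forward wrapper: repeatedly add the lag that most reduces
    the validation MSE; stop when no candidate lag improves it."""
    selected, best = [], np.inf
    improved = True
    while improved:
        improved = False
        for lag in range(1, max_lag + 1):
            if lag in selected:
                continue
            mse = validation_mse(series_tr, series_val, selected + [lag])
            if mse < best:
                best, chosen, improved = mse, lag, True
        if improved:
            selected.append(chosen)
    return sorted(selected), best
```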
The CV (Table 3) has a reduction tendency when using the monthly partition. Comparing the CV of monthly partition to annual and without partition, we can observe that it showed lower values from

Results
Tables 4-10 show the results in terms of MSE for the seven previously mentioned stations, separated by month. The subscript "PP" denotes the results of the proposed partition (monthly decomposition, 12 predictors), while "TR" represents the traditional approach (only one predictor for the whole series). For the sake of comparison, we also ran a linear autoregressive (AR) model, from the Box-Jenkins family, and a simple naive persistence method ("PERS") [74]. The best performances by month are highlighted in bold. The decomposition was used for all three sets (training, validation, and test). Also, Tables A1-A7 in Appendix A present the values for MAE, MAPE, RMSE, and IA.
When using the proposed approach, 12 models are adjusted during training, with the samples separated per month. On the other hand, the traditional approach uses just one predictor and all training samples.
The results regarding Kallio PM 10 (Table 4) allow an interesting observation about the error metrics: there is a divergence among the best performances (MSE, MAE, MAPE, RMSE, or IA). Although this behavior does not happen very often, for October we can note that the best MSE, RMSE, and IA were achieved by the RBF PP, the best MAE by the ESN PP, while the smallest MAPE belongs to the ELM PP (see Table A1). In this case, we chose the MSE as the primary metric, as it is minimized during the ANNs' training process and in the adjustment of the AR model [28,63,75]. Note that the RMSE is directly obtained from the MSE.
The comparison between the traditional and the monthly approach reveals that the decomposition increased the predictors' performance for almost all months, except for the ELM: only in April did the ELM TR show the best result. For the other months, the RBF PP mainly reached the smallest errors (seven months), with the MLP PP and ESN PP best in two cases each.

Figure 6 shows the real and predicted values for the Kallio PM 10 series, considering the best predictor, the RBF PP, its counterpart in the traditional approach (RBF TR), and the samples from the test set. The better prediction capability of the monthly approach is clear. It is important to highlight that the monthly approach improved the prediction of some extreme events of high air pollutant levels.

Table 5 shows the results for Kallio PM 2.5. The best results for each month showed different behavior than for Kallio PM 10. Again, the ELM TR overcame the others for one month (January), while the RBF PP, MLP PP, and ESN PP reached the best MSE for four, four, and three months, respectively. The monthly proposal was superior to the traditional method.
We noticed again a tendency of error reduction when comparing the distinct approaches for the same architecture. Despite the ELM TR presenting the best overall result for one month, the ELM PP showed smaller MSE for eight months, overcoming the traditional methodology. We highlight that the RBF TR gave the worst errors, an intriguing finding since its counterpart RBF PP was one of the best. Once again, we observed a discrepancy regarding the best predictors and error metrics in Table A2.

Seven times, the best IA did not match the MSE and RMSE, while the MAE did not match five times, and the MAPE six times.

Figure 7 presents the best predictions in the test set for the RBF PP and MLP PP, and the same ANNs using the traditional approach (RBF TR and MLP TR). Again, the better performance of the monthly approach is clear. Additionally, extremely high and low air pollution events were well predicted.

Table 6 summarizes the results for Vallila PM 10. The predictors' performances for the Vallila station were favorable to the monthly approach concerning the MSE, with the RBF PP reaching the smallest MSE the most times (four months). Since the RBF PP and ESN PP achieved the best MSE in four months each, and considering the number of best results and months, we present their output responses in Figure 8, together with those of the traditional approach (RBF TR and ESN TR). It is interesting to observe the differences between the proposed and traditional approaches. While the proposed methodology presented adequate prediction capability over the whole test set, the original approach fits well for specific samples (some days of March, for example) but has significant discrepancies in others (the beginning of September and October, for example).

Table 7 shows the errors achieved for the São Paulo PM 10. The behavior found for São Paulo PM 10 was very similar to that of Vallila. The monthly approach reached all the best errors based on the MSE metric, with the RBF PP presenting the best results five times, the MLP PP four times, and the ESN PP three times. Figure 9 presents the predictions achieved by the RBF PP for the São Paulo PM 10 time series, compared to the RBF TR and the original data. In this time series, we can observe that the RBF PP architecture could not reach small prediction errors for all the air pollution peaks and valleys, probably due to external factors affecting air pollution, compared to the previous time series. However, it still has better prediction performance than the traditional approach.

Table 8 shows the results for the São Paulo PM 2.5 series. Once again, the results presented 100% of the best MSE values favorable to the monthly approach, highlighting the ESN PP and the MLP PP, which were the best in four cases each, followed by the RBF PP (three cases) and ELM PP (one case). It is the first time the RBF was not the best in most scenarios, showing the importance of applying different ANNs. Figure 10 shows the MLP PP and MLP TR results compared to the original data of the São Paulo PM 2.5 test set.

Table 8 shows the results for the São Paulo PM 2.5 series. Once again, 100% of the best MSE values favored the monthly approach, highlighting the ESN PP and the MLP PP , which were the best in four cases each, followed by the RBF PP (three cases) and the ELM PP (one case). This is the first time the RBF was not the best in most scenarios, showing the importance of applying different ANNs. Figure 10 shows the MLP PP and MLP TR results compared to the original data for the São Paulo PM 2.5 test set.

Table 9 presents the results for Campinas PM 10 . Analyzing the metrics in Table 9, we can observe that the traditional approach led to the smallest errors for June (ESN TR ) and August (ELM TR ). The monthly proposal (RBF PP four times, ESN PP five times, and ELM PP once) was the best for the remaining ten months. As in the São Paulo PM 2.5 time series, the ESN found the best monthly performances. Figure 11 presents the predictions found by the ESN PP and ESN TR in the test set compared to the original data.
Finally, Table 10 reveals the results for the last station considered in this work, Ipojuca PM 10 . The traditional approach achieved the smallest errors only for April (ESN TR ). For the other months, the ESN PP showed the best MSE values five times, the MLP PP was the best for three months, the RBF PP for two months, and the ELM PP for only one month. In most cases, the general performance of the neural models was improved by the monthly approach. Additionally, in some instances, the best results according to distinct metrics did not match.

Figure 12 presents the forecasts achieved by the ESN PP , compared to the ESN TR and the original data of the Ipojuca PM 10 time series. Again, the monthly approach seems to have good prediction power.

Lastly, it is worth mentioning that we applied Friedman's test to the MSE of the test sets for all experiments, to verify whether the results were significantly different [43]. The p-values ranged from 5.2771 × 10 −12 to 0.0112. Therefore, we can accept the hypothesis that a change in the predictor leads to different results.
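As a sketch of the statistical check described above, Friedman's test can be computed from a months-by-predictors error matrix. The implementation below is a minimal numpy version of the test statistic, and the MSE values are random placeholders, not the paper's actual results.

```python
import numpy as np

def friedman_statistic(errors):
    """Friedman chi-square statistic for an (n_blocks, k_treatments) matrix.
    Rows are blocks (here, months); columns are treatments (here, predictors).
    Assumes no ties within a row (reasonable for real-valued MSEs)."""
    n, k = errors.shape
    # Rank the predictors within each month: 1 = smallest error.
    ranks = errors.argsort(axis=1).argsort(axis=1) + 1
    mean_ranks = ranks.mean(axis=0)
    return 12.0 * n / (k * (k + 1)) * np.sum((mean_ranks - (k + 1) / 2.0) ** 2)

# Illustrative MSE table: 12 months x 4 predictors (random placeholder values).
rng = np.random.default_rng(0)
mse = rng.uniform(10.0, 50.0, size=(12, 4))
stat = friedman_statistic(mse)
# The statistic is compared against a chi-square distribution with k - 1
# degrees of freedom to obtain the p-value (scipy.stats.friedmanchisquare
# performs both steps, including a tie correction).
```

A small p-value, as in the experiments above, indicates that at least one predictor's errors differ significantly from the others'.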


Discussion
The computational results presented in Tables 4-10 allow some essential remarks. Initially, we elaborated Table 11, which shows how many times each model achieved the best performance by month. The goal is to analyze whether the monthly decomposition brings gains to any neural network's performance, independently of the characteristics of the studied time series.

Table 11. Ranking of the best predictors by month regarding the mean squared error (MSE) in the test sets.

Considering the 12 months of the seven test sets, we have 84 scenarios. The main observation is that in 79 out of 84 months (94%), the monthly approach achieved the best overall results, a clear indication that using the disaggregated series with 12 predictors, instead of only one model for the whole series, is an advantage.
Comparing the same neural architecture under the traditional and the monthly approaches (e.g., RBF TR versus RBF PP ) is favorable to our proposal in almost all cases. It is essential to highlight that we used seven time series from four different cities in two countries and distinct years, with diverse characteristics.
With regard to the neural models, the RBF PP and ESN PP presented similar results concerning the number of best monthly results found. For Kallio PM 10 and São Paulo PM 10 , the RBF stands out. For Kallio PM 2.5 , both the RBF and the MLP showed the best results, and for Vallila PM 10 , the RBF and the ESN did. The other three time series were favorable to the ESN.
We cannot state which predictor is the best for such a task. Therefore, in related problems, both the RBF and ESN models should be tested. Related works that used the disaggregated method, as in streamflow series [27,28,58], indicated that the ELM overcame its fully trained counterpart (MLP), but it was the worst in this work. On the other hand, in [59] and [63], the fully trained approaches were the winners. It seems clear that evaluating several distinct neural architectures is essential in time series forecasting tasks. In addition, the neural approaches overcame the linear AR model and the persistence model, except for February of the Kallio PM 10 time series.
In this work, we followed the premises found in other investigations of prioritizing the MSE [27,28,59,63], since it penalizes higher errors. However, some studies pointed to other error metrics; in [76], for instance, the authors advocate the use of the MAE. In this sense, we calculated the number of times our proposition achieved the best values for each metric: (i) MAE-89%; (ii) MAPE-86%; (iii) RMSE-94%; (iv) IA-61%. These results confirm the gains of applying the new methodology based on decomposition.
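For reference, the four evaluation measures discussed above can be computed as in the following minimal numpy sketch; the IA here is taken to be Willmott's index of agreement, and this is not the exact code used in the study.

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    return np.mean(np.abs(y - yhat))

def mape(y, yhat):
    """Mean absolute percentage error; assumes y has no zero entries
    (PM concentrations are strictly positive)."""
    return 100.0 * np.mean(np.abs((y - yhat) / y))

def rmse(y, yhat):
    """Root mean squared error; penalizes large errors more than the MAE."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def ia(y, yhat):
    """Willmott's index of agreement, bounded in [0, 1]; 1 is a perfect fit."""
    num = np.sum((y - yhat) ** 2)
    den = np.sum((np.abs(yhat - y.mean()) + np.abs(y - y.mean())) ** 2)
    return 1.0 - num / den
```

Because the squared residuals in the RMSE (and MSE) grow quadratically, these metrics emphasize the extreme pollution events that matter most in this application, which motivates ranking models by MSE.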
During extreme events of high air pollution, health systems collapse due to overcrowding, and governments need to take quick measures to ensure treatment for the whole population. To this end, the monthly approach was robust for all considered time series, although the prediction power may differ depending on each time series' behavior.
Lastly, PM emissions are mainly anthropogenic, so life cycle assessment (LCA) thinking may help to better understand their impacts on human health. One scientific challenge of the LCA community is the regionalization of life cycle impact assessment (LCIA) methods, as the source location and its surrounding conditions are paramount to a reliable and accurate result [77,78]. In this sense, our study may assist the development and/or regionalization of PM formation methods in the LCIA context.

Conclusions
This paper proposed a methodology based on monthly decomposition for PM concentration forecasting. In this approach, we applied 12 independent predictors, each one responsible for predicting the data of one month. The traditional methodology uses just one predictor for the whole series.
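The core of the proposed methodology, splitting the series into 12 calendar-month partitions and fitting one specialized predictor per partition, can be sketched as below. The synthetic series is a placeholder assumption, not the paper's data, and the partitioning helper is an illustrative implementation.

```python
import numpy as np
import pandas as pd

def monthly_partitions(series: pd.Series) -> dict:
    """Split a daily time series (with a DatetimeIndex) into 12 sub-series,
    pooling the same calendar month across all years."""
    return {m: series[series.index.month == m] for m in range(1, 13)}

# Synthetic daily PM-like series (placeholder values, not real measurements).
idx = pd.date_range("2015-01-01", "2018-12-31", freq="D")
pm = pd.Series(np.random.default_rng(1).uniform(5.0, 80.0, len(idx)), index=idx)

partitions = monthly_partitions(pm)
# Under the proposed (PP) approach, one predictor is trained per partition
# (12 models in total); the traditional (TR) approach fits one model to `pm`.
```

Each monthly sub-series pools that month's samples from every available year, so a "January predictor" only ever sees January behavior, which is the mechanism that exploits the seasonal character of PM concentrations.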
Four different neural network architectures, a linear model, and a persistence model were addressed using both methodologies. Our approach, based on time series disaggregation, improved air pollutant concentration forecasting, showing the best capability to predict air pollution in 94% of the scenarios analyzed. It is important to highlight our approach's robustness, since the predictors increased their performance compared to the traditional method across databases from different locations and pollutants.
Regarding the best forecasting approach, the RBF and ESN neural networks achieved the smallest errors in most cases when using the new decomposition approach, followed by the MLP. However, we cannot state which ANN is the most adequate neural model, reinforcing the necessity of testing various architectures. On the other hand, the ANNs overcame the AR model and the persistence approach.
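As a point of reference for the baselines mentioned above, the persistence model simply repeats the last observed value; the tiny series below is illustrative only.

```python
import numpy as np

def persistence_forecast(y):
    """Naive persistence model: the prediction for day t + 1 is the value
    observed at day t. Returns forecasts aligned with the targets y[1:]."""
    return y[:-1]

# Illustrative daily concentrations (placeholder values).
y = np.array([10.0, 12.0, 11.0, 15.0])
preds, targets = persistence_forecast(y), y[1:]
mse = np.mean((preds - targets) ** 2)  # (4 + 1 + 16) / 3 = 7.0
```

Despite its simplicity, persistence is a standard sanity-check baseline in air quality forecasting: any trained model should beat it, which the ANNs here did in nearly all scenarios.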
Considering the United Nations sustainable development goals (SDG) [79], which aim to change the world, our study makes an important contribution to foreseeing air quality, helping governments take measures aimed at air pollution reduction and prepare hospitals during events of extreme air pollution. This relates to SDG 3-good health and well-being and SDG 11-sustainable cities and communities.
Future works should include the use of variable-sized time windows in the time series decomposition, the selection and combination of predictors, and the application of other computational intelligence approaches, such as deep learning, hybrid methods, and ensembles. Additionally, exogenous variables, such as meteorological information, could be addressed as inputs to the models [80], as well as other time series and variable sizes of the training, validation, and test sets.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
In Tables A1-A7, the error metrics for all series addressed are presented. For each month, we highlighted in bold the metrics of the model that attained the best MSE value and, in a shade of gray, the best value of each metric. In Table A8, the acronym list of this paper is presented, including, among others:

PM 10 - particulate matter with aerodynamic diameter less than or equal to 10 µm
PM 2.5 - particulate matter with aerodynamic diameter less than or equal to 2.5 µm
PP - proposed forecasting method (which considers decomposition)
RBF - Radial Basis Function Networks
RMSE - Root Mean Squared Error
TR - traditional forecasting method