Self-Organizing Topological Multilayer Perceptron: A Hybrid Method to Improve the Forecasting of Extreme Pollution Values

Abstract: Forecasting air pollutant levels is essential in regulatory plans focused on controlling and mitigating air pollutants, such as particulate matter. Focusing the forecast on air pollution peaks is challenging and complex, since the pollutant time series behavior is not regular and is affected by several environmental and urban factors. In this study, we propose a new hybrid method based on artificial neural networks to forecast daily extreme events of PM 2.5 pollution concentration. The hybrid method combines self-organizing maps, which identify temporal patterns of excessive daily pollution found at different monitoring stations, with a set of multilayer perceptrons that forecast extreme values of PM 2.5 for each cluster. The proposed model was applied to five years of pollution data obtained from nine monitoring stations in the metropolitan area of Santiago, Chile. Simulation results show that the hybrid method improves performance metrics when forecasting daily extreme values of PM 2.5 .


Introduction
Air quality monitoring is important for the sustainable growth of cities, mitigating the risks that may affect human health. According to the United States Environmental Protection Agency [1], particle pollution includes inhalable particles of 2.5 microns or smaller (PM 2.5 ) and 10 microns or smaller (PM 10 ), whose sources include construction sites, unpaved roads, and forest fires, to name a few. When inhaled, these particles may cause serious health problems [2][3][4]. There is a correlation between air pollution exposure and severe health concerns such as the incidence of respiratory and cardio-respiratory diseases [5] and even death. Thus, PM 2.5 concentrations are measured constantly in major cities to determine whether they are under the national permissible limit [6][7][8]. Since air pollutant concentrations are influenced by emission levels, meteorological conditions, and geography, forecasting the intensity of pollution levels is essential to be prepared for the next extreme event.
Accurately predicting the level of pollution is of great importance. However, it is also challenging because the time series of PM 2.5 exhibit non-linear, time-varying behavior with sudden events [9]. Therefore, different approaches have been used to address this challenge, such as statistical methods [10,11], machine learning algorithms [12], artificial neural networks (ANNs) [13][14][15], deep learning algorithms in general [16,17], and other hybrid methods [18][19][20]. In particular, ANNs have had great influence and wide applicability as a non-linear tool in the forecasting of time series [21,22], where different architectures have been used: a feed-forward neural network (FFNN) [23], an Elman neural network [24], a recursive neural network [25], and adaptive neuro-fuzzy inference systems [26]. These methods successfully model PM 2.5 and other pollution levels, generating accurate forecasts given their non-linear mapping and learning capabilities. Deep learning methods have been successfully applied to predict PM 2.5 concentrations [27]. For instance, attention-based LSTM [17] and convolutional neural network (CNN)-LSTM models [16,28] have been used to improve the performance of PM 2.5 forecasting using data from multiple monitoring locations. Even though predicting average PM 2.5 concentrations is a problem reasonably well addressed in the literature with various techniques, approaches that can also forecast pollution peaks remain challenging.
We consider extreme events to be those observations in a sample that are unusually high or low and therefore occur in the tails of a probability distribution. Understanding environmental extremes and forecasting extreme values is challenging and of paramount importance in weather and climate applications [29], such as forecasting and identifying trends in extreme levels of air pollution. Although extreme values can be momentary in time, they have the potential to recirculate within an urban area and, therefore, move around and affect other nearby locations, with significant adverse health implications [30]. Extreme value data may often exhibit excess kurtosis and/or prominent right tails. Therefore, their exposure impact increases and is much more difficult to predict.
Several authors have proposed statistical methods to analyze and forecast environmental extremes. For instance, Zhang et al. [31] analyzed the 95th percentile of historical data to study the relationship between extreme concentrations of ozone and PM 2.5 events, together with meteorological variables such as the maximum daily temperature, the minimum relative humidity, and the minimum wind speed. Bougoudis et al. [32] presented a low-cost, rapid forecasting hybrid system that makes it possible to predict extreme values of atmospheric pollutants. Mijić et al. [33] studied the daily average pollutant concentrations by fitting an exponential distribution. Zhou et al. [4] applied extreme value theory using the generalized Pareto distribution (GPD) to model the return pollution levels at different periods, fitting the GPD to PM 10 , NO 2 , and SO 2 data in Changsha, China. Zhou et al. [4] and Ercelebi et al. [34] reported that the distribution might vary at different stations due to specific factors and conditions.
In this work, we propose a new hybrid method based on machine learning to predict extreme daily events of PM 2.5 pollution concentration. Without loss of generality, in this study, we consider extreme events to be the highest daily levels, particularly the 75th and 90th percentiles. Predicting these excessive daily levels is complex because their behavior is not regular and is sensitive to environmental and urban factors. The proposed hybrid method combines the unsupervised self-organizing map (SOM) [35,36] with the supervised multilayer perceptron (MLP) [37] for time series forecasting. For this reason, the hybrid method is called the Self-Organizing Topological Multilayer Perceptron (SOM-MLP). The main idea behind the model is to identify temporal patterns of extreme daily pollution found at different monitoring stations. First, a self-organizing map is applied to cluster time series segments that present similar behaviors. Second, a non-linear auto-regressive model is constructed using a multilayer perceptron for each cluster. The method is used to predict the daily extreme values of PM 2.5 concentration depending on the pattern of the historical time series data obtained from several monitoring stations in the metropolitan area of Santiago, Chile. Lastly, a gate function is applied to aggregate the predictions of each model.
The paper is structured as follows. In Section 2, we describe the theoretical framework of the multilayer perceptron and self-organizing map artificial neural networks. We introduce our proposed hybrid model in Section 3. In Sections 4 and 5, we show the simulation results and comparative analysis. A discussion is presented in Section 6, and in Section 7, we offer concluding remarks and some future challenges.

Time Series Forecasting
A time series is a sequence of observed values x t recorded at specific times t [38]. It represents the evolution of a stochastic process, which is a sequence of random variables indexed by time, {X t : t ∈ Z}. A time series model provides a specification of the joint distributions of these random variables X t , capturing the underlying patterns and dependencies in the data.
Although many traditional models for analyzing and forecasting time series require the series to be stationary, non-stationary time series are commonly encountered in real-world data. Non-stationarity arises when the statistical properties of the series change over time, such as trends, seasonality, or shifts in mean and variance. Dealing with non-stationary time series poses challenges because standard techniques assume stationarity.
To handle non-stationary time series, various methods have been developed. One approach is to transform the series to achieve stationarity, such as differencing to remove trends or applying logarithmic or power transformations to stabilize the variance. Another approach is to explicitly model and account for non-stationarity, for example, by incorporating trend or seasonal components into the models.
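As a minimal illustration of the first approach (not from the paper, with a synthetic trending series), differencing and a log transform can be sketched as follows:

```python
import numpy as np

def difference(series, lag=1):
    """Differencing x_t - x_{t-lag}: lag=1 removes a trend, lag=s removes seasonality."""
    series = np.asarray(series, dtype=float)
    return series[lag:] - series[:-lag]

def log_transform(series):
    """Logarithm stabilizes a variance that grows with the level of the series."""
    return np.log(np.asarray(series, dtype=float))

trend = np.arange(1.0, 11.0)   # 1, 2, ..., 10: a deterministic linear trend
diffed = difference(trend)     # constant 1.0 everywhere: the trend is removed
logged = log_transform(trend)
```

Note that differencing shortens the series by `lag` observations, which matters when aligning transformed inputs with forecasting targets.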
In recent years, advanced techniques have been proposed to handle non-stationary time series effectively. These include adaptations, transformations, and generalizations of classical parametric and non-parametric methods, as well as modern machine and deep learning approaches. Neural networks, including the multilayer perceptron (MLP) and self-organizing maps (SOM), have been applied with success because they can capture complex patterns and dependencies in non-stationary data, offering promising results.
In time series forecasting, MLPs can be trained to predict future values based on past observations. The network takes into account the temporal dependencies present in the data and learns to approximate the underlying mapping between input sequences and output forecasts. Various training algorithms, such as backpropagation, can be used to optimize the network's weights and biases. Similarly, SOMs can be employed to discover patterns and structure within time series data. By projecting high-dimensional time series onto a 2D grid, SOMs reveal clusters and similarities between different sequences. This can assist in identifying distinct patterns, understanding data dynamics, and providing insights for further analysis.
Both MLPs and SOMs offer valuable tools for time series analysis: MLPs focus on prediction and forecasting tasks, while SOMs excel in visualization and clustering. Their application in time series analysis depends on the specific problem, dataset characteristics, and objectives of the analysis. In the next subsections, we briefly describe both the MLP and SOM neural networks.

Artificial Neural Networks
Artificial neural networks have received significant attention in engineering and science. Inspired by the study of brain architecture, ANNs represent a class of non-linear models capable of learning from data [39]. Some of the most popular models are the multilayer perceptron and the self-organizing map. The essential features of an ANN are the basic processing elements referred to as neurons or nodes, the network architecture describing the connections between nodes, and the training algorithm used to estimate the values of the network parameters. Researchers see ANNs as either highly parameterized models or semiparametric structures [39]. ANNs can be considered as hypotheses of the parametric form h(•; w), where hypothesis h is indexed by the vector of parameters w. The learning process consists of estimating the value of the vector of parameters w to adapt learner h to perform a particular task.

Multilayer Perceptron
The multilayer perceptron (MLP) model consists of a set of elementary processing elements called neurons [23,[40][41][42][43]. These units are organized in an architecture with three layers: the input, the hidden, and the output layers. The neurons of one layer are linked to the neurons of the subsequent layer. Figure 1 illustrates the architecture of this artificial neural network with one hidden layer. The non-linear function g(x, w) represents the output of the model, where x is the input signal and w is its parameter vector. For a three-layer ANN (one hidden layer), the kth output computation is given by the following equation:

$$ g_k(\mathbf{x}, \mathbf{w}) = f_2\left( \sum_{j=1}^{\lambda} w_{kj}^{(2)} \, f_1\left( \sum_{i=1}^{d} w_{ji}^{(1)} x_i + b_j^{(1)} \right) + b_k^{(2)} \right), $$

where λ is the number of hidden neurons. An important factor in the specification of neural network models is the choice of the activation function (one of the most commonly used is the sigmoid). These can be non-linear functions as long as they are continuous, bounded, and differentiable. The transfer function of the hidden neurons, f 1 (•), should be non-linear, while for the output neurons, function f 2 (•) can be a linear or a non-linear function.
The MLP operates as follows. The input layer neurons receive the input signal. These neurons propagate the signal to the first hidden layer and do not conduct any processing. The first hidden layer processes the signal and transfers it to the subsequent layer; the second hidden layer propagates the signal to the third, and so on. When the output layer receives and processes the signal, it generates a response. The MLP learns the mapping between input space X and output space Y by adjusting the connection strengths between neurons, w = {w 1 , ..., w d }, called weights. Several techniques have been created to estimate the weights, the most popular being the backpropagation learning algorithm.
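The forward pass just described can be sketched as follows. This is a minimal illustration with a sigmoid hidden activation f 1 and a linear output f 2; the weight shapes and random initialization are assumptions for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Three-layer MLP: the input layer passes x through untouched,
    the hidden layer applies f1 = sigmoid, the output layer is linear (f2)."""
    h = sigmoid(W1 @ x + b1)   # hidden activations
    return W2 @ h + b2         # K outputs

rng = np.random.default_rng(0)
d, lam, K = 4, 3, 1            # input size, hidden neurons (lambda), outputs
W1, b1 = rng.normal(size=(lam, d)), np.zeros(lam)
W2, b2 = rng.normal(size=(K, lam)), np.zeros(K)
y = mlp_forward(rng.normal(size=d), W1, b1, W2, b2)
```

In practice, the weights would be estimated by backpropagation rather than left at their random initial values.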


Self-Organizing Maps
The SOM, introduced by T. Kohonen [35], is an artificial neural network with unsupervised learning. The model projects a topology-preserving mapping from the high-dimensional input space onto a low-dimensional display (see Figure 2). This model and its variants have been successfully applied in several areas [44].
Map M consists of an ordered set of M prototypes w k ∈ W ⊆ R D , k = 1, ..., M, with a neighborhood relation between these units forming a grid, where k indexes the location of the prototype in the grid. The most commonly used lattices are the linear, the rectangular, and the hexagonal arrays of cells. In this work, we consider the hexagonal grid, where κ(w k ) is the vectorial location of unit w k in the grid. When a data vector x ∈ R D is presented to model M, it is projected to a neuron position of the low-dimensional grid by searching for the best matching unit (bmu), i.e., the prototype that is closest to the input, obtained as follows:

$$ c(\mathbf{x}) = \arg\min_{k} \, d(\mathbf{x}, w_k), $$

where d(•, •) is some user-defined distance metric (e.g., the Euclidean distance). This model's learning process consists of moving the reference vectors toward the current input by adjusting the prototype's location in the input space. The winning unit and its neighbors adapt to represent the input by applying the following learning rule iteratively:

$$ w_k(t+1) = w_k(t) + \alpha(t) \, \eta_{c(\mathbf{x})}(k, t) \, \big( \mathbf{x} - w_k(t) \big), $$

where the size of the learning step of the units is controlled by both the learning rate parameter 0 < α(t) < 1 and the neighborhood kernel η c(x) (k, t). The learning rate function α(t) is monotonically decreasing with respect to time. For example, this function could be linear,

$$ \alpha(t) = \alpha_0 + (\alpha_f - \alpha_0) \, \frac{t}{t_\alpha}, $$

where α 0 is the initial learning rate (<1.0), α f is the final rate (≈0.01), and t α is the maximum number of iteration steps needed to arrive at α f . The neighborhood kernel is defined as a decreasing function of the distance between unit w k and bmu w c(x) on the map lattice at time t. The kernel is usually a Gaussian function.
In practice, the neighborhood kernel is chosen to be wide at the beginning of the learning process to guarantee the global ordering of the map, and both its width and height decrease slowly during learning. Repeated presentations of the training data thus lead to topological order. We can start from an initial state of complete disorder, and the SOM algorithm gradually leads to an organized representation of activation patterns drawn from the input space [45]. In recent years, there have been some improvements to this model; for example, Salas et al. [46] added flexibility and robustness to the SOM, and Salas et al. [47] proposed a combination of SOM models.
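The training loop above can be sketched as follows. This is a minimal illustration using a rectangular grid for simplicity (the paper uses a hexagonal lattice); the grid size, iteration count, and neighborhood schedule are assumptions for the example:

```python
import numpy as np

def train_som(data, grid=(3, 3), n_iter=300, a0=0.5, af=0.01, seed=0):
    """Minimal SOM sketch: bmu search, linear learning rate, Gaussian neighborhood."""
    rng = np.random.default_rng(seed)
    n_units = grid[0] * grid[1]
    W = rng.normal(size=(n_units, data.shape[1]))                # prototypes w_k
    coords = np.array([(i, j) for i in range(grid[0])            # grid positions
                       for j in range(grid[1])], dtype=float)    # kappa(w_k)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        c = int(np.argmin(np.linalg.norm(W - x, axis=1)))        # best matching unit
        alpha = a0 + (af - a0) * t / n_iter                      # linear learning rate
        sigma = max(grid) * (1.0 - t / n_iter) + 0.5             # shrinking neighborhood
        eta = np.exp(-np.sum((coords - coords[c]) ** 2, axis=1) / (2 * sigma ** 2))
        W += (alpha * eta)[:, None] * (x - W)                    # move units toward x
    return W

data = np.vstack([np.zeros((20, 2)), 5.0 * np.ones((20, 2))])    # two obvious clusters
W = train_som(data)
```

The wide initial neighborhood drags the whole map toward the data, and the shrinking kernel then lets individual units specialize.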

A Self-Organizing Topological Multilayer Perceptron for Extreme Value Forecasting
The proposed model can capture the essential information from the topological structure of the time series, where the best-matching units of the SOM cluster together similar regimes of the time series. The MLPs, acting as non-linear autoregressive models, are expected to improve the combined prediction. Moreover, fewer units learn about the extreme events, so their respective MLPs specialize in these types of episodes.
The scheme of the architecture of the SOM-MLP hybrid model is shown in Figure 3. The framework of the SOM-MLP hybrid model consists of four stages. The data are preprocessed and structured in time segments in the first stage. In the second stage, the SOM projects the time segments to its units. The MLP obtains the extreme value forecast in the third stage. Finally, the outputs are combined with a gating unit to obtain the final response in the fourth stage.
1. Stage 1. The monitoring station data are combined into one large data set. Then, we proceed to normalize all the records to the range [0, 1]:

$$ \tilde{X} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}. $$

This article's data consist of observations collected hourly, so 24 samples are available daily. The day vector is defined as X t,s = [X 1 t,s , ..., X 24 t,s ], where X l t,s is the sample of day t at the lth hour for station s. On the one hand, target y t+1,s is built by obtaining a defined percentile of the next day, i.e., y t+1,s = Percentile γ (X 1 t+1,s , ..., X 24 t+1,s ). For example, this work considers the 75th and 90th percentiles (γ is 75 and 90, respectively). On the other hand, input vector X t,s is constructed as the time segment of the selected day-lags. For instance, if we select a lag of p days, the time segment is built as the concatenation of the corresponding day vectors:

$$ \mathbf{X}_{t,s} = \big[ X^1_{t,s}, \ldots, X^{24}_{t,s}, \; X^1_{t-1,s}, \ldots, X^{24}_{t-1,s}, \; \ldots, \; X^1_{t-p,s}, \ldots, X^{24}_{t-p,s} \big]. $$

2. Stage 2. This stage aims to recognize topological similarities in the time segments using the SOM network. The SOM model is built with K units corresponding to the optimal number of clusters to group the vectors X t,s of each station s. These daily segments are then used to forecast the value for the following day. In this sense, the SOM clusters the segments X t,s with similar contamination patterns for each monitoring station.
The nodes are expected to learn contamination patterns; therefore, some of these nodes could have associated high-pollution episodes. The SOM network receives the 24 h vectors from each station and associates each with the node having the most similar pollution pattern, which could correspond to a low-, intermediate-, or high-pollution episode. These episodes can be found at any of the stations. For this reason, the SOM is performed for each station independently.
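The Stage 1 construction of day vectors, lag concatenation, and percentile targets can be sketched as follows. The lag set (0, 1, 7 days, matching the lags used later in the experiments) and the synthetic data are illustrative:

```python
import numpy as np

def build_segments(hourly, lags=(0, 1, 7), gamma=90):
    """Inputs and targets for one station.

    hourly: (n_days, 24) array, already min-max scaled to [0, 1].
    lags:   day lags whose 24 h vectors are concatenated into the input X_t.
    gamma:  percentile of the NEXT day's 24 samples used as the target y_{t+1}.
    """
    max_lag = max(lags)
    X, y = [], []
    for t in range(max_lag, hourly.shape[0] - 1):
        X.append(np.concatenate([hourly[t - p] for p in lags]))
        y.append(np.percentile(hourly[t + 1], gamma))
    return np.array(X), np.array(y)

hourly = np.random.default_rng(1).random((30, 24))   # 30 synthetic days
X, y = build_segments(hourly)                        # lags 0, 1, 7 -> 72-dim inputs
```

With three day-lags, each input segment has 3 × 24 = 72 components, matching the MLP input layer size used in the experiments.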
The SOM model is constructed with K units in a hexagonal lattice. To define the number of units K, the elbow method, the Calinski-Harabasz index, or the Gap statistic can be used. The Within Cluster Sum of Squares (WCSS) measures the total squared distance of all the points within a cluster to the cluster centroid.
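The WCSS computation behind the elbow method can be sketched as follows. The toy data and the use of a small k-means loop (as a stand-in for the SOM, just to produce centroids at each K) are assumptions for the illustration:

```python
import numpy as np

def wcss(data, centroids):
    """Within-Cluster Sum of Squares: squared distance of every point to its nearest centroid."""
    d2 = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2) ** 2
    return float(d2.min(axis=1).sum())

def kmeans(data, K, n_iter=20, seed=0):
    """A few Lloyd iterations; produces centroids for the elbow scan."""
    rng = np.random.default_rng(seed)
    cents = data[rng.choice(len(data), size=K, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(data[:, None] - cents[None], axis=2), axis=1)
        for k in range(K):
            if np.any(labels == k):
                cents[k] = data[labels == k].mean(axis=0)
    return cents

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(loc, 0.2, size=(30, 2)) for loc in (0.0, 5.0, 10.0)])
curve = [wcss(data, kmeans(data, K)) for K in range(1, 7)]   # elbow expected near K = 3
```

Plotting `curve` against K and reading off the bend is the elbow criterion; the Calinski-Harabasz and Gap criteria start from the same within-cluster dispersion quantities.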
The elbow method graphs the WCSS as a function of the number of clusters, where the bend of the curve indicates the minimum number of units required by the SOM [48,49]. The Calinski-Harabasz index assesses the relationship between the variance within clusters and between clusters [50], where the optimal number of clusters maximizes this index [51]. The Gap statistic compares the within-cluster dispersion to its expectation under an appropriate null reference distribution [52].
3. Stage 3. The SOM network assigns each time segment X t to its best matching unit, bmu, i.e., the node with the most similar contamination pattern is associated with the 24-hour vector as follows:

$$ c(\mathbf{X}_t) = \arg\min_{k} \, d(\mathbf{X}_t, w_k). $$

For each node of the SOM, an MLP is trained to predict the next day's extreme value y t+1 based on the inputs X t associated with that bmu. The MLP contains an input layer with D neurons, one hidden layer with λ neurons, and an output layer with one neuron, where D is the length of the time segment input vector X t and number λ is user defined.
4. Stage 4. The individual outputs of the MLPs are combined using a combiner operator to generate the final output. We denote the output of the kth MLP, corresponding to the kth unit of the SOM, as g (k) , k = 1, ..., K. In this article, we test the following combining operators, which we call gates:
(a) Best Matching Unit Gate: this gate lets through only the signal from the MLP model corresponding to the best matching unit.
(b) Mean Gate: this gate obtains the average of the MLPs' outputs:

$$ g = \frac{1}{K} \sum_{k=1}^{K} g^{(k)}. $$

(c) Softmax Gate: this gate computes a softmax-weighted mean of the MLPs' outputs:

$$ g = \sum_{k=1}^{K} \mathrm{softmax}\big( g^{(1)}, g^{(2)}, \ldots, g^{(K)} \big)_k \, g^{(k)}. $$

(d) Maximum Gate: this gate computes the maximum of the outputs of the MLPs:

$$ g = \max_{k} \, g^{(k)}. $$
(e) BMU-MAX Gate (GATE_BM): this gate combines the Best Matching Unit Gate and the Maximum Gate. The gate is controlled by an on-off parameter that depends on either the moment of the year or the variability of the pollution level:

$$ g = \begin{cases} \max_{k} \, g^{(k)} & \text{if } V_x > \theta, \\ g^{(c(\mathbf{X}_t))} & \text{otherwise}, \end{cases} $$

where V x is the variability of the input data, and θ is a user-defined threshold.
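The gates can be sketched as follows. The softmax-weighted mean and the variability switch in GATE_BM are our reading of the operators described above, with `theta` a user-chosen threshold:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))      # shift for numerical stability
    return e / e.sum()

def gate(outputs, bmu, mode="bmu", variability=0.0, theta=0.5):
    """Combine the K MLP outputs g^(k) into a single forecast."""
    g = np.asarray(outputs, dtype=float)
    if mode == "bmu":              # (a) signal of the best matching unit only
        return float(g[bmu])
    if mode == "mean":             # (b) plain average of all MLP outputs
        return float(g.mean())
    if mode == "softmax":          # (c) softmax-weighted mean
        return float(softmax(g) @ g)
    if mode == "max":              # (d) maximum output
        return float(g.max())
    if mode == "bmu_max":          # (e) on-off switch between (a) and (d)
        return float(g.max()) if variability > theta else float(g[bmu])
    raise ValueError(f"unknown gate: {mode}")

outputs = [0.2, 0.9, 0.4]          # toy outputs of K = 3 MLPs
```

The BMU-MAX switch lets the system fall back on the most pessimistic forecast when the input is volatile, which is exactly when extreme events are most likely to be missed.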

Data Understanding
Santiago de Chile (SCL) is located at 33.4° S, 70.6° W, and has more than six million inhabitants. The metropolitan area of Santiago is one of the most polluted locations in South America, with unfavorable conditions for pollutant dispersion due mainly to its Mediterranean climate and its topography. This condition worsens during the cold season [53], when high concentrations of PM 2.5 are observed, mainly due to the fuels used in industry and transportation, in addition to the prevailing topographic and meteorological conditions [54].
Nine stations belonging to the National Air Quality Information System (SINCA) are placed in different locations throughout the metropolitan area of Santiago. These stations constantly monitor air pollutants and meteorological conditions. Based on the concentration level of PM 2.5 measured per m 3 of air, air quality is classified into five levels, as shown in Table 1. Regulations assert that when the average concentrations during the last 24 hours are classified at Level 3 or higher, restrictions on emissions must be applied immediately. The map of Santiago, Chile, shown in Figure 4 contains the locations of the monitoring stations; it was produced with the ArcGIS 10.4.1 software (https://n9.cl/arcgis-chile, accessed on 21 October 2023), using shapefiles of South America, including the Pacific Ocean and neighboring countries.

Performance Metrics
Analysis of the forecasting performance involves calculating accuracy measures that evaluate the distance between the observed and forecast values. The following indicators were used to evaluate the performance of the proposed method: root mean squared error (RMSE), mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), Spearman correlation index, Pearson coefficient, and coefficient of determination.

1. Root Mean Squared Error (RMSE):

$$ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2 } $$

2. Mean Absolute Error (MAE):

$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |O_i - P_i| $$

3. Mean Absolute Percentage Error (MAPE):

$$ \mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{O_i - P_i}{O_i} \right| $$

4. Spearman Correlation Index:

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)} $$

5. Pearson coefficient:

$$ r = \frac{\sum_{i=1}^{n} (O_i - \bar{O})(P_i - \bar{P})}{\sqrt{\sum_{i=1}^{n} (O_i - \bar{O})^2} \, \sqrt{\sum_{i=1}^{n} (P_i - \bar{P})^2}} $$

6. Coefficient of determination:

$$ R^2 = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2} $$

where O i and P i are the observed and predicted forecast values, respectively, n is the number of observations, Ō and P̄ are the averages of the observed and predicted values, respectively, and d i is the difference between the ranks of each pair of values in P and O.
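The error metrics can be implemented directly; the small observed/predicted vectors below are illustrative:

```python
import numpy as np

def rmse(o, p):
    return float(np.sqrt(np.mean((np.asarray(o) - np.asarray(p)) ** 2)))

def mae(o, p):
    return float(np.mean(np.abs(np.asarray(o) - np.asarray(p))))

def mape(o, p):
    o, p = np.asarray(o, dtype=float), np.asarray(p, dtype=float)
    return float(100.0 * np.mean(np.abs((o - p) / o)))   # observed values must be nonzero

def r2(o, p):
    o, p = np.asarray(o, dtype=float), np.asarray(p, dtype=float)
    return float(1.0 - np.sum((o - p) ** 2) / np.sum((o - o.mean()) ** 2))

obs  = [10.0, 20.0, 30.0, 40.0]   # observed values O_i
pred = [12.0, 18.0, 33.0, 41.0]   # predicted values P_i
```

Note that MAPE is undefined when an observed value is zero, which can occur in pollution series during very clean periods; the scale-free R² and correlation indices are robust to that case.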

Exploratory Data Analysis
The behavior of the contaminant at each monitoring station is summarized in the histograms and boxplots of Figures 5 and 6, respectively. In general, as can be seen in these figures, the stations present a probability distribution with a heavy tail, which is evidenced by the appearance of high-contamination episodes. The average PM 2.5 in CNA is 31.49 ± 30.22 µg/m 3 (Table 2), which indicates a certain decrease in contamination compared to the results of the study carried out in [55] during 2010-2012.

Determining the Number of Nodes
Figure 7 shows the graphs of (a) the elbow method, (b) the Calinski-Harabasz index, and (c) the Gap method used to find the optimal number of clusters. First, for the elbow method, the Within Cluster Sum of Squares (WCSS) is computed for each number of clusters. The WCSS is the sum of the squared distances of each data point in all clusters to their respective centroids. Figure 7a shows that the WCSS decreases with a higher number of centroids, and with K = 9, the WCSS seems to converge. In the application of the Calinski-Harabasz criterion (Figure 7b), we find that in seven of the nine stations, the index begins to grow from k = 9, and after a k greater than 12, the index value seems to stabilize. In this case, there is no global maximum, but the optimal value can be considered to be K = 9. In Figure 7c, the results of the Gap statistic method are presented. According to this criterion, the structure of the data is better defined as a single large cluster, although in many stations the optimal values for k lie between 4 and 12. The three methods thus converge in determining that the optimal number of centroids is nine, so we set K = 9.

Performance Results
In this section, an evaluation of the performance of the proposed SOM-MLP model is carried out by comparing it with a global neural network (MLP-Global) and with neural networks trained for each monitoring station (MLP-Station). The proposed model is applied to predict the next day's extreme values y t+1 given by the 75th and 90th percentiles of that day, Percentile γ (X 1 t+1 , ..., X 24 t+1 ). The input vector is the time segment X t constructed with day lags X t , X t−1 and X t−7 . In this case, 72 neurons are used for the input layer, 96 neurons for the hidden layer, and one neuron for the output layer of the MLP models.
The data used in this study correspond to the concentrations of the PM 2.5 pollutant obtained on an hourly scale, collected from 1 January 2015 to 30 September 2019. The data are obtained from SINCA, where a total of nine monitoring stations are considered: Cerro Navia (CNA), Independencia (IND), Las Condes (LCD), Pudahuel (PDH), Parque O'Higgins (POH), La Florida (LFL), El Bosque (EBQ), Puente Alto (PTA) and Talagante (TLG) (see Figure 4).
The data set is divided into training (80%) and test (20%) sets, where the latter contains the last 365 days. The training set is used to find the parameters of the neural network models, and the test set is used to evaluate the prediction performance of the models. The three approaches and the evaluation of the SOM-MLP model performance are implemented through a hold-out validation scheme using 1360 days for training and 365 days for testing.
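The hold-out scheme can be sketched as a simple chronological split; the array contents below are synthetic:

```python
import numpy as np

def holdout_split(X, y, n_test=365):
    """Chronological hold-out: the last n_test days form the test set."""
    return (X[:-n_test], y[:-n_test]), (X[-n_test:], y[-n_test:])

n_days = 1725                              # 1360 training + 365 test days
X = np.zeros((n_days, 72))                 # 72-dimensional input segments
y = np.arange(n_days, dtype=float)         # synthetic daily targets
(X_train, y_train), (X_test, y_test) = holdout_split(X, y)
```

Splitting chronologically, rather than at random, prevents future observations from leaking into the training set, which would inflate the reported forecasting performance.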
Table 3 shows the performance of the global MLP method (MLP-Global) and the MLP constructed for each station (MLP-Station) compared to our proposed hybrid method SOM-MLP with 4, 9 and 25 neurons. The SOM-MLP shows a good compromise between complexity and performance and outperforms the MLP-Global and MLP-Station alternatives. Figures 8 and 9 show the boxplots of the performance metrics for each gate for the 75th and 90th percentiles, respectively. Among these are the RMSE, the Spearman correlation index, and the coefficient of determination. For the 90th percentile, the GATE_BM is observed to have the best performance. Thus, for extreme values, it is convenient to use the GATE_BM, which is not the case for the 75th percentile, since other operators such as BMU, mean, and softmax are suitable in this scenario. Tables 4-7 report the performances of the models. Furthermore, Figures 10 and 11 show the behavior of the MLP-Station and the hybrid model for each monitoring station, respectively. The highest R 2 values of the model fit are observed at the El Bosque, La Florida and Talagante monitoring stations.

Discussion
Extreme events in time series refer to episodes with unusually high values that can occur infrequently, sporadically, or frequently [56]. Predicting such extreme events remains a significant challenge [57] due to factors like limited available data, dependence on exogenous variables, or long-memory dependence. Qi et al. [57] emphasized the need for neural networks to exhibit adaptive skills in diverse statistical regimes beyond the training dataset.
To address this challenge, we propose the SOM-MLP hybrid model, which segments the time series into clusters of similar temporal patterns for each station. The MLP acts as a non-linear autoregressive model adapted to these patterns. Our hybrid method demonstrates improved prediction performance, as shown in Table 3, with the SOM-MLP model consisting of nine nodes achieving the best results compared to the MLP-Global and MLP-Station models, exhibiting lower MAE and MAPE metrics.
We utilize aggregating operators for the SOM-MLP model with nine nodes to enhance computational efficiency for both the 75th and 90th percentiles. The best matching unit (BMU) operator performs well for the 75th percentile (see Tables 4 and 5), while the BMU-MAX gate yields the best results for the 90th percentile (see Tables 6 and 7). These operators significantly improve the prediction of extreme values, especially during periods of high magnitude and volatility.
The proposed hybrid method, SOM-MLP, effectively captures local similarities in the hourly series, enhancing the forecasting of the next day's 75th or 90th percentile. While estimating peak values remains challenging, particularly in volatile time series, our approach successfully captures trends and behaviors in each node's segmented series. This highlights the potential of our proposal for improving extreme value predictions in various time series applications.

Conclusions
In this study, we propose a strategy to capture the essential information regarding the topological structure of PM 2.5 time series. Our hybrid model demonstrates improved forecasting capabilities, particularly for extreme values. To the best of our knowledge, this is the first paper to employ this approach for predicting extreme air quality time-series events in Santiago, Chile. We observe that recognizing patterns in individual or clustered time series is essential for defining the hybrid model. The comparison of the evaluated models' performance indicates the potential for improvement through a combined optimization of operators.
Our results demonstrate that the proposed hybrid method outperforms the MLP method applied per station, which, in turn, exhibits superior performance compared to the global method. Consequently, these results pave the way for future stages in which optimization processes can be implemented within the MLP network structure. Additionally, it is worth exploring the use of alternative neural networks or other methods of interest to enhance the hybrid method further.
This proposal serves as a sophisticated tool for assessing air quality and formulating protective policies for the population. Moreover, the proposed model can be extended to other pollutants, with a specific emphasis on capturing their extreme values and on multivariate analysis across seasons, which is often neglected. For instance, in [58], the authors mention their developed model's inability to predict extreme ozone values; our proposed model can be applied to analyze this pollutant in future work. Similarly, in [10], a visualization approach for PM 10 is proposed, which can be complemented by applying our hybrid method for predictive purposes. Furthermore, this study's framework can be extrapolated to investigate PM 10 in the city of Lima, Peru using neural networks [14].
It is important to highlight that the proposed methodology has some limitations. First, the technique is developed to analyze environmental pollution data based on univariate time series from various sources. Further studies are required for its validation in other contexts (for example, in data related to air quality [59][60][61], solid waste [62], academic performance [63], digital marketing [64], or energy efficiency [65,66]). Another significant limitation is that the methodology does not preserve the time series structure, since it assumes an auto-regressive model with a predefined lag size.
The proposed approach can also be extended to support the analysis of the behavior of other pollutants in multidimensional cases. Further studies are needed to evaluate the proposed methodology when the time series is decomposed into trend, seasonal, and long-term components, which can be input to the model separately or together after removing the noise, besides considering the incorporation of regularization techniques such as Dropout [67] and other strategies to avoid over-parameterization. Moreover, the models can be further improved by introducing geospatial information.

Figure 1. Schematic of the architecture of the MLP. The figure shows three layers of neurons: input, hidden, and output layers.

Figure 2. Scheme of the architecture of self-organizing maps. This model consists of a single layer of neurons in a discrete lattice called a map. The SOM projects the high-dimensional data into a discrete low-dimensional map.

Figure 3. Proposed self-organized topological multilayer perceptron. In the first stage (a), time series are collected from the monitoring stations. In the second stage (b), the self-organizing maps find similar topologies in each monitoring station (complemented by other clustering criteria, such as the elbow, Calinski-Harabasz, and gap methods). In the third stage (c), the SOM projects the time segments, which generates the formation of clusters; an MLP is trained to predict each unit's extreme values for the next day. In the fourth stage (d), a combiner of the best results of the previous stage is evaluated.

Figure 4. Map of the metropolitan area of Santiago, Chile (SCL), together with the locations of the nine pollutant and weather monitoring stations that belong to SINCA.

Figure 7. (a) Elbow method, (b) Calinski-Harabasz index, and (c) gap method to determine the optimal number of clusters. The three methods converge in determining that the optimal number of centroids is nine.
Figures 8 and 9 show the boxplots of the performance metrics for each gate for the 75th and 90th percentiles, respectively. Among these metrics are the RMSE, the Spearman correlation index, and the coefficient of determination. For the 90th percentile, the GATE_BM shows the best performance. It is worth highlighting that, for extreme values, the GATE_BM is the most convenient choice, which is not the case for the 75th percentile, where other operators such as BMU, mean, and softmax are also suitable. Tables 4-7 report the performances of the models. Furthermore, Figures 10 and 11 show the behavior of the MLP-Station and the hybrid model for each monitoring station, respectively. The highest R 2 values of the model fit are observed at the El Bosque, La Florida, and Talagante monitoring stations.

Figure 8. Performance of the models to forecast the 75th percentile. The SOFTMAX gate shows the best performance.

Figure 9. Performance of the models to forecast the 90th percentile. The BMU-MAX gate shows the best performance.

Figure 10. Forecasting results obtained by the MLP-Station for each station.

Figure 11. Forecasting results obtained by the SOM-MLP with the BMU-MAX gate for each monitoring station.

Table 1. Levels of PM 2.5 according to international regulations.

Table 2. Summary statistics of PM 2.5 for each monitoring station.

Table 3. Performance of each model (mean ± SD): the global MLP model, the MLP constructed for each station, and the SOM-MLP hybrid model with 4, 9, and 25 neurons.

Table 4. Performance metrics obtained with different combiner operators for the 75th percentile. The reported results are the average and standard deviation of 20 runs.

Table 5. Performance metrics obtained with the models for the 75th percentile: the global MLP model, the MLP constructed for each station, and the SOM-MLP hybrid model. The best performance is achieved by the SOM-MLP hybrid model with the GATE_BMU.

Table 6. Performance metrics obtained with different combiner operators for the 90th percentile. The reported results are the average and standard deviation of 20 runs.

Table 7. Performance metrics obtained with the models for the 90th percentile: the global MLP model, the MLP constructed for each station, and the SOM-MLP hybrid model. The best performance is achieved by the SOM-MLP hybrid model with the GATE_BM.