Development of an AI Model to Measure Traffic Air Pollution from Multisensor and Weather Data

Gas multisensor devices offer an effective approach to monitoring air pollution, which has become a pervasive problem in many cities, especially because of transport emissions. To be reliable, models must be properly trained to combine sensor output with weather data; however, many factors can affect their accuracy. The main objective of this study was to explore the impact of several input variables on training different air quality indexes using fuzzy logic combined with two metaheuristic optimizations: simulated annealing (SA) and particle swarm optimization (PSO). In this work, the concentrations of NO2 and CO were predicted using five resistivities from multisensor devices and three weather variables (temperature, relative humidity, and absolute humidity). To validate the results, several measures were calculated, including the correlation coefficient and the mean absolute error. Overall, PSO was found to perform the best. Finally, the input resistivities from the NO2 and non-methane hydrocarbon (NMHC) sensors were found to be the most influential in predicting the concentrations of NO2 and CO.


Introduction
In the transport sector, fossil fuel-powered vehicles, such as motorcycles, cars, and buses, are major contributors to local air pollution [1]. Two particularly important compounds in air pollution are nitrogen oxides (NOx) and carbon monoxide (CO). On the one hand, primary NOx emissions are mostly in the form of nitric oxide (NO), which can react with ozone (O3) to form nitrogen dioxide (NO2). On the other hand, CO is produced by the incomplete combustion of fossil fuels, such as gasoline, natural gas, oil, coal, and wood. Emissions from transport vehicles are responsible for more than half of the NOx in the air and represent the largest anthropogenic source of CO [2,3]. In densely populated cities and industrialized areas, air quality has become an important measure of quality of life, as is the case in Vietnam. In fact, many studies have found that pollutants from vehicle exhaust can cause adverse impacts on nearly every organ in the body [4][5][6][7][8][9][10]. Controlling air quality (by controlling air pollution) is highly desirable to improve urban sustainability and quality of life [11], and it starts with measuring and forecasting air quality.
In the literature, two families of techniques are typically used to forecast pollutant concentrations or determine the factors that control NO2 and CO concentrations. The first family uses detailed atmospheric diffusion models, which take into account the physical and chemical equations that impact pollutant concentrations [12][13][14][15][16]. The second family applies statistical methods and leverages statistical models to capture the fundamental relationship between a set of input data (i.e., independent variables) and their targets (i.e., dependent variables) [17][18][19][20][21][22][23][24][25]. As an example, Shi and Harrison [26] developed a linear regression model to predict NOx and NO2 concentrations in London.
In parallel, low-cost gas multisensor technology can potentially revolutionize the research on air pollution by providing highly disaggregate spatiotemporal pollution data. These data can be utilized to supplement traditional pollution monitoring methods to help improve air pollution estimates and raise awareness about air pollution. Nonetheless, data quality and data processing remain an important concern, which hinders the adoption of these low-cost sensors. Indeed, unreliable sensors can easily provide erroneous data, which may then inform the wrong policies.
To partly address these concerns, artificial intelligence (AI) can offer an effective numerical approach to model complex and nonlinear relationships between a set of input data and targets, and it has been applied to many fields, from transport [27,28] to water resource engineering [29,30]. For air quality, artificial neural networks (ANN) can model nonlinear systems, and they have been successfully used to model sulfur dioxide concentrations in the industrial site of Priolo, Syracuse, Italy [31]. Comrie et al. [32] compared multilayer perceptron (MLP) models with more traditional regression models for ozone forecasting. Focusing on Central London (UK), Gardner and Dorling [33] developed an MLP model with hourly NOx and NO2 data as well as meteorological condition data and showed that the MLP outperformed the regression models developed by Shi and Harrison [26] using the same study site.
As the relationship between NO2, CO, and meteorology is complex and nonlinear, we developed two AI models to predict hourly NO2 and CO concentrations from readily observable local meteorological data. The two models were adaptive neuro-fuzzy inference system (ANFIS) optimized by particle swarm optimization (hereafter denoted as ANFIS-PSO) and ANFIS optimized by simulated annealing (hereafter denoted as ANFIS-SA). The main objective of this study was to explore the influence of input data on predicting different air quality indexes. The input parameters were divided into two main groups: (i) resistivities from multisensor devices, which included five inputs, and (ii) meteorological variables, including temperature, relative humidity, and absolute humidity. Furthermore, a sensitivity analysis was performed to determine the most important factors that affect air quality, specifically to identify the dominant links between the sensors and the pollutants. The data was collected in the center of a city in Italy between March 2004 and February 2005.

Adaptive Network-Based Fuzzy Inference System
The ANFIS algorithm combines fuzzy systems with neural networks. Jang [34] first proposed the algorithm and used it to investigate nonlinear systems. Generally, an ANFIS comprises five layers, each formed by a number of nodes and node functions [35]. In this study, we used the Takagi-Sugeno model, considered to be the most prominent fuzzy inference system model [36][37][38].
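To make the inference concrete, the following is a minimal single-input, first-order Takagi-Sugeno sketch (illustrative only; the paper's trained ANFIS has eight inputs and parameters fitted from data, and the function names here are ours):

```python
import math

def gauss(x, c, sigma):
    # Gaussian membership function, as used in an ANFIS antecedent layer
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

def sugeno_predict(x, rules):
    # First-order Takagi-Sugeno inference for a single input x.
    # Each rule is ((c, sigma), (p, r)): a Gaussian membership and a linear
    # consequent y = p*x + r. The output is the firing-strength-weighted
    # average of the rule consequents -- the computation an ANFIS
    # carries out across its five layers.
    weights = [gauss(x, c, s) for (c, s), _ in rules]
    outputs = [p * x + r for _, (p, r) in rules]
    total = sum(weights)
    return sum(w * o for w, o in zip(weights, outputs)) / total
```

With a single rule, the output reduces to that rule's linear consequent; with several overlapping rules, the model blends the local linear pieces smoothly.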

Particle Swarm Optimization
Since its introduction by Kennedy and Eberhart [39], PSO has become one of the most commonly used evolutionary methods for parameter optimization. The principle of PSO is based on the social and biological behaviors of animals seeking food. PSO starts with a random group of particles, where each particle represents a candidate solution to the problem. The position of each individual is affected by the positions of the other particles in the group: each individual adjusts its position in the search space based on the best location it has found so far and the best locations found by its neighbors. At every iteration step, the position of each particle is updated based on its current position and velocity [40].
Moreover, each particle moves randomly through the search space, but its trajectory is shaped by its own knowledge and that of its neighbors [41,42]. Therefore, the way a particle searches can be influenced by other particles in the swarm. This means that the particles learn and acquire knowledge from one another in a group and advance at the same rate as their best neighbors [41,42]. Combining regression modeling and PSO generally results in a high-performing model that is suitable for addressing classification and forecasting problems [41,42]. For more information on PSO, the reader is referred to [43][44][45].
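These update rules can be illustrated with a generic sketch (not the paper's ANFIS-PSO configuration; the inertia weight `w` and acceleration coefficients `c1`, `c2` are common textbook defaults, not values taken from the study):

```python
import random

def pso(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    # Minimal particle swarm optimizer. Each particle tracks its personal
    # best position; the swarm tracks a global best. Velocities blend
    # inertia, attraction to the personal best, and attraction to the
    # global best, with random scaling on each component.
    pos = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):  # maximum number of iterations as the stopping criterion
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In the study, the vector being optimized would hold the ANFIS consequent and antecedent parameters, with `f` returning the training error of the resulting model.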

Simulated Annealing
Although simulated annealing predates PSO, it remains a powerful tool for global optimization. Based on the similarity between a search algorithm and the process of annealing in metallurgy, the idea of simulated annealing first appeared in Metropolis et al. [46] as a simulation algorithm. Similar to a cooling process, the algorithm simulates a steady temperature decrease until the system converges to a stable state, thereby avoiding the defects introduced by cooling too quickly or too slowly. Search algorithms likewise focus on identifying good solutions early without ignoring better solutions that may be found later. Kirkpatrick et al. and Černý used Metropolis et al.'s idea and applied it to search for feasible solutions and converge to an optimal solution, which they termed "simulated annealing" [47][48][49].
Since then, the development of SA algorithms and their applications has generated a new field of study. While annealing is the process of first heating a solid and then cooling it down slowly, in simulated annealing, the temperature is treated as a variable that simulates this process. Specifically, the temperature is initially set high and is then allowed to "cool down" slowly. The initial heating essentially helps the search avoid becoming trapped in a local minimum. As the system cools down, its new structure becomes increasingly fixed, thus firmly setting its final properties. In the end, the free energy of the system is minimized, imitating how a minimum is reached during the annealing process, eventually resulting in an optimized solution [50,51]. For more information on SA, the reader is referred to [52,53].
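A minimal sketch of this scheme follows (generic, not the paper's ANFIS-SA setup; the Metropolis acceptance rule and geometric cooling schedule are the standard textbook choices, and the step size and temperatures are illustrative):

```python
import math
import random

def simulated_annealing(f, x0, step=0.5, t0=1.0, cooling=0.995, iters=2000):
    # Minimal simulated annealing for minimizing f. A worse candidate is
    # accepted with probability exp(-delta / T); the temperature T cools
    # geometrically, so the search explores early and settles late.
    x, fx = x0[:], f(x0)
    best, fbest = x[:], fx
    t = t0
    for _ in range(iters):
        cand = [xi + random.uniform(-step, step) for xi in x]
        fc = f(cand)
        delta = fc - fx
        if delta < 0 or random.random() < math.exp(-delta / t):
            x, fx = cand, fc           # accept the move
            if fx < fbest:
                best, fbest = x[:], fx  # track the best solution seen
        t *= cooling  # slow geometric cooling schedule
    return best, fbest
```

The acceptance of occasional uphill moves at high temperature is what lets the search escape local minima, mirroring the physical analogy described above.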

Model Validation
Model performance is primarily evaluated using three statistical measures: mean absolute error (MAE), root mean squared error (RMSE), and correlation coefficient (R). The value of R ranges from 0 to 1; a higher value of R (i.e., closer to 1) indicates better performance [54][55][56]. On the contrary, lower values of RMSE and MAE indicate better performance [57][58][59]. Mathematically, these three measures are defined as

MAE = (1/n) Σ |p_i − v_i|,

RMSE = √[(1/n) Σ (p_i − v_i)²],

R = Σ (p_i − q)(v_i − v) / √[Σ (p_i − q)² · Σ (v_i − v)²],

where the sums run over i = 1, …, n; n refers to the number of data points; p_i and q are the predicted and mean predicted values of the input data, respectively; and v_i and v are the individual values and mean values of the concentrations of NO2 and CO as atmospheric pollutants, respectively.
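These measures are straightforward to compute; the following sketch (plain Python, function names ours) follows the symbols used in the text:

```python
import math

def mae(pred, actual):
    # Mean absolute error over n data points
    return sum(abs(p - v) for p, v in zip(pred, actual)) / len(pred)

def rmse(pred, actual):
    # Root mean squared error over n data points
    return math.sqrt(sum((p - v) ** 2 for p, v in zip(pred, actual)) / len(pred))

def corr(pred, actual):
    # Correlation coefficient R between predictions p_i and observations v_i
    n = len(pred)
    q = sum(pred) / n        # mean predicted value
    vbar = sum(actual) / n   # mean observed value
    num = sum((p - q) * (v - vbar) for p, v in zip(pred, actual))
    den = math.sqrt(sum((p - q) ** 2 for p in pred)
                    * sum((v - vbar) ** 2 for v in actual))
    return num / den
```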

Dataset
While air quality data is abundant, large multivariable datasets for developing models are not. In this work, we used data collected between March 2004 and February 2005 in the center of an unnamed, polluted Italian city with heavy traffic, mainly by cars [60,61]; the data is available in open access from the University of California, Irvine (UCI) machine learning repository. While the original dataset contained 9357 records, one analyzer was out of service, and the corresponding data had to be removed. A multisensor device was used to provide hourly averages of the resistivity expressed by the CO-, NOx-, O3-, and NO2-specific metal oxide (MOX) chemiresistors and a non-methane hydrocarbon (NMHC)-targeted MOX sensor [60,61]. The multisensor device also contained sensors to capture the temperature as well as the relative and absolute humidity. In the end, the input parameters contained 6941 responses from the eight inputs previously mentioned. In parallel, five conventional fixed stations provided reference concentration estimations for CO (mg/m³), NMHC (µg/m³), benzene (C6H6) (µg/m³), NOx (ppb), and NO2 (µg/m³). These results were considered the outputs of the problem and were recorded hourly by averaging the concentration values. While the original dataset had five outputs, we focused on estimating only the concentrations of NO2 and CO. Table 1 shows the summary statistics of all the variables used in this study.
The correlations between the inputs and concentrations of NO 2 and CO are plotted in Figure 1; both plots and linear correlation coefficients are shown. As Figure 1 clearly shows, some of the variables were significantly correlated. In particular, most of the sensor variables were correlated, although not in a strictly linear fashion. In this work, all variables were included to increase the accuracy of the final models developed.
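For readers who want to reproduce the preprocessing: the UCI export of this dataset is commonly distributed as a semicolon-separated file with decimal commas and −200 flagging missing readings (assumptions worth verifying against the repository); a minimal stdlib parsing sketch:

```python
import csv
import io

def load_air_quality(text):
    # Parse a semicolon-separated export with decimal commas.
    # Rows containing -200 (the dataset's missing-value flag) are dropped,
    # mirroring the removal of records from the out-of-service analyzer.
    reader = csv.reader(io.StringIO(text), delimiter=';')
    header = next(reader)
    rows = []
    for rec in reader:
        try:
            values = [float(v.replace(',', '.')) for v in rec]
        except ValueError:
            continue  # skip malformed or empty trailing rows
        if -200.0 in values:
            continue  # -200 flags a failed sensor reading
        rows.append(values)
    return header, rows
```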

The training dataset was scaled into the [−1, 1] range, as is common in machine learning, to better follow the non-Gaussian distribution of variables. The scaling process of a variable x is expressed by Equation (4), and it involves two parameters, α and β, shown in Table 1; essentially, α is the minimum value of the dataset, and β is the maximum value. The same scaling procedure (with the same α and β) was applied to the testing set as well.
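Assuming Equation (4) is the standard linear min-max mapping to [−1, 1] (the equation itself is not reproduced in the text), the scaling can be sketched as:

```python
def scale(x, alpha, beta):
    # Min-max scaling of x into [-1, 1]; alpha is the dataset minimum and
    # beta the maximum (Table 1). The same alpha and beta learned from the
    # training set must be reused when scaling the testing set.
    return 2.0 * (x - alpha) / (beta - alpha) - 1.0
```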

Optimization Procedure
In this section, the optimization of ANFIS using SA and PSO is detailed. First, we note that there were 250 consequent and antecedent ANFIS parameters to be optimized, corresponding to an eight-dimensional input space. The parameters of ANFIS were generated using C-means clustering. In this work, both the dimensionality of the input space and the computation time were considered when choosing the parameters of SA and PSO, especially the population size and the maximum number of iterations. The maximum number of iterations was chosen as the stopping criterion. Tables 2 and 3 show the final parameters selected for SA and PSO, respectively, obtained through a rigorous trial and error process [59,62]. Moreover, the optimization curves are presented in Figure 2 for the concentration of NO2 and in Figure 3 for the concentration of CO.

Model Performance
The performance of the two models developed is summarized in Table 4. In addition to MAE, RMSE, and R, a straight line was fitted to the predicted vs. actual plots shown in Figures 4 and 5. The slope of the linear fit was then used to measure the angle between the x-axis and the linear fit, with angles closer to 45° indicating better performance. Figure 4a,c shows the prediction capability between the scaled predicted and actual values of NO2 concentration on the training set for ANFIS-SA and ANFIS-PSO, respectively. Figure 4b,d shows the same information but applied to the testing set. From the figures and Table 4, ANFIS-PSO performed slightly better than ANFIS-SA for the NO2 concentration.

With regard to the concentration of CO, Figure 5a,c shows the prediction capability of ANFIS-SA and ANFIS-PSO, respectively, using the training dataset. Figure 5b,d shows the same information but applied to the testing set. For the training set, ANFIS-SA and ANFIS-PSO produced slope angles of 37.73° and 39.51°, respectively. For the testing set, ANFIS-SA and ANFIS-PSO generated slope angles of 37.65° and 39.16°, respectively. ANFIS-PSO therefore performed slightly better than ANFIS-SA. The three other measures support similar conclusions.

Figure 6b,d shows the histograms of the two models for NO2 concentration. We can see that ANFIS-PSO had a higher peak of error concentration around 0 than ANFIS-SA. A similar pattern can be observed for the concentration of CO. Moreover, Table 4 shows that the R values tended to be higher for ANFIS-PSO, and the MAE and RMSE values tended to be lower for ANFIS-PSO.

In conclusion, although both models performed well and were statistically significant, ANFIS-PSO was shown to be slightly superior to ANFIS-SA in modeling CO and NO2 concentrations.

Sensitivity Analysis
Predicting air quality is complex as the relationships between the input and target variables are nonlinear. In this section, a sensitivity analysis of the input variables on the predicted results is discussed. In the literature, this type of analysis has been successfully applied to quantify the sensitivity level of input parameters in AI models. For instance, Ly et al. [63] used sensitivity analysis to study the influence of input parameters such as bubble radius, viscosity, and saturation for a problem related to the 3D selective laser sintering process in predicting bubble dissolution time.
The main idea is to vary one input variable at a time while keeping the others at their median values. The method therefore allows us to quantify how sensitive the model is to each individual input parameter. Specifically, using the AI prediction models developed previously, a new eight-dimensional input space was constructed based on the probability density distribution of each variable. Here, the value of each input variable was recorded at the following percentiles: 0, 10, 25, 50, 75, 90, and 100. One input variable was then selected, and the model was run seven times, once for each of the seven percentile values. Each time, the other variables were kept at their median values (i.e., the 50th percentile). Essentially, the method provides quantitative information on the deviation (i.e., change) of an output when varying the input variables.
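This one-at-a-time procedure can be sketched as follows (a generic implementation, not the paper's code; `model`, `percentile_values`, and `medians` are illustrative names):

```python
def one_at_a_time_sensitivity(model, percentile_values, medians):
    # Vary one input across its recorded percentiles while holding the
    # others at their medians, and report the relative deviation of the
    # model output from the all-median reference output.
    # `percentile_values[j]` lists the values of input j at the
    # 0/10/25/50/75/90/100 percentiles; `medians[j]` is its median.
    o_ref = model(medians)  # output of the reference configuration
    sensitivities = []
    for j, levels in enumerate(percentile_values):
        deviations = []
        for value in levels:
            x = medians[:]
            x[j] = value  # perturb only the jth input
            deviations.append((model(x) - o_ref) / o_ref)
        sensitivities.append(deviations)
    return sensitivities
```

Summing the absolute deviations per input then ranks the inputs by overall influence, as done for Figure 9.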
In this study, the deviation in the output solution, or level of sensitivity δ_i^j, for the jth input variable was expressed as follows:

δ_i^j = (O_i^j − O_ref) / O_ref,

where O_ref is the output of the reference configuration, and O_i^j is the output using the jth input variable at its ith percentile. Finally, the global percentage of sensitivity of each input was computed by summing the absolute values of δ_i^j over the percentiles and normalizing by the total over all inputs:

S_j = [Σ_i |δ_i^j| / Σ_j Σ_i |δ_i^j|] × 100.

Table 5 summarizes the values of each input at its seven percentiles, whereas Table 6 summarizes the output solution of the developed AI models corresponding to each percentile. Sensitivity, as a function of the percentile, is plotted in Figure 7 for NO2 and in Figure 8 for CO. We can see that, for NO2, the input parameters X2 (sensor NMHC) and X4 (sensor NO2) had the most important influence on the predicted results, both for ANFIS-SA and ANFIS-PSO. In addition, the other input parameters had a low impact on the predicted results compared to the NMHC and NO2 sensors (which was expected, as NO2 concentration was measured, thus also partly validating the accuracy of the models developed).
In terms of CO concentration, the sensitivity levels of the input parameters fluctuated significantly more; their levels of sensitivity can also be consulted in Table 6. Similar to the NO2 concentration, NMHC also had the most important impact in terms of sensitivity. For CO, the CO, O3, NOx, and NO2 sensors were also found to have a significant impact. It is also worth noting that the input variables X6 (temperature) and X7 (relative humidity) had the lowest impact on the predicted results.
In conclusion, from the sensitivity analysis, the NMHC and NO2 sensors were the most important parameters in the input space. This means that excluding one of them from the input space would impact the accuracy of the model. It is interesting to note that, using a dataset with 9357 records, De Vito et al. [61] made similar observations.
In their work, to estimate NO2 concentration, the best results came from the use of all sensors. In other words, omitting the NMHC or NO2 sensors led to lower performance. Interestingly, this was not the case for the CO concentration model. In fact, De Vito et al. [61] found that coupling the CO sensor with NMHC gave the best performance and that including the NO2 sensor actually led to lower performance. This phenomenon might be a result of the size of the dataset, with 6941 data points in our study compared with 9357 records in the case of De Vito et al. [61].
The total percentage of sensitivity, calculated by summing all levels of sensitivity for each input variable (in absolute values), is presented in Figure 9a for NO2 concentration and Figure 9b for CO concentration. The NMHC and NO2 sensors appeared as the most important variables to predict both NO2 and CO concentrations.

Conclusions
Predicting air quality accurately is paramount in many cities around the world that are suffering from chronic and severe air pollution problems, notably linked to emissions from fossil fuel-powered transport vehicles. The main goal of this study was to develop an AI model that can reliably predict hourly NO2 and CO concentrations from gas multisensor and local weather data. A total of eight input variables were used, consisting of five sensor variables and three weather variables. Moreover, two AI models were trained and tested, namely, ANFIS-PSO and ANFIS-SA.
First, the technical details of the two models and the dataset were introduced and discussed. The results showed that both models performed well and were statistically significant but that ANFIS-PSO performed slightly better. To further investigate the role of each individual input variable in the models developed, a detailed sensitivity analysis was carried out. It was found that the NMHC and NO2 sensors particularly affected the sensitivity of both the NO2 and CO concentration models. The CO concentration model was shown to be generally more sensitive to all variables. Nonetheless, the three weather variables did not overly affect the accuracy of the model.
Overall, accurately modeling air quality is paramount as the health of millions of people is affected by poor air quality. We have shown that combining multioutput sensor data with advanced AI techniques offers a powerful avenue, especially to model nonlinear processes such as air quality, as was done in this study. Thanks to the collection of new and larger datasets, future work should focus on developing new techniques that can analyze the problem as time series to further improve prediction performance, possibly as done in [64][65][66]. Finally, interested readers are recommended to consider cross-interference, sensitivity, and response time of sensors [67] in AI models developed to predict air quality.