Robust Estimation of Carbon Monoxide Measurements

This paper presents a robust analysis of carbon monoxide (CO) concentration measurements conducted at the Belisario air-quality monitoring station (Quito, Ecuador). For the analysis, the data collected from 1 January 2008 to 31 December 2019 were considered. Additionally, each of the twelve years analyzed was considered as a random variable, and robust location and scale estimators were used to estimate the central tendency and dispersion of the data. Furthermore, classic, nonparametric, bootstrap, and robust confidence intervals were used to group the variables into categories. Then, differences between categories were quantified using confidence intervals and it was shown that the trend of CO concentration at the Belisario station in the last twelve years is downward. The latter was proven with the precision provided by both nonparametric and robust statistical methods. The results of the research work robustly proved that the CO concentration at Belisario station in the last twelve years is not considered a health risk, according to the criteria established by the Quito Air Quality Index.


Introduction
Carbon monoxide (CO) is a colorless and odorless gas that when found in the air in large concentrations is harmful to both humans and animals. This gas is produced when fossil fuels are burned and, therefore, internal combustion engines and vehicles or machinery whose operating principle is based on burning fossil fuels are among the greatest sources of CO, which is an air pollutant of concern worldwide [1]. Additionally, gasoline-powered pressure washers, propane-powered forklifts, propane-powered resurfacing machines, and gasoline-powered appliances, among others, can cause CO poisoning when are not used correctly in some applications [2]. The most important source of CO is motor vehicle exhaust [3]. However, catalytic convertors have reduced automobile exhaust emissions of CO [4].
In addition, detonation of explosives employed in blasting can produce CO and people living near a blast site can be affected [2]. Furthermore, smoke-polluted environments and hookah smoking are also sources of CO exposure [2].
CO is a silent killer [4] and the type of CO poisoning that a human being can suffer will vary depending on the level of CO concentration to which they are exposed and the length of time that such exposure lasts. For example, a human being who is exposed to high levels of CO concentration for a long time may go into a coma or even die [4]. Moreover, the most common symptoms of CO poisoning include the following: headaches, dizziness, vomiting, nausea, dyspnea, chest pain, strokes and to cardiovascular and coronary heart diseases. Moreover, in [17] Bayesian hierarchical models were used to obtain national and regional average associations.
Furthermore, the robustness of the effects that CO poisoning has on cardiovascular mortality was evaluated in [17] using fitted two-pollutant models. However, the concept of robustness introduced in [17] was not in the sense of [7][8][9]. Specifically, the authors of [17] said that that the association between CO and mortality was robust if the significance of the predictor variable in a meta-regression model was very little.
An uncertainty analysis of CO measurements performed at the Izaña mountain station (Tenerife, Spain) was developed in [18]. Additionally, time series analysis was used to study the daily nighttime mean of CO concentration and, in order to perform the study, a least-squares fitting to a nonlinear function was carried out. This function consisted of a quadratic year-on-year component plus four Fourier harmonics that represented an annual cycle.
Sometimes observations of CO concentration do not follow a Gaussian distribution. Therefore, these observations cannot be analyzed using classical statistical inference methods. This happens in the present research work. But, it has also happened in research works carried out by other authors. For example, in [19] a statistical analysis of measurements of gas emissions from gasoline-powered vehicles in Irbid Directorate (Jordan), was carried out. In that paper, in order to analyze vehicle emissions of CO and other pollutants, 1000 vehicles were tested. In summary, the study performed in [19] was aimed at determining whether there were significant differences in the mean value of several emissions of pollutants which came from vehicles with different characteristics. With the purpose of conducting the above-mentioned study, nonparametric tests such as the Kruskal-Wallis test and the Mann-Whitney U test [20,21] were used [19]. Other examples of recent publications in which nonparametric tests have been used to analyze measurements of air pollution variables can be found in [22][23][24][25][26][27][28][29].
In the present paper, robust statistics [7][8][9] is used to analyze 12 years of measurement results of CO concentration at Belisario station [10], which is one of the most important stations of Quito Metropolitan Atmospheric Monitoring Network (QMAMN) [30]. QMAMN is part of the Ministry of the Environment of Ecuador and in Quito this network has nine air-quality monitoring stations, which are located in very important parts of the city.
The statistical analysis of different variables of air pollution in Quito was also carried out superficially in [30]. In fact, in [30] a robust analysis of the air pollution variables was not carried out, and the statistical tools used to analyze CO concentration were only the mean and maximum values. Therefore, it is necessary to complete what appears in [30] with a formal and rigorous study of the numerical results of the measurements of CO concentration levels in Quito. In this sense, the research work presented here could serve as a reference material to comprehensively analyze the results of the CO concentration measurements that have been carried out at the Belisario air quality monitoring station, in the last twelve years.
The objectives of this paper are the following: (1) Construct sets of variables that represent the 12 years of CO concentration measurements under study, the months of the year, and the hours of the day, to determine statistical parameters that establish similarities and differences between the elements of these sets of variables. (2) Classify CO concentration measurements by using different methods of estimating the central tendency and dispersion of the data. Specifically, classic, nonparametric, resampling, and robust methods are used. (3) Categorize and discriminate CO concentration measurements using confidence intervals.
These confidence intervals are constructed at the 95% confidence level and are of the following types: classic, nonparametric, bootstrap, and robust confidence intervals. Previous research papers that have been entirely focused on the use of robust statistics to analyze the behavior of air pollution variables are those shown in [31][32][33]. In addition, other research papers in which robust estimators have been used to analyze air pollution variables are [34,35].
The rest of the paper is organized as follows: Section 2 gives information about the study site and shows summary statistics. Section 3 is devoted to carrying out the data analysis by using nonparametric statistical inference techniques. Section 4 is aimed at performing the robust estimation of the CO concentration measurements. The aim of Section 5 is to perform a discussion of the results. Finally, the conclusions are given in Section 6.

Study Site and Summary Statistics
The study site was the Belisario station and information about this monitoring station can be found in [10,30]. According to [30], the data were collected using CO analyzers from Thermo Fisher Scientific, model 48i [36], which is a reference-level instrument that serves as a measurement standard in many countries (e.g., it is designated as a Federal Equivalent Method by the US EPA).
In this paper, each data represents a CO concentration value for each hour and said data are the result of the arithmetic mean of the CO concentrations that have been measured every 10 min of the corresponding hour represented by the data [30]. According to [37], in order to calculate the averages, 75% of the valid records were covered.
For the analysis, the data collected from 1 January 2008 to 31 December 2019 was considered, and the results of the analysis carried out refer to most of the data collected. Here, it will be analyzed whether the oscillations of the measurements are due to random variations or they indicate that the measurements are different from each other. The aforementioned will be carried out using nonparametric and robust statistics tools.
Since the data collected begins on 1 January 2008, with a sampling rate of one hour, and refers to a full 12 years, this would mean 105,193 data. However, since some data does not appear, others have negative values and one has an exceptionally high value compared to the rest, the analysis has been carried out with more than 96% of all the data; that is, only less than 4% of the total data has been lost. Negative values were removed because they cannot be valid. Nevertheless, the values equal to zero were taken into account, because these could represent valid measurements that were carried out at certain time instants. On the other hand, there was an excessively large point value that was also removed, because it was clearly seen that it could not be valid and that it also had no relationship with the rest of the values of the data set.
In this research work, there were no data scarcity problems and the time instants corresponding to missing information were not taken into account, because the robust analysis was carried out based on the information that was actually provided by the measurement instrument without need to perform any kind of interpolation, which represents one of the strengths of robust statistical inference.
With the data available, consisting of year, month, day, hour, and amount of CO concentration in milligrams per cubic meter (mg/m 3 ), the analysis and interpretation of these will be carried out with the aim of finding relationships between said data. The variables under analysis are X k , k = 1, . . . , 12, which are the CO concentrations in 2008, 2009, and so on until 2019. That is X 1 = 2008, X 2 = 2009, . . . , X 12 = 2019. Figure 1 shows the box plot diagram of the variable CO classified by years, and Figures 2 and 3 show three graphs of moving averages (MAs), one graph shows the MA of all the years and two others show the MA of half of the years. This smoothing technique is used in time series studies [38,39] and will be used here to analyze the trends of the variables. Although there are different types of MA smoothing, the simplest will be used in this paper. This type of smoothing by MA consists of the following: Given a value m less than the total number of data, the mean of the data set x h , x h−1 , . . . , x h−m+1 is found for each h ≥ m. In this way, each data loses its individual influence, although m − 1 observations are lost. In this paper, the MA of size 720 has been considered, since 720 is the number of data that would be in a full 30-day month.        The boxplot and moving average graphs shown in Figures 1-3 show that all variables (years) appear to behave similarly to each other except in 2009. In addition, Figure 1 shows that the number of observations that are extremely high, compared to available values, decrease as the years pass. Moreover, Figure 2 indicates a trend to decrease the CO concentration continuously as time passes.
In order to provide information that quickly supplies a simple description of the measurement results, Table 1 shows a statistical summary of the data. From Table 1, it can be seen that for each year there are approximately between 94% and 97% of all possible data. Also, this table shows that for all the variables the mean is higher than the median, that the skewness is positive and that the kurtosis is higher than 5, reaching values higher than 7 in some years. The aforementioned indicates that it is very likely that all the variables under study come from heavy-tailed distributions [8,40], because, based on the information provided in Table 1, the medians are less than the means, the skewness are greater than zero, the kurtosis they are greater than 3, and it is observed that the values of the standard deviations are not small when compared with the values of the means. Furthermore, from Figure 1 it can be seen that there are many outliers.
This idea is confirmed with the boxplot graph shown in Figure 1, where abnormally high observations are presented every year. Therefore, these observations do not come from Gaussian The boxplot and moving average graphs shown in Figures 1-3 show that all variables (years) appear to behave similarly to each other except in 2009. In addition, Figure 1 shows that the number of observations that are extremely high, compared to available values, decrease as the years pass. Moreover, Figure 2 indicates a trend to decrease the CO concentration continuously as time passes.
In order to provide information that quickly supplies a simple description of the measurement results, Table 1 shows a statistical summary of the data. From Table 1, it can be seen that for each year there are approximately between 94% and 97% of all possible data. Also, this table shows that for all the variables the mean is higher than the median, that the skewness is positive and that the kurtosis is higher than 5, reaching values higher than 7 in some years. The aforementioned indicates that it is very likely that all the variables under study come from heavy-tailed distributions [8,40], because, based on the information provided in Table 1, the medians are less than the means, the skewness are greater than zero, the kurtosis they are greater than 3, and it is observed that the values of the standard deviations are not small when compared with the values of the means. Furthermore, from Figure 1 it can be seen that there are many outliers.
This idea is confirmed with the boxplot graph shown in Figure 1, where abnormally high observations are presented every year. Therefore, these observations do not come from Gaussian variables [41]. Furthermore, none of the observations exceeds the desirable level of air pollution that is established by the Quito Air Quality Index (QAQI) [30]. QAQI establishes that the maximum value of the desirable level of air pollution is equal to 5 mg/m 3 . In any case, CO concentrations below 5 mg/m 3 may be considered safe or low risk for human beings. Therefore, for the case under study, it can be said that the CO concentration at Belisario station is not considered a health risk.
Finally, due to the fact that there are many observations for each of the variables, the first thing that was done was to try to carry out the statistical analysis using classical inference techniques. Therefore, attempts were made to implement different variable transformations that allowed the variables under study to fit a normal distribution [41]. In this sense, the following variable transformations were performed: sum of constants, logarithms, operations of taking nth roots, and inverse functions, among others. However, the results were not as expected, because it was not possible to adequately fit the data for one year to known random variables that were not heavy tails, and a fundamental characteristic of heavy-tailed distributions is that the central limit theorem does not work for them. Therefore, there was no way to fit any of the variables to a normal distribution. In fact, the settings that at some point seemed visually appropriate had P-values [21] less than 0.005. All this justified the use of nonparametric statistics and robust statistics in this research paper.

Nonparametric Statistical Inference
This section is aimed at knowing whether the samples of the variables came from the same population and had a common median. To do this, a comparison was made between all the variables aimed at testing whether the differences between the medians were due to the variability of measurements or due to random causes. With respect to the aforementioned, the variability of the observations could be produced by particular characteristics of the instants of time in which the measurements were conducted. However, random causes could be produced by weather conditions or noise introduced by measuring instruments, among other things.
In this paper, observations were made on different groups of variables and these variables were considered to be linearly independent, because the linear correlations between the variables were close to zero. In other words, the linear dependence between the variables was not strong. However, it is important to mention that in this research work the existence of nonlinear dependencies between variables was not studied, because this is out of the scope of the paper.
In this paper, in order to study whether the distributions of the variables were the same or not, the Wilcoxon rank sum test [20,21] was used to test whether the data collected in the variables under study comes from distributions with equal medians, as was also done in [22,23,31].
To carry out the hypothesis test, the null hypothesis was considered to be H 0 : Median = M 0 , and the alternative hypothesis was H 1 : Median M 0 . Therefore, if the null hypothesis is assumed to be true and also that the quantities observed during all the years are stable, then half of the observations of each year will be less than M 0 and the rest of the observations will be greater than that amount. Here, the significance level was α = 0.05 and the confidence level was (1 − α). Lastly, the nonparametric bilateral confidence intervals for the median were calculated as in [31,33].
The limits of the confidence intervals found in this paper are shown in Table 2, being the confidence level equal to 95%. Furthermore, the graphs of the confidence intervals that were found are shown in Figure 4.
From the information provided in Table 2 and Figure 4, it can be seen again that the amount of CO concentration per year at Belisario station tends to decrease, because as the median decreases the interval shifts to lower values. At this point, it is important to mention that the lengths of the intervals are very small due to the large number of samples available.   From the information provided in Table 2 and Figure 4, it can be seen again that the amount of CO concentration per year at Belisario station tends to decrease, because as the median decreases the interval shifts to lower values. At this point, it is important to mention that the lengths of the intervals are very small due to the large number of samples available.
In addition, once the Wilcoxon rank sum test was performed, the following was verified: (1) The medians of the variables and are homogeneous.  In addition, once the Wilcoxon rank sum test was performed, the following was verified: (1) The medians of the variables X 1 and X 3 are homogeneous.
(2) The medians of the variables X 6 , X 8 and X 11 are homogeneous.
(3) The medians of the variables X 10 and X 12 are homogeneous. (4) The medians of the variables X 2 , X 4 , X 5 , X 7 and X 9 do not coincide with any other. Therefore, the amount of CO concentration per year can be grouped into four categories, which are indicated in Figure 4, separated by the black horizontal dashed lines. Specifically, the years 2008 (X 1 ) and 2010 (X 3 ) are in one category, the years 2009 (X 2 ), 2011 (X 4 ), 2012 (X 5 ), 2014 (X 7 ) and 2016 (X 9 ) are in another category, the years 2013 (X 6 ), 2015 (X 8 ) and 2018 (X 11 ) are in a third category, and the years 2017 (X 10 ) and 2019 (X 12 ) are in the fourth category.
Before concluding this section, it is important to mention that the fact that the CO concentration has been decreasing over the years could be explained by the environmental policies that have been carried out in the city of Quito in recent years. These results could indicate that these policies, among other things, could be part of the reasons why better results have been obtained.

Robust Estimation
In this paper, robust methods [7][8][9] were used to carry out the estimation of the central tendency and dispersion of the data in such a way that the results of the analysis were not affected by extreme values [31][32][33].
A useful technique for characterizing robust statistics is the influence curve [42]. This technique aims to measure the influence that an observation has against all other observations. In fact, if the estimators are not robust, then it may happen that the influence curves are not bounded. Therefore, when this happens, the estimator can be greatly affected by an observation that is very far from the rest of the data. With robust estimators, the influence curves are bounded and the estimators are practically insensitive to observations that deviate from the data set.
In this paper, robust estimators were applied to sample order statistics [21]. In short, the ordered sample of X 1 , . . . , X n is given by where the observations with the lowest value and the highest value are X (1) and X (n) , respectively.

Central Tendency Estimators
According to [7][8][9], the location statistics are used to indicate around which value most of the data, with which it is intended to obtain deductions, are grouped to determine the center of the distributions. In this paper, in addition to the mean and median, other statistics will be used.
The estimates of the above-mentioned statistics are shown in Table 3. In addition, this table also shows the following estimates: 0.2-trimmed mean, 0.3-trimmed mean, 0.2-winsorized mean, and 0.3-winsorized mean. Furthermore, Figure 5 shows classic and robust statistics of the variables, which correspond to those shown in Table 3. Figure 5 shows that there is a pronounced decrease from 2008 to 2012 and, from 2012 onwards, a stabilization is observed in all the estimates found, with a slight decrease. Note that, in general, all measures of centrality for each year fluctuate between the median and the mean. Table 3. Point estimates of location.

Scale Estimators
In this paper, the variability of the data is going to be formalized through scale estimators. In accordance with [8], any statistic satisfying both the shift invariance condition and the scale equivariance condition is a dispersion estimate. The scale estimators that will be used in this paper are the following: (1) Sample standard deviation ( ) [7,8].  Table 4. Furthermore, Figure 6 shows the graphical representation of the point estimates of scale of the variables, which correspond to those included in Table 4.

Scale Estimators
In this paper, the variability of the data is going to be formalized through scale estimators. In accordance with [8], any statistic satisfying both the shift invariance condition and the scale equivariance condition is a dispersion estimate. The scale estimators that will be used in this paper are the following: (1) Sample standard deviation (s x ) [7,8].
The point estimates of scale are shown in Table 4. Furthermore, Figure 6 shows the graphical representation of the point estimates of scale of the variables, which correspond to those included in Table 4.   In Figure 6, it can be seen that all the estimates are upper bounded by the standard deviation and lower bounded by the point estimator least median of squares. In addition, it is observed that the estimators of scale based on the Andrew's wave and the biweight are very similar to each other, as was the case with the estimators of location based on the Andrew's wave and the biweight. Moreover, there is a slight drop in the value of all the estimates from 2008 to 2012 and then they stabilize, which could indicate that the increase in the amount of CO concentration produced an increase in its variability, since that the lower limit is always zero.

Confidence Intervals
In this section, following the methodology used in [32] and suggested in [7,8], the confidence intervals were established with a location parameter, a scale parameter, and a constant related to the Student's t distribution. Furthermore, said constant was selected following the indications given in [46,47]. In what follows, , means the q-th quantile of the Student's t distribution with degrees of freedom (DOF). In this paper, the estimators shown in Sections 4.1 and 4.2 were used to build confidence intervals. The pair of estimators were as follows [32,33]: .  In Figure 6, it can be seen that all the estimates are upper bounded by the standard deviation and lower bounded by the point estimator least median of squares. In addition, it is observed that the estimators of scale based on the Andrew's wave and the biweight are very similar to each other, as was the case with the estimators of location based on the Andrew's wave and the biweight. Moreover, there is a slight drop in the value of all the estimates from 2008 to 2012 and then they stabilize, which could indicate that the increase in the amount of CO concentration produced an increase in its variability, since that the lower limit is always zero.

Confidence Intervals
In this section, following the methodology used in [32] and suggested in [7,8], the confidence intervals were established with a location parameter, a scale parameter, and a constant related to the Student's t distribution. Furthermore, said constant was selected following the indications given in [46,47]. In what follows, t ν,q means the q-th quantile of the Student's t distribution with ν degrees of freedom (DOF). In this paper, the estimators shown in Sections 4.1 and 4.2 were used to build confidence intervals. The pair of estimators were as follows [32,33]: (1) X, s x , where X is the mean. (4) T(α), s W (α) .
In addition, confidence intervals based on a bootstrap method were built [9,32,33]. With all of the above, eight confidence intervals were constructed for each of the twelve variables: one classic, one nonparametric, one using the bootstrap method, and five robust. In Figures 7-9, these intervals are shown for three of the twelve variables that have been analyzed, specifically those corresponding to the leap years included in the study. Showing more figures would not provide relevant information.
Sensors 2020, 20, x FOR PEER REVIEW 12 of 23 In addition, confidence intervals based on a bootstrap method were built [9,32,33]. With all of the above, eight confidence intervals were constructed for each of the twelve variables: one classic, one nonparametric, one using the bootstrap method, and five robust. In Figures 7-9, these intervals are shown for three of the twelve variables that have been analyzed, specifically those corresponding to the leap years included in the study. Showing more figures would not provide relevant information.   In addition, confidence intervals based on a bootstrap method were built [9,32,33]. With all of the above, eight confidence intervals were constructed for each of the twelve variables: one classic, one nonparametric, one using the bootstrap method, and five robust. In Figures 7-9, these intervals are shown for three of the twelve variables that have been analyzed, specifically those corresponding to the leap years included in the study. Showing more figures would not provide relevant information.    Figures 7-9 show that, in general terms, the variables present similar characteristics regarding the confidence intervals. For example, it can be seen that the classic confidence intervals are the ones that are pushed furthest towards high values, while the median-based confidence intervals are those that are shifted towards the lowest values. Note that this result is consistent with what was said in Section 2, in that it is very likely that the distributions of the variables are heavy-tailed distributions.   9 show that, in general terms, the variables present similar characteristics regarding the confidence intervals. For example, it can be seen that the classic confidence intervals are the ones that are pushed furthest towards high values, while the median-based confidence intervals are those that are shifted towards the lowest values. Note that this result is consistent with what was said in Section 2, in that it is very likely that the distributions of the variables are heavy-tailed distributions.
Furthermore, Figures 7-9 show that among the median-based confidence intervals, the nonparametric intervals and the bootstrap-based intervals are very similar. Also, these figures reflect that the intervals based on the median and the median absolute deviation are the narrowest.
With respect to the confidence intervals based on Andrew's wave and biweight, it can be said that these have similar characteristics in all the variables and that they are located between values that are to the right of the intervals based on the median and to the left of the intervals based on 0.2-trimmed mean.
Finally, Figures 7-9 also show that in all the variables the intervals based on the 0.2-trimmed mean location estimators are the second most displaced towards high values, being these intervals those that are closest to the classic intervals.
Due to all the above, the confidence intervals based on the estimators T(α), s W (α) and (T bi , s bi ) were used to compare the given variables. The reasons for this decision are as follows: first, the classic intervals are unfounded because the underlying distribution is assumed to be approximately normal, which is not true; second, the results obtained with bootstrap estimators and with the point estimators (M e , MAD) and (M e , IQR) are analogous to the results obtained with the nonparametric estimators seen in Section 3; and, third, the results obtained with the estimators based on the Andrew's wave and on the biweight are similar, so either of the two estimators could have been chosen. Table 5, similar to Table 2, includes the limits of the confidence intervals, with a confidence level of 95%, and their length for the estimators T(α), s W (α) and (T bi (c), s bi (c)), for α = 0.2 and c = 9.   (9), s bi (9)) 0.5838 0.5975 0.0137 The above-mentioned confidence intervals are shown in Figures 10 and 11. In addition, lines have been included in this figure to try to classify the variables, analogous to the classification provided by the Wilcoxon rank sum test for the medians in Section 3. With the biweight estimators, the classification of the variables is similar to that obtained with nonparametric estimators, the only difference is that variable X 2 is grouped with variables X 1 and X 3 . The above-mentioned confidence intervals are shown in Figures 10 and 11. In addition, lines have been included in this figure to try to classify the variables, analogous to the classification provided by the Wilcoxon rank sum test for the medians in Section 3. With the biweight estimators, the classification of the variables is similar to that obtained with nonparametric estimators, the only difference is that variable is grouped with variables and .   The first observation that is made is that between 2008 and 2012 the tendency to lower CO concentration values is notable, and that from 2012 to 2019 there are fluctuations with a slight downward trend. Regarding the amplitudes, it can be concluded that the confidence intervals found with biweight estimators are narrower than the confidence intervals found with -trimmed mean and Winsorized standard deviation.

Additional Confidence Intervals
In view of the results found in Section 4.3, it was decided to analyze the same data but with two different groupings. Specifically, variables , … , have been defined as the amount of CO concentration in each of the months of the year, with being the amount of CO concentration in the month of January of all years, the amount of CO concentration in the month of February of all The first observation that is made is that between 2008 and 2012 the tendency to lower CO concentration values is notable, and that from 2012 to 2019 there are fluctuations with a slight downward trend. Regarding the amplitudes, it can be concluded that the confidence intervals found with biweight estimators are narrower than the confidence intervals found with α-trimmed mean and Winsorized standard deviation.

Additional Confidence Intervals
In view of the results found in Section 4.3, it was decided to analyze the same data but with two different groupings. Specifically, variables Y 1 , . . . , Y 12 have been defined as the amount of CO concentration in each of the months of the year, with Y 1 being the amount of CO concentration in the month of January of all years, Y 2 the amount of CO concentration in the month of February of all years, and so on. On the other hand, the variables Z 1 , . . . , Z 12 have also been defined as the amount of CO concentration grouped from two hours to two hours every day of the year, with Z 1 being the amount of CO concentration at 0:00 h and at 1:00 h, Z 2 the amount of CO concentration at 2:00 h and at 3:00 h, and so on.
The confidence intervals found for the variables Y 1 , . . . , Y 12 and Z 1 , . . . , Z 12 are shown in Figures 12  and 13. Regarding these figures, some comments can be made. For example, Figure 12a shows that the highest values are reached in April, followed by a second step in March, May and November. In addition, the variables corresponding to the months of January, February, October and December also behave similarly, with lower values than those previously mentioned. Furthermore, the reduction in CO concentration is very appreciable in June and September. Moreover, the CO concentration values in the central summer months, that is, in July and August, are the lowest of the year. Finally, with respect to the amplitudes of the confidence intervals corresponding to these last two months, it can be said that these intervals appear to be, in general, narrower than the rest of the confidence intervals.
The aforementioned for Figure 12a can be applied quite well to Figure 12b,c, with small differences due to the fact that different estimators were used.
With respect to Figure 13, this figure shows that the only medians that are the same are those of the variables Z 4 , Z 5 and Z 10 , on the one hand, and those of Z 7 and Z 8 , on the other hand. part. The rest of the variables can be assumed different from each other and from all the others. In addition, it can be seen that the hours with the lowest CO concentration are those that correspond to the time interval that begins at 0:00 h and ends at 5:00 h. Also, the highest CO concentration values occur between 6:00 h and 9:00 h, and between 18:00 h and 19:00 h. For the rest of the hours of the day, a decrease in the concentration of CO appears in the time interval that goes from 9:00 h to 15:00 h, time of day at which the CO concentration increases again. Finally, the CO concentration begins to decrease from 21:00 h until the next morning.
The aforementioned suggests the existence of a periodic behavior, which also seems to occur when studying the behavior of CO concentration for the months of the year (see Figure 12). However, this certain periodicity in the data did not emerge when these data were analyzed for years.

Discussion
From an initial statistical summary, it was observed that the values of the CO concentration at the Belisario station are values that are at a desirable level of air pollution, according to the criteria established in QAQI [30]. In addition, it was observed that all the variables present many extreme observations, where said observations are on the right, that is, for high CO concentration values. Furthermore, it was also observed that all the variables present characteristics that are compatible with the possibility that they come from heavy-tailed distributions. Specifically, the variables present medians that are clearly lower than the means, the skewness is greater than zero, the kurtosis is greater than three, and the value of the standard deviation is not small compared to the value of the mean.
Subsequently, a smoothing of the data was performed to decrease the individual influence of each of the data in particular and to highlight possible trends in the data set. This smoothing was performed in the sequence formed by the data corresponding to all the years and for sequences formed for each of the years in particular. All this brought to light that there is a tendency for the values of CO concentration to decrease as the years go by. Likewise, it was also observed that the lowest values are reached in the third quarter of the year and that the highest values occur in the second and fourth quarters of the year. These results are in agreement with the general comments made in [28] about the CO concentration in Quito.
Once the smoothing of the data was performed, an attempt was made to fit all the variables to parametric distributions, through different transformations. This was done with the aim of being able to carry out a statistical analysis applying classical inference techniques, since many observations are available for each variable. However, adequate results were not obtained.
Therefore, due to the impossibility of using classical inference, the study had to be carried out using hypothesis testing and both nonparametric confidence intervals and robust confidence intervals.
This type of exhaustive preliminary analysis, with respect to CO concentration data, is not very frequent, because in general the authors tend to assume independence between the variables, to eliminate outliers and to approximate the remaining data with known parametric distributions. For example, in [11] independence between observations was assumed, peaks in signal amplitude were detected, empirical tails of these peaks were calculated, and theoretical tails of known distributions were fitted to the empirical tails. All of this was done in [11] using classical methods. In [12], kernel principal component analysis was applied to raw data to extract non-linear characteristics from it, and then the K-nearest neighbor algorithm was used for recognition tasks.
On the other hand, in [13] statistical summaries of the data were shown, where the measures of central tendency and dispersion of the data were the following: the mean, the standard deviation, the minimum value, and the maximum value. Then, it focused on the use of the Rach model to define a coherent variable and the interrelation between variables. However, in [15] a statistical summary of the data was not shown, but the analysis was performed using classic confidence intervals at the 95% confidence level and the analysis used the two-tailed Z test for proportions. Additionally, in [16] the preliminary analysis of the data was not shown either, but the descriptive analysis of the data was done in terms of frequency and percentage.
Nevertheless, in [17] the statistical summary of the data was shown, where the measures of central tendency and dispersion used were the mean, standard deviation, range, median, and interquartile range. In addition, for the analysis, the posterior mean and the 95% posterior interval were included. Moreover, Bayesian hierarchical models were used to obtain national-average associations.
In [18], although an initial summary data statistic was not shown, it did explain in detail, exhaustively, the methodology used to discard the data that were not significant for the type of analysis of uncertainty of CO concentration that was performed in that paper. However, the statistical analysis presented in [19] did include statistical summaries of the data, where the authors relied on the mean and standard deviation. Furthermore, in [19] the authors demonstrated that the distribution of the raw data was not normal. Therefore, with the latter, they justified the use of nonparametric statistical methods to carry out the analysis of the CO concentration and other air pollutants.
Regarding the type of nonparametric analysis that was carried out in this research, it can be said that the Wilcoxon rank sum test was used here to compare the medians of the distributions of the variables under study, basing this test on the statistics of the order of samples and on the sign test. Another added value of the present study is that with the nonparametric confidence intervals constructed, the variables could be grouped into different categories, establishing similarities and differences between the data.
Once the nonparametric analysis was carried out, the categorization and discrimination of the data was conducted robustly, because this provides a more in-depth analysis of the characteristics of the CO concentration at the place where the measurements were performed. Here, for the analysis, different robust location and scale statistics were found and, some of them, were used to determine robust confidence intervals. Specifically, the following point estimators were used: the mean, the median, and the trimean. In addition, families of α-trimmed mean estimators, α-winsorized mean estimators, Andrews wave-based estimators, and biweight-based estimators were used, which are defined by the proportion of values not taken into account for the estimation. Here, it was observed that for all the years the point estimates of location were practically limited between the mean and the median. In addition, it was observed that the amount of CO concentration, although all its values were in the range of desirable values according to QAQI, decreased markedly between 2008 and 2012, and that from 2012 onwards the decrease in CO concentration was, in general, much lighter but with year-on-year rises and falls.
On the other hand, the point estimators of scale that were used were the following: the standard deviation, the mean absolute deviation, the median absolute deviation, the one-half of the fourth-spread, and the least median squares. Likewise, regarding the families of scale estimators, biweight midvariance estimators, estimators based on subranges, estimators based on the Andrew's wave, and estimators based on the Winsorized standard deviation were used. For the estimator families, values that are mentioned in the specialized literature on the subject as suitable values were chosen. The graphical representation of the scale estimators showed that these were bounded inferiorly by the least median squares and superiorly by the standard deviation. Additionally, there is a decrease from 2008 to 2012 and a stabilization from that year until 2019. The decrease is due to the decrease in the number of extreme observations and their value.
The exhaustive robust analysis that has been carried out here on the CO concentration constitutes another added value of the study. Specifically, the technical report presented in [30] does not make an in-depth statistical analysis of the variables of air pollution in Quito. In fact, in [30] only the mean and maximum values are used to analyze the behavior of the CO concentration. Therefore, the research carried out here can be used as reference material to explain how the behavior of the CO concentration in Quito has been from 1 January 2008 to 31 December 2019.
Similar research papers in which robust analysis of other air pollution variables has been performed are those shown in [31][32][33]. The results obtained in this paper are in agreement with those obtained in [31][32][33]. Examples of other research papers that are further from the topic discussed here, but that also have employed some of the robust analysis tools used in this paper are [34,35]. In all cases, the importance of the use of robust methods in the analysis of air pollution variables was highlighted.
The robust bilateral confidence intervals were found using six pairs of robust estimators, three with point estimates and three others with families of estimators from which particular values were selected. In addition, bootstrap confidence intervals were found. Here, it was observed that the confidence intervals at 95% more displaced towards higher values were the classic intervals, because they have their center in the mean, while the confidence intervals more displaced to the left were those that have their center in the median. Likewise, among the confidence intervals centered on the median, the nonparametric intervals and those found by the bootstrap method were wider than those found with robust estimators.
The confidence intervals based on the Andrew's wave and the biweight were very similar, because the location and scale estimators found in these families were also very similar. On the other hand, the confidence intervals based on α-trimmed mean location estimators, which have the Winsorized variance as variance, produced intervals between the biweight intervals and the intervals based on the Andrew's wave, on the left, and the classic intervals, on the right.
Due to all the above, the confidence intervals based on the estimators T(α), s W (α) and (T bi , s bi ) were used to compare the given variables. Again, when the variables were compared using confidence intervals, a downward trend in the CO concentration was observed between 2008 and 2012. Moreover, from 2012 onwards, fluctuations were observed with a slight tendency to decreasing the CO concentration values. Overall, the biweight-based confidence intervals were somewhat narrower than those found with α-trimmed mean. Furthermore, the classifications of the variables found with biweight were similar to those found with nonparametric estimators, the difference was that the variable X 2 (2009) was added to the category formed by X 1 (2008) and X 3 (2010).
To complete the study, the proposed robust confidence interval analysis technique was also applied to clusters consisting of CO concentration measurements of months and clusters of CO concentration measurements in groups of two hours. Here, the variables were classified and it was noted that there was a certain periodicity in both the months and the hours of the day. In this sense, it is observed that the lowest confidence intervals corresponding to the analysis of the months are in the third quarter. Moreover, with respect to the hours of the day, it is observed that there is a certain periodicity, showing minimums in the early morning hours and maximums in the early hours of the working day and in the early hours of the night.
An additional contribution of this study is that the observed periodicities have been shown in terms of robust confidence intervals at the 95% confidence level, categorizing the range of values of the possible periodic wave and measuring differences between categories with the measurement precision provided by robust statistical methods. In addition, it is important to mention that these periodicities are not fixed, but are subject to seasonal variations and even to the character of the day in particular. For example, when considering the CO concentration between 2:00 and 3:00, which is where the lowest CO concentration of the day occurs, said concentration will be different if it is measured in different months. Specifically, the amplitude of the possible periodic signal is not the same if it is measured in April, where the CO concentration is higher, as if it is measured in August, where the concentration is lower.
It is possible that the aforementioned variations in amplitude are due to the time periods in which different activities are carried out in the city. Therefore, the highest concentration of CO when analyzed for the hours is not the same if the day is a holiday or a working day. Furthermore, this means that the signal frequency is not fixed, but is also modulated.
Before concluding this section, it is important to mention that the in-depth analysis of the possible periodic waveform that the CO concentration could have, both for the months and for the hours of the day, has not been included. Therefore, this is a task that remains pending to be carried out in future research work.

Conclusions
This paper was aimed at performing the robust statistical analysis of CO concentration measurements taken at the Belisario air quality monitoring station (Quito, Ecuador) from 1 January 2008 to 31 December 2019. This is the first time that this type of analysis has been carried out at this monitoring station and its results show that said concentration tends to decrease year after year. Therefore, the measures that the city authorities have been taking in the last twelve years are giving satisfactory results.
The analysis carried out in [30] is an analysis focused on general environmental issues in the city of Quito, which could be strengthened by the in-depth study carried out in this paper. This highlights some of the possible uses of the results obtained in this research work. In this sense, it is important to highlight that in this paper the measurements were classified according to the criteria established by the Quito Air Quality Index to classify air pollution. Additionally, sets of variables were constructed, the variables were categorized, and similarities and differences were also established between the variables. All of this was performed with the precision provided by both nonparametric and robust statistical methods. In this sense, the robust analysis methodology of the CO concentration developed in this paper presents an exhaustive way of carrying out the analysis of measurements of this air pollution variable. Furthermore, one of the advantages of this methodology is that it does not require a large amount of data to carry out the analysis, as has already been demonstrated in [31,32].
In [30], it is mentioned that the main sources of air pollution in Quito are the means of vehicular transportation, which is aggravated by large traffic jams and all the industries that use bunker and fuel oil, highlighting thermo-electric power plants. Moreover, in [30] it is also mentioned that Quito is a narrow and long city, whose central part is located on the slopes of the Pichincha volcano and all travelers who have to travel from one side of the city to the other have to pass through the center of the city, generating traffic jams and consequent air pollution. On the other hand, volcanic eruptions are also sources of air pollution.
What has been said in the previous paragraph shows that, although the exhaustive robust analysis carried out in this paper showed that air pollution due to CO has been decreasing in recent years, it is necessary to improve the urban dynamics of the city. For example, it is proposed that the city comply with quality standards designed specifically for each of its most critical points, in terms of air pollution. In addition, although the quality of the means of transport in Quito have improved significantly, it is proposed to look for more efficient and less polluting means. Likewise, it is proposed to design elements that protect citizens from air pollution while walking on the sidewalks, build more urban parks as air pollution filters and keep citizens informed at all times about the level of air pollution in the city, both in the region through which they travel every day and in the area where they live. All this is in total agreement with what was said in the research work presented in [31].
Finally, based on the time intervals chosen to perform the analysis and represent the results of the research, it was observed that there is a certain periodicity in the CO concentration, both for the months and for the hours of the day. Nevertheless, this periodicity does not occur when the analysis is carried out for the twelve years under study. Therefore, this implies that modeling the possible periodicity of this type of signals is a very complex research topic, where behavior patterns that vary in amplitude, duration and instants of time in which they appear come to light. Trying to model this type of behavior of the CO concentration using mathematical tools is part of a future research work of the authors.
Author Contributions: W.H. and A.M. created the methodology of formal data analysis and the tools to implement this methodology. In addition, W.H. and A.M. performed the statistical analysis of the data, the validation of the results, and the writing of the article. It is important to say that the authorship was limited to those who have contributed substantially to the work reported. All authors have read and agreed to the published version of the manuscript.