Improving Environmental Sustainability by Characterizing Spatial and Temporal Concentrations of Ozone

Abstract: Statistical methods have been widely used to predict pollutant concentrations. However, few efforts have been made to examine the spatial and temporal characteristics of ozone in Korea. Ozone monitoring stations are often grouped geographically, and ozone concentrations are predicted separately for each group. Although geographic information is useful in grouping the monitoring stations, the accuracy of prediction can be improved if the temporal patterns of pollutant concentrations are incorporated into the grouping process. The goal of this research is to cluster the monitoring stations according to the temporal patterns of pollutant concentrations using a k-means clustering algorithm. In addition, this study characterizes the meteorology and pollutant concentrations linked to high ozone concentrations (>0.08 ppm, 1-h average concentration) based on a decision tree algorithm. The data used include hourly meteorology (temperature, relative humidity, solar insolation, and wind speed) and pollutant concentrations (O3, CO, NOx, SO2, and PM10) monitored at 25 stations in Seoul, Korea between 2005 and 2010. Results demonstrated that the 25 stations were grouped into four clusters, and that PM10, temperature, and relative humidity were the most important factors characterizing high ozone concentrations. This method can be extended to the characterization of other pollutant concentrations in other regions.


Introduction
Ozone (O3) is a secondary pollutant formed by photochemical reactions. The presence of nitrogen oxides (NOx) and volatile organic compounds (VOCs) accelerates ozone formation [1]. Major sources of NOx and VOCs include emissions from cars and trucks and from various industrial facilities [2]. Because strong sunlight plays a vital role in ozone formation, ozone concentrations increase during summer. It must be noted that ozone in the stratosphere protects ecosystems from ultraviolet rays that are harmful to human beings. However, high concentrations of ozone near the ground, which are studied in this paper, adversely affect vegetation growth and human health. Studies showed that vegetation exposed to ozone exhibited decreased photosynthesis and growth [3,4]. Fuhrer et al. (2016) indicated that elevated surface ozone levels could cause substantial reductions in agricultural yields. In addition, ozone may affect ecosystems by decreasing the species diversity of plants, animals, insects, and fish [4]. Prolonged exposure to high levels of ozone is directly linked with eye irritation and with respiratory and cardiovascular diseases, especially among children, the elderly, and patients [5]. To minimize the negative effects on public health associated with exposure to high levels of ozone, an ozone warning system was developed in Korea in July 1995.
In addition, a transition function was developed separately for four areas in Seoul to predict ozone concentrations [16]. Both studies divided the modeling domain based on geography.
Although geographic information can be useful in dividing the modeling domain, a more accurate model can be developed if the temporal patterns of pollutant concentrations are considered in the division of the domain. A previous study demonstrated that the modeling domain was successfully clustered solely by the temporal pattern of fine particulate matter (PM2.5) concentrations [17]. The study used 24-h average PM2.5 concentrations measured every third day between 2001 and 2005 at 522 sites in the United States. It used a k-means clustering algorithm based on correlation distance to investigate the similarity between temporal patterns of PM2.5 concentrations. In addition, the study demonstrated that a rotated principal component analysis (RPCA) was useful in characterizing spatial patterns of pollutant concentrations.
The goal of this study is to cluster monitoring stations based on temporal patterns of ozone concentrations. Thus, the hypothesis of this study is that ozone concentrations in Korea are associated with some meteorology and pollutants and show their own temporal and spatial patterns.
Monitors may be clustered differently depending on the choice of the algorithm. We used the k-means clustering algorithm. In addition, the characteristics linked to high ozone concentrations were analyzed using a decision tree algorithm, which considers the correlation between variables to explain the characteristics of each cluster. The threshold used to classify high ozone concentrations is 0.08 ppm of 1-h average concentration. However, if a different threshold is used, or if the threshold is selected based on 8-h average concentrations, each cluster may have different characteristics. In summary, the limitation of this study is that a similar analysis of the same data can yield slightly different results depending on the choice of the model or parameters.
This paper is organized as follows. Data used in this study are described in Section 2. Major analysis methods, the k-means clustering technique, and the decision tree algorithm are explained in Section 3. Finally, the results of the analysis and future applications of this research are presented in Section 4.

Data
This study used hourly air pollutant concentrations and meteorological variables measured at 25 monitoring stations in Seoul, Korea between 2005 and 2010 [18]. Monitors were operated by the Korea Environment Corporation [19]. Heights of monitors were between 1.5 m and 10 m above the ground [20]. Pollutant species included hourly ozone (O3), carbon monoxide (CO), nitrogen dioxide (NO2), sulfur dioxide (SO2), and fine particulate matter (PM10). Meteorological variables included hourly temperature (in °C), relative humidity (in %), solar insolation (in W·m⁻²), and wind speed (in m·s⁻¹). All air pollutants were collected every 1 h. O3, CO, NOx, SO2, and PM10 were measured using the ultraviolet photometric method, non-dispersive infrared method, chemiluminescent method, pulsed UV fluorescence method, and β-ray absorption method, respectively. Monitors were regularly inspected following the operational guideline of the air pollution monitoring system [20].
Missing data were estimated using the ordinary kriging technique, which calculates missing values by the weighted linear combination of known values (Equation (1)) [21]. Missing values were calculated using the "gstat" package of R software [22].
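The weighted linear combination behind ordinary kriging can be illustrated with a short sketch. This is not the "gstat" workflow used in the study; it is a minimal NumPy version written under assumptions: the exponential variogram, its sill and range, and the station coordinates and ozone values are all hypothetical and chosen only for illustration. The weights are constrained to sum to one, which is what distinguishes ordinary kriging from a plain regression on neighboring values.

```python
import numpy as np

def exponential_variogram(h, sill=1.0, range_param=10.0):
    """Assumed variogram model; the source does not state which model was fitted."""
    return sill * (1.0 - np.exp(-np.asarray(h, dtype=float) / range_param))

def ordinary_kriging(coords, values, target, variogram=exponential_variogram):
    """Estimate the value at `target` as a weighted sum of the known values."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(values)

    # Pairwise distances between known stations.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

    # Kriging system: variogram matrix bordered by the sum-to-one constraint.
    system = np.ones((n + 1, n + 1))
    system[:n, :n] = variogram(dists)
    system[n, n] = 0.0

    rhs = np.ones(n + 1)
    rhs[:n] = variogram(np.linalg.norm(coords - target, axis=1))

    solution = np.linalg.solve(system, rhs)
    weights = solution[:n]          # the last entry is the Lagrange multiplier
    return float(weights @ values), weights

# Hypothetical station coordinates (km) and ozone values (ppm), illustration only.
coords = [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]]
values = [0.030, 0.042, 0.036, 0.050]
estimate, weights = ordinary_kriging(coords, values, target=[2.5, 2.5])
```

Because the target in this example sits at the center of a square of stations, the four weights come out equal and the estimate is the plain average of the four values; an off-center target would shift weight toward the nearer stations.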
Because high ozone concentrations usually occur in summer, the ozone warning period begins on 1 May and ends on 15 September of each year. Therefore, the analysis of this research focused on ozone concentrations between 1 May and 15 September from 2005 to 2010.

Methodology
The modeling domain was divided by a k-means clustering algorithm. The ozone concentrations of each clustered region were analyzed by a decision tree algorithm.

k-Means Clustering Algorithm
A k-means clustering algorithm systematically groups data by minimizing variances within the cluster while maximizing variances between clusters [23]. Variances are calculated based on the correlation distance between pollutant concentrations. While the Euclidean distance only measures the difference of the pollutant concentrations, the correlation distance also considers similarity among temporal patterns [17,24].
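The difference between the two distances can be seen in a small sketch. It assumes the common definition of correlation distance as one minus the Pearson correlation coefficient, which the text does not spell out, and it uses made-up diurnal ozone profiles rather than measured data.

```python
import numpy as np

def correlation_distance(x, y):
    """Assumed definition: 1 - Pearson r (0 for identical shapes, 2 for opposite shapes)."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

# Hypothetical diurnal ozone profiles (ppm) for three stations.
hours = np.arange(24)
station_a = 0.03 + 0.02 * np.sin((hours - 9) * np.pi / 12)
station_b = station_a + 0.02                                  # same shape, higher level
station_c = 0.03 - 0.02 * np.sin((hours - 9) * np.pi / 12)    # opposite shape, same level

d_euclid_ab = float(np.linalg.norm(station_a - station_b))
d_corr_ab = float(correlation_distance(station_a, station_b))
d_corr_ac = float(correlation_distance(station_a, station_c))
```

Stations A and B differ in level but not in timing, so their Euclidean distance is clearly nonzero while their correlation distance is essentially zero. Stations A and C have the same mean level but opposite timing, and their correlation distance is close to the maximum of two.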
The k-means clustering algorithm groups monitoring stations in the following way. First, k centers are arbitrarily selected. Second, the correlation distance is calculated between each center and the other stations. Third, measurement stations are clustered so that variances within clusters are minimized while variances between clusters are maximized. Fourth, the center of each cluster is re-selected so that the variance within the cluster is minimized. The third and fourth steps are then repeated until the center of each cluster is unchanged [25].
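The steps above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes correlation distance defined as one minus the Pearson correlation, takes each new center to be the mean time series of the cluster members, and stops when the assignments (and therefore the centers) no longer change. The function names, the optional `init` argument, and the synthetic data are hypothetical.

```python
import numpy as np

def correlation_distances(X, centers):
    """Correlation distance (1 - Pearson r) between every row of X and every center."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Cc = centers - centers.mean(axis=1, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Cn = Cc / np.linalg.norm(Cc, axis=1, keepdims=True)
    return 1.0 - Xn @ Cn.T

def kmeans_correlation(X, k, init=None, n_iter=100, seed=0):
    """Cluster the rows of X (stations x time) using correlation distance."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centers (arbitrarily unless `init` indices are given).
    idx = rng.choice(len(X), size=k, replace=False) if init is None else list(init)
    centers = X[idx].copy()
    labels = None
    for _ in range(n_iter):
        # Steps 2-3: assign each station to the nearest center.
        new_labels = correlation_distances(X, centers).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the centers are unchanged too
        labels = new_labels
        # Step 4: recompute each center from its current members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

# Synthetic example: six stations following two different temporal patterns.
rng = np.random.default_rng(42)
t = np.linspace(0.0, 4.0 * np.pi, 200)
group_a = np.sin(t) + 0.05 * rng.standard_normal((3, 200))
group_b = np.cos(t) + 0.05 * rng.standard_normal((3, 200))
X = np.vstack([group_a, group_b])
labels, centers = kmeans_correlation(X, k=2, init=[0, 3])
```

With random initial centers the result can depend on the starting point, which is one reason a similar analysis of the same data may give slightly different clusters.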
The analysis was performed on hourly ozone concentrations between 2005 and 2010, measured at 25 monitoring stations. The total number of observations for each station was 19,872 (= 24 h × 138 days × 6 years) because the hourly data from 1 May to 15 September of each year were used for the analysis. To ensure the appropriate clustering of stations through the k-means clustering algorithm, a locally linear embedding method was applied to the hourly ozone data with a dimension of 25 × 19,872. The locally linear embedding method, one of the widely used dimension reduction techniques, reduces the dimension by considering the spatial distribution of pollutant concentrations [26].
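The per-station sample size can be checked with a few lines of arithmetic. The sketch below counts the days from 1 May to 15 September inclusive for the six warning seasons from 2005 to 2010, the study period given in the Data section; the function name is hypothetical.

```python
from datetime import date

def hours_per_station(first_year, last_year):
    """Hourly records per station for 1 May to 15 September (inclusive) of each year."""
    total_days = 0
    for year in range(first_year, last_year + 1):
        # +1 because both the first and the last day are included.
        total_days += (date(year, 9, 15) - date(year, 5, 1)).days + 1
    return 24 * total_days

n_hours = hours_per_station(2005, 2010)  # 24 h x 138 days x 6 years = 19,872
```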

Decision Tree Algorithm
A decision tree algorithm uses a tree structure to represent a decision rule for classifying or predicting dependent variables [27]. The independent variables were selected based on the Gini index and were used to structure the decision tree [28]. The analysis was performed using the Classification and Regression Tree (CART) software, which has been widely used to predict and classify data [29].
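How the Gini index selects a split can be illustrated with a small, self-contained sketch. It is not the CART software used in the study: it searches for the threshold on a single variable that minimizes the weighted Gini impurity of the two resulting groups, and the humidity values and class labels are made up for illustration.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for an even two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one variable."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split: impurity of the parent node
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical relative humidity (%) and ozone class for eight hours.
humidity = [45, 50, 55, 58, 62, 70, 80, 85]
ozone_class = ["high", "high", "high", "high", "low", "low", "low", "low"]
threshold, impurity = best_split(humidity, ozone_class)
```

In a full tree, the same search is repeated over every independent variable at every node, and the variable and threshold with the lowest weighted impurity define the "if-then" rule for that node.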
Results were represented in a hierarchical structure following the "if-then" rule (Figure 1). The measured pollutant concentrations were distributed in the space composed of two independent variables (X1 and X2) (Figure 1a). Here, an example that classifies circles and rectangles is shown. Each intermediate node (oval) shows which independent variable and what criterion are used to split the data into two groups (Figure 1b). The final nodes (rectangles) in Figure 1b, which show the rules used for classifying the data, correspond to the sectors of Figure 1a. The dependent variable had two classes: "high level (>0.08 ppm, 1-h average concentration)" and "low level (≤0.08 ppm, 1-h average concentration)" ozone concentrations. A threshold of 0.08 ppm was selected because ozone concentrations higher than 0.08 ppm resulted in adverse health effects among children, the elderly, and patients [30].
The growth of the tree was stopped when the number of data points in the node reached 40, and the Gini index was used as a performance measure. The criteria that characterized high ozone concentrations were determined based on the Laplace accuracy, calculated by Equation (2) [31].
where n is the number of observations in each node, n_c is the number of properly classified observations, and p is the number of categories. The number of observations (n) indicates the total number of observations in each node, while the number of properly classified observations (n_c) indicates the number of high ozone concentrations (>0.08 ppm). The number of categories (p) is two because the dependent variable (ozone concentration) is categorized into two classes: high and low concentrations.
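Equation (2) is not reproduced in this text. The sketch below assumes the commonly used form of the Laplace accuracy, (n_c + 1) / (n + p), which matches the three symbols defined above; the node counts in the example are hypothetical.

```python
def laplace_accuracy(n_c, n, p=2):
    """Assumed form of the Laplace accuracy: (n_c + 1) / (n + p).

    n_c: properly classified observations in the node (high-ozone hours here)
    n:   total observations in the node
    p:   number of categories (two: high and low ozone)
    """
    return (n_c + 1) / (n + p)

# Hypothetical node with 40 observations, 30 of them high-ozone hours.
accuracy = laplace_accuracy(n_c=30, n=40)   # (30 + 1) / (40 + 2)
```

Compared with the raw proportion n_c / n, this form pulls the estimate for small nodes toward 1 / p, so a node with very few observations cannot reach a high accuracy by chance.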

Spatial Characteristics of Ozone Concentrations
The results demonstrated that the 25 stations in Seoul were clustered into four groups: the northern, central, southern, and eastern areas (Figure 2). It must be noted that stations in the same cluster were geographically close to one another, even though the temporal pattern of ozone concentrations was primarily used to cluster the stations.
Monitoring stations in each cluster from the k-means clustering method were located close to each other in the reduced dimension (Figure 3). The results confirmed that ozone concentrations of the same cluster exhibit similar temporal patterns. Ozone concentrations differed significantly between clusters, as the p-value of the F-test of the analysis of variance (ANOVA) was close to zero (Table 1). ANOVA is a statistical technique for testing whether the means of more than two populations (groups) are all equal.
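The F statistic behind the ANOVA comparison of clusters can be sketched in plain Python. The example is illustrative only: the three groups contain made-up numbers, not the ozone data of Table 1, and the function returns only the F statistic. Turning it into a p-value requires the F distribution, for example through `scipy.stats.f_oneway`.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Made-up values for three groups, for illustration only.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F statistic means the group means differ by much more than the scatter within each group would explain, which corresponds to a p-value close to zero.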

Temporal Patterns of Ozone Concentrations
The annual mean ozone concentrations of each cluster illustrated that concentrations increased between 2005 and 2010, with significantly high concentrations in 2009 (Figure 4a). High ozone concentrations were expected in July and August because ozone formation is directly related to temperature. However, monthly average concentrations in May and June were clearly higher than those in July and August (Figure 4b). The relatively low concentrations in July and August were partly due to low insolation, which is also one of the critical factors in ozone formation. Monthly average precipitation and monthly frequency of precipitation indicated that both the total amount and the frequency of precipitation were significantly higher in July and August than in May and June (Figure 5). Therefore, average ozone concentrations were lower in July and August than in May even though the number of exceedances, in which ozone levels are higher than the standard, was larger in July and August (Table 2).
Hourly average ozone concentrations at the 25 stations had evident hourly variations. Ozone concentrations were highest in the afternoon between 3:00 p.m. and 5:00 p.m., with a small bump at dawn around 4:00 a.m. (Figure 6). The afternoon peak was partly attributed to the reaction of ozone precursor materials emitted during the morning and afternoon traffic periods. Although ozone precursor materials, such as NOx, are emitted in the morning traffic period, peak ozone concentrations occur in the afternoon because ozone formation is favored in the presence of sunlight and high temperature. In addition, the small increase in ozone concentrations around 4:00 a.m. is caused by air pollutants that are often trapped near the ground because of the low nocturnal planetary boundary layer [32].

Factors Determining High Ozone Concentrations
Meteorology and pollutant concentrations that characterized high ozone concentrations were analyzed using the decision tree algorithm. Independent variables included concentrations of four air pollutants (CO, NOx, PM10, and SO2) and four meteorological variables (temperature, relative humidity, solar insolation, and wind speed). The analysis using the decision tree algorithm in the northern area of Seoul is presented as an example (Figure 7). The criteria that resulted in high Laplace accuracy characterized high ozone concentrations (dotted line in Figure 7). The analysis demonstrated that high ozone concentrations in the northern area of Seoul were expected when the relative humidity was not more than 59.5%, the temperature was not lower than 22.75 °C, and the concentration of PM10 was not lower than 28.5 µg·m⁻³.
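The rule reported for the northern area can be written directly as an "if-then" check. The three thresholds below are the ones stated above; the function and dictionary names are hypothetical, and the input values in the example are made up.

```python
# Thresholds reported for the northern cluster of Seoul.
NORTHERN_RULE = {
    "max_relative_humidity": 59.5,   # %
    "min_temperature": 22.75,        # degrees Celsius
    "min_pm10": 28.5,                # micrograms per cubic meter
}

def is_high_ozone_expected(relative_humidity, temperature, pm10, rule=NORTHERN_RULE):
    """True when all three conditions of the decision-tree path are met."""
    return (
        relative_humidity <= rule["max_relative_humidity"]
        and temperature >= rule["min_temperature"]
        and pm10 >= rule["min_pm10"]
    )

# Made-up hourly conditions, for illustration only.
dry_warm_hour = is_high_ozone_expected(relative_humidity=50.0, temperature=25.0, pm10=40.0)
humid_hour = is_high_ozone_expected(relative_humidity=70.0, temperature=25.0, pm10=40.0)
```

Passing a different `rule` dictionary would apply the same check to another cluster, since the thresholds differ between clusters.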
Relative humidity, temperature, and PM10 concentrations were the key criteria for high ozone concentrations in all clusters in Seoul ( Table 3). The thresholds of relative humidity and the temperature that characterized high ozone concentrations were similar between clusters, while those of the PM10 concentrations were apparently different, which indicates that PM10 was unevenly distributed in Seoul.
The analysis results in this study are consistent with previous studies that used similar statistical approaches. Chu et al. (2012) used a decision tree to identify controlling factors of ground-level ozone measured at five monitors in Taiwan. The study used temperature, wind speed, relative humidity, NOx, alkanes, alkenes, and aromatic hydrocarbons as independent variables. Results varied depending on the monitoring station, but in general, temperature, wind speed, NOx, and aromatic hydrocarbons were important factors [33]. Park (2016) used meteorological variables and NO2 to find factors characterizing high ozone concentrations in Seoul, Korea. Results indicated that relative humidity and temperature, as well as NO2 concentrations, were primary factors of high ozone concentrations [34].
Various studies used statistical models to classify high ozone concentrations. However, few efforts have been made to find the regional factors that characterize high ozone concentrations through clustering monitoring stations. This study clustered monitoring stations with similar temporal patterns of ozone concentrations. Then, factors linked to the high ozone episode were analyzed in each cluster by using a decision tree algorithm. In that way, the uncertainty in estimated factors can decrease.

Conclusions
We analyzed hourly ozone concentrations measured in 25 monitoring stations in Seoul, Korea between 2005 and 2010. The k-means clustering algorithm was applied, and monitoring stations were clustered in four groups: the northern, central, southern, and eastern areas. The decision tree algorithm was useful in analyzing meteorology and various pollutant concentrations that characterized high ozone concentrations.
Ozone warnings in Seoul were reported separately for five areas divided by geographic information: the central, northeastern, northwestern, southeastern, and southwestern areas. The accuracy of prediction could be improved if ozone concentrations were predicted separately for the four clusters, because ozone concentrations within the same cluster demonstrate similar temporal variations. In future research, separate time series models of hourly ozone concentrations will be developed for each cluster. The analysis performed in this study can be further applied to ozone prediction on a national level. In addition, this method can be applied to other pollutant species in any region and time period of interest.