A Visualization Approach to Air Pollution Data Exploration: A Case Study of Air Quality Index (PM2.5) in Beijing, China

In recent years, frequent significant air pollution events in China have routinely caused panic and have become a major topic of discussion among the public and air pollution experts in government and academia. Therefore, this study proposes an efficient visualization method to represent directly, quickly, and clearly the spatio-temporal information contained in air pollution data. Data quality checking and cleansing are performed during a preliminary visual analysis presented as tables, heat matrices, or line charts, from which hypotheses can be deduced. Further visualizations were designed to verify the hypotheses and obtain useful findings. This method was tested and validated in a year-long case study of the air quality index (AQI of PM2.5) in Beijing, China. We found that PM2.5, PM10, and NO2 may be emitted by the same sources, and strong winds may accelerate the spread of pollutants. The average concentration of PM2.5 in Beijing was greater than the AQI value of 50 over the six-year study period. Furthermore, arable lands exhibited considerably higher concentrations of air pollutants than vegetation-covered areas. The findings of this study show that our visualization method is intuitive and reliable through data quality checking and information sharing with multi-perspective air pollution graphs. This method allows the data to be easily understood by the public and can inspire or aid further studies in other fields.


Introduction
Since 2011, frequent occurrences of haze in China have become a cause for panic and routinely appear as a major topic in the media and on climate websites [1,2]. Thus, considerable research in various fields has focused on air pollution, attempting to solve the problem with different approaches [3][4][5][6][7][8]. Nevertheless, in many instances, the methods are not completely intuitive and are difficult for governmental officials and the general public to understand. Visual exploration [9][10][11][12][13][14] of air pollution with spatio-temporal data is a solution that makes complex data understandable because graphical representation is relatively intuitive.
Atmospheric particulate matter is a commonly used criterion to evaluate air quality [7,15,16]. The degree of adverse health effects depends on the size and composition of the particles [16]. PM2.5 and PM10 are defined as particles with diameters of 2.5 µm or less and 10 µm or less, respectively; these parameters are usually reported using the air quality index (AQI). AQIs are calculated from the particle concentration at monitoring stations, expressed in micrograms per cubic meter [17]. According to the technical regulation on ambient air quality index (on trial) [18], the air pollution index for PM2.5 is divided into six levels, namely, 0-50, 51-100, 101-150, 151-200, 201-300, and greater than 300. With these levels, we can identify the severity of air pollution.
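The six AQI levels listed above can be expressed as a simple lookup. The following is a minimal Python sketch rather than part of the original method; the function name `aqi_level` and the treatment of fractional values (for example, 50.5 is assigned to the second level) are assumptions made for illustration.

```python
from bisect import bisect_left

# Upper bounds of the first five AQI levels listed in the text;
# any value above 300 falls into the sixth level.
AQI_UPPER_BOUNDS = [50, 100, 150, 200, 300]


def aqi_level(aqi: float) -> int:
    """Return the level (1 to 6) of an AQI value under the six ranges above.

    Assumption: a fractional value between two ranges (e.g. 50.5)
    is assigned to the higher level.
    """
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # bisect_left returns the index of the first bound >= aqi,
    # which is the zero-based level index.
    return bisect_left(AQI_UPPER_BOUNDS, aqi) + 1


if __name__ == "__main__":
    for value in (35, 50, 51, 189.28, 301):
        print(value, "-> level", aqi_level(value))
```

A lookup of this kind makes it straightforward to color a chart cell by level instead of by raw value.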
Literature in the chemical and remote-sensing fields contributes greatly to air pollution analysis. Examining air composition with related chemometric techniques can achieve effective analysis of pollutant sources [19], which is suitable for micro and local level analyses but is not satisfactory for spatio-temporal pattern exploration of larger areas. Satellite remote-sensing assessment, subsequent mapping of the spatial distribution of aerosols [4,20], and air pollution data measured by satellite-borne sensors or ground-based monitoring stations [21,22] identified a strong relationship between PM2.5 and PM10. However, complicated calibration and data processing are executed with professional software, most of which is standalone and specialized and thus cannot be easily understood or shared with the general public.
For the graphical representation of complex data, conventional plots, such as scatter plots, are used to analyze time-series data and show the correlation among various factors in air pollution exploration [7,23]. These plots are simple and have been widely used for a long time, but they cannot effectively express spatial relationships and are less attractive than the newer multi-perspective visualizations and graphs. Moreover, the rapid development of web technology allows interactive visualizations to combine various technologies [24], such as HTML for page content, CSS for aesthetics, JavaScript for interaction, and SVG for vector graphics. These technologies make sharing intuitive information highly convenient.
Many spatial distribution explorations and complex representations of air pollution [3,8,25] rely on standalone or proprietary software products like ArcGIS. VIS-STAMP [26] is software that provides tools for users to generate self-organizing maps, parallel coordinate maps, map matrices, and reorderable matrices (a type of heat map). This software allows users to visualize multivariate statistical analysis intuitively, as well as explore and understand spatio-temporal and multivariate patterns. Nevertheless, this type of exploration does not consider the extendable and sharable functions that are important for the public to acquire multiple perspectives and the latest information on air pollution.
Numerous researchers focus on systematic theories of visual exploration for spatio-temporal data. Kraak [12] claims that graphics can reveal patterns that are not necessarily visible when conventional map display methods are applied, demonstrating the usefulness of geovisualization. The idea of a generic scenario extends the idea of geovisualization following the schema developed in [9,13]. A general framework was then proposed for using aggregation in the visual exploration of massive movement data [27]. Similarly, another framework [28] was discussed for spatio-temporal visualization with the space-time cube visual method. However, these frameworks must be extended further and adapted for the visual exploration of air pollution data, considering data quality, correlations in multivariate data, and spatio-temporal pattern analysis.
Air composition analysis and remote-sensing methods require complex computations and are time consuming. Moreover, the current graphical approaches for air pollution analysis lack interactive and sharable multi-perspective visualizations. The public urgently needs rapid and reliable knowledge on air pollution to make daily decisions. Similarly, most government staff are not specialists and need an intuitive understanding of the conditions before they can take any action against increasingly serious air pollution. Thus, a visual methodology is needed for efficient and reliable exploration, particularly in the case of air pollution data, to improve the depth, readability, and accuracy of data analysis. We were motivated by the idea of visual thinking and the geovisualization concept [12], which may be well suited for the exploration of air pollution data. Building on the extensive body of previous work, our paper presents a visualization exploration method that realizes the process of observation-hypothesis-verification. This method was tested and validated in a year-long case study of the air quality index (AQI of PM2.5) in Beijing, China. The useful findings on air pollution in Beijing show that our new method extends the existing work and fills a gap in the research, focusing on visual exploration to support various applications, such as knowledge-based decision making and aided research of air pollution.

Method and Data
The workflow for visualizing air pollution data is shown in Figure 1 and consists of five blocks, namely, Database, Preliminary analysis, Hypotheses, Verification, and Application. Our previous work, a Real-time Sensor Data Provision system (ReSDaP) [29][30][31], provides on-demand data for the Database. Data are then checked and analyzed in the Preliminary analysis block through various basic visual methods, such as bar charts, heat tables, and line charts. Based on these data, hypotheses can be built and verified through further visual analytics. Finally, the findings of previous steps can be used for applications.

Data accessing and storage are displayed as Step (1) in Figure 1. The backend data are acquired from our previous work through the real-time data provision system architecture for sensor webs (ReSDaP) [29]. Air pollution data often contain time and geographic location and can include an AQI or a PM2.5 concentration measurement. The data may also include other information related to air pollution, such as weather and economic development data. These data exist in various formats and are stored, according to format, in different types of databases. Tabular data, such as tab-separated values, comma-separated values, spreadsheets, and other types of data tables, can be stored directly in a relational database, such as the open source database MySQL. Unstructured data, such as text, pictures, and videos, can be stored in a NoSQL (not only SQL) database, such as the open source database MongoDB, which can also store structured data. Through ReSDaP [29], air pollution data are directly stored in the database, providing support for data access and storage.
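As an illustration of storing tabular air pollution records in a relational database, the sketch below loads a small CSV sample into a table and counts missing values. It is not the ReSDaP implementation: Python's built-in `sqlite3` is used here as a stand-in for MySQL so that the example is self-contained, and the table name, column names, and sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample in CSV form; the column names are illustrative only.
SAMPLE_CSV = """station_id,time,pm25_aqi,pm10_aqi,no2_aqi
01,2013-02-08 00:00,120,95,60
01,2013-02-08 01:00,,90,58
02,2013-02-08 00:00,80,70,40
"""

VALUE_COLUMNS = ("pm25_aqi", "pm10_aqi", "no2_aqi")


def load_csv(conn: sqlite3.Connection, text: str) -> int:
    """Create the observation table if needed and insert all CSV rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation ("
        "station_id TEXT, time TEXT, "
        "pm25_aqi REAL, pm10_aqi REAL, no2_aqi REAL)"
    )
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        # Empty CSV fields are stored as NULL so that gaps stay visible.
        values = [float(record[c]) if record[c] else None for c in VALUE_COLUMNS]
        rows.append((record["station_id"], record["time"], *values))
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_csv(conn, SAMPLE_CSV)
missing = conn.execute(
    "SELECT COUNT(*) FROM observation WHERE pm25_aqi IS NULL"
).fetchone()[0]
print(inserted, "rows stored,", missing, "with missing PM2.5 AQI")
```

Keeping empty fields as NULL rather than zero means that later visual checks can distinguish a missing observation from a genuinely low reading.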
The Preliminary analysis block is shown as Step (2) in Figure 1. The main goal of this layer is to carry out an overall analysis of the data, perform basic data cleansing, and obtain several elementary discoveries. The next block shows the Hypotheses as Step (3). Verification is shown as Step (4), which is used to verify the previously proposed hypotheses. The main processes in this block include visualization design, data preprocessing, and visual analytics to achieve visualization flexibility with various methods, such as heat maps, parallel coordinate plots, heat circles, and calendar views. The last step is the Application block, shown as Step (5) in Figure 1. The following subsections describe these four blocks in detail.


Preliminary Analysis
The Preliminary analysis block allows users to perform preliminary global visualization, explore the data to examine possible inferences, and provide clues and guides for the next block, Hypotheses. This block enables global analysis to verify the correctness and completeness of the data, which permits preliminary data visualization with basic graphs, such as heat tables, line charts, and bar graphs, for a simple and rough overview of the data. If necessary, this step can be repeated to perform new data inspection and processing. A reliable process for data quality checking and cleaning is needed before information is extracted from the original data. Visual inspection of the raw data reveals defects, so more reliable results can be obtained once they are handled. Rearranging rows and columns by drag-and-drop and selecting subsets of data items through check boxes are convenient operations that make missing data and logical errors easier to find.
The Preliminary analysis block is rapid and easy with visualization tools. The complicated algorithms for accessing data, chemical analysis [19] of air, and remote-sensing [4,20] methods require proprietary software and knowledge to interpret the data, or are time consuming and difficult to understand. The same is true of the analysis of the spatio-temporal distribution of air pollution values [3,7,23,25,26,32]. An online interactive user interface can perform rapid basic visualizations to find preliminary results and inspect data defects with its flexible data formatting and special functions. Many visualization tools are free, open source, easy to acquire, and extendable. JavaScript is one of the techniques that make it convenient for the public to access and understand the data.

Hypotheses
Based on basic visualization and data cleansing, possible patterns in air pollution data can be hypothesized, such as relationships among multiple variables, time-related fluctuation of pollutant concentrations, and spatial distributions. This part relies on adequate basic visualization to find specific characteristics or trends shown in the charts. After the hypotheses are proposed, verification is needed, as described in the succeeding section.

Verification of the Hypotheses
Specific visualization methods were designed for testing and verifying possible inferences. To verify the hypotheses by visual analytics, a visualization design helps users determine the next steps for further analysis. Visual analytics with more refined visualization and analysis can be used to verify the ideas from the Hypotheses block. If hot spots are found in this part, more hypotheses can be built and further verification is needed. Thus, this can be a recursive process.
We achieved a better visual effect from multi-perspective visualization for spatio-temporal analysis. Conventional plots, such as scatter and line plots, are usually used for the time-series data exploration of air pollution [7] without spatial analysis and do not include the newer, more attractive graphs. We analyzed the characteristics of various types of graphs and found that the circular heat map is well suited to displaying time series data and that the calendar view [33] is useful for displaying years of daily data. These graphs provide a new perspective on time series data and a more comprehensive view than conventional line charts or matrix plots, which separate the data into several parts according to seasons or years. We also used geovisualization graphs to represent the spatial distribution of air pollution data clearly.

Application Layer
Visualization highlights the regularities in the data to be verified by later statistical analysis. Both basic visualization and visual analytics lead to discovery and provide support for applications. In the Application layer, relevant findings from the visual analytics of the Preliminary analysis and Verification blocks are summarized and linked to appropriate treatment programs. For example, if long-term positive correlations were found between carbon dioxide concentrations and temperatures all over the world, controlling the irregular emission of carbon dioxide may be an approach to address global warming. Thus, the visual display of data supports decision-making by experts and non-experts alike. In addition, anomalies seen in the charts illustrate data hot spots and allow users to identify data for further statistical analysis. At the same time, data visualization can support research using other methods in the air pollution field, and interactive findings can be shared through the web. Therefore, the Application block in the workflow can be coordinated with other systems, providing a basis for a comprehensive model of a phenomenon.
We can further stress the advantages of our method by comparing it with previous methods for visual analytics. The simple model of visualization for visual analytics [9] was too abstract and coarse to be used for the analysis of air pollution data with graphs. Andrienko and Andrienko [27] proposed a general framework using aggregation in the visual exploration of massive movement data. A similar framework for spatio-temporal visualization was suitable for movement data with the space-time cube visual method [28]. It was a problem-driven process with domain experts as users. In contrast, our method contains a data-driven process with multi-perspective spatio-temporal visualization that is easily understood by expert and non-expert users alike. Moreover, we described an efficient visual method to check data quality, which enables our method to acquire more reliable results.
Visualization value can be assessed from three viewpoints: technology, art, and empirical science [9]. Our method is efficient and extendable from the technological viewpoint and visually appealing from the artistic viewpoint. Meanwhile, it follows the "observation-hypothesis-verification" process of the scientific method and thus complies with the demands of empirical science. In Section 3, this method will be tested and validated in a year-long case study of the AQI of PM2.5 in Beijing, China.

Experimental Data
Two datasets are utilized for the case study of air quality in Beijing, the capital of China. The first dataset is historical hourly PM2.5 data (2009-2014) for Beijing obtained from the US Department of State air quality files available on its website [34], as measured at the US Embassy in Beijing. Observation values are PM2.5 concentrations in micrograms per cubic meter (µg/m³), which were transformed to AQI values for our case study. However, these data are not completely verified or validated, as indicated in the data statement. These data were used to demonstrate a practical application of the visualization approach, as well as its feasibility and efficiency. The data are referred to as the USE data in the rest of this paper.
The second dataset was created from the U-Air project [35] and comprises one year (8 February 2013 to 8 February 2014) of air quality data from 36 air quality monitoring stations in Beijing, all with geographic coordinates. The stations are shown in Figure 2. The observation data include time, AQI of PM2.5, PM10, and NO2, temperature, humidity, wind speed, and weather. Notably, the original weather data have several anomalies; thus, we decided to explore the relationship between pollution and rainy weather with precipitation data only [36]. These data are termed the U-Air data in the rest of this paper.

General Analysis and Hypotheses
Before original data records can be used to convey information, preliminary analyses are necessary to evaluate data quality and find general facts in the data itself. Moreover, data-driven hypotheses can be built to guide further analysis. In this section, we cover these steps without going into details. Readers can find more details in the Supplementary Materials, as the main goal of this paper is the flexible visual exploration method for air pollution. Missing data and logic errors may occur in the original data. Therefore, rapid data inspection is needed before the data can be used to acquire reliable findings. Visual analytics is an intuitive and clear method for this purpose. Misreading of the data can be avoided to some extent if these defects are found and processed through graphs. We show how the data quality of the original U-Air data can be verified and validated with visualization methods.
With tables or scatter plots, missing data can be easily found. As presented in Figure 3, stations 23-36 have no observations from March to October, which clearly shows the missing blocks to be handled. Moreover, logic errors can be found with this visualization type. For example, when rows are set as months and columns as weather, the relationship between weather and seasons can be observed intuitively. Beijing has a sub-humid warm temperate continental monsoon climate [37] in the Northern Hemisphere; thus, it is unlikely to snow in summer, and a record of summer snow would indicate a logic error.
These problems can be handled in several ways. Multi-source data fusion may supplement the missing data and replace the parts with logic errors. In this case study, the precipitation data were integrated to determine whether rainy weather can affect pollutant concentration. Interpolation for missing or coarse records is another commonly used method [23]. In addition, the data can be divided into parts and analyzed separately, which is the approach used for the missing data problem in this case study.
Figure 4 shows the missing data in stations 23-36 from March to October. To minimize the effect of these missing data in the analysis, we separated the data into two parts and outlined them in Table 1. The first strategy involves the time pattern analysis of PM2.5 for stations 01-22 for all months. This analysis includes visualizing the average AQI of PM2.5 by month, day, and hour. The second strategy is spatial distribution analysis of PM2.5 for observations in January, February, November, and December because all stations in these months have complete data records. With this separated analysis, more reliable exploration can be performed.
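The two-part split described above can be derived from a station-by-month coverage table. The following Python sketch shows one possible way to do this under stated assumptions: the record layout `(station, month, value)`, the function names, and the sample data are hypothetical and are not taken from the U-Air dataset.

```python
from collections import defaultdict

ALL_MONTHS = set(range(1, 13))


def coverage(records):
    """Map each station to the set of months with at least one observation."""
    months_by_station = defaultdict(set)
    for station, month, value in records:
        if value is not None:  # a missing value does not count as coverage
            months_by_station[station].add(month)
    return months_by_station


def split_for_analysis(records, all_stations):
    """Return (stations complete in every month, months complete at every station)."""
    cov = coverage(records)
    complete_stations = sorted(
        s for s in all_stations if cov.get(s, set()) == ALL_MONTHS
    )
    complete_months = sorted(
        m for m in ALL_MONTHS if all(m in cov.get(s, set()) for s in all_stations)
    )
    return complete_stations, complete_months


# Hypothetical sample: station "01" reports all year, station "23" only in
# January, February, November, and December.
sample = [("01", m, 100.0) for m in range(1, 13)]
sample += [("23", m, 90.0) for m in (1, 2, 11, 12)]
sample += [("23", 5, None)]  # a record that exists but carries no value

stations, months = split_for_analysis(sample, ["01", "23"])
print("time pattern analysis uses stations:", stations)
print("spatial analysis uses months:", months)
```

The first result corresponds to the subset used for temporal analysis, and the second to the subset used for spatial analysis.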
We performed the global visualization of the data using basic charts. Based on these visualizations, possible hypotheses were generated for the visualization design. We used the U-Air data. According to Table 1, the first strategy is used to determine whether a temporal pattern is evident in the data from stations 01-22 for all months. Through basic visualization, as shown in the heat table in Figure 4, higher PM2.5 concentrations are observed from Friday to Sunday from February to March than during other times in this period.

The line chart shown in Figure 5 indicates that the highest mean AQI for PM2.5 was in March, whereas the minimum was in November.
A heat matrix represents the correlations between variables as a color matrix. The correlation value is the Pearson product-moment correlation coefficient, r (Pearson's r for short, with a range of -1 to 1), which is calculated using Equation (1) given a series of n measurements of variables X and Y written as x_i and y_i, where i = 1, 2, ..., n. This computation yields a correlation matrix in which each element is the r value between the corresponding pair of variables.
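Equation (1) is not reproduced in this text. For reference, the standard definition of Pearson's r, which is assumed here to be the form intended, is given below using the symbols defined above, with the sample means written as \bar{x} and \bar{y}.

```latex
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
\]
```

A value of r near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship.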
A heat matrix can be used to intuitively discover the overall relevance of data; thus, it can be used for preliminary analysis of the data. As shown in Figure 7, colors represent Pearson's r, grading from red to blue. The area marked with red dotted boxes in Figure 7 indicates several primary results. The AQIs of PM2.5, PM10, and NO2 showed strong positive correlations, whereas wind speed had a negative correlation with the AQIs of PM2.5, PM10, and NO2 and with humidity. When the AQI of PM2.5 is high, the AQIs of PM10 and NO2 are also high, and the wind speed may be low.
Through the general analysis, we constructed three hypotheses and the corresponding visualization design for the air pollution exploration. The first hypothesis is that a correlation exists among the AQI of PM2.5, PM10, and NO2, and wind speed. This correlation can be visualized through scatter plots that show the relationship between any two factors. The second hypothesis is that a regular time pattern exists for air pollutants. Heat maps, including a circular heat map and a calendar view, would be an ideal visualization method. The third hypothesis is that air contaminants possess a geographical distribution. This distribution can be represented appropriately by geovisualization, a method suitable for illustrating continuous spatial distributions.


Multi-Perspective and Various Visual Analysis
We explore results in this section using multi-perspective analysis and various visual graphs. The relationship between pollutants and weather factors is analyzed first, followed by the presentation of temporal characteristics of PM2.5. Afterward, spatial features of air pollution are shown and studied with stations overlaid on a satellite map. The last subsection includes the findings on air pollution in Beijing.

Relationship between Multiple Factors
Section 2.2 introduces the hypothesis that a relationship exists among pollutants and wind speed.In this section, plots are used to visualize and explore the relationships between the pollutants and weather using the U-Air data.Figure 8 shows that the three pollutants are strongly positively correlated because the Pearson's r values are rather high.However, wind speed has a negative correlation with PM 2.5 , PM 10 , and NO 2 , which have high negative r values.Precipitation exhibits an extremely weak negative correlation with the pollutants.The strong positive correlation between PM 2.5 and PM 10 shown in Figure 8 verifies the findings in [7].Because PM 2.5 and PM 10 are defined as particles with diameters of 2.5 µm or less and 10 µm or less, respectively, the relationship reveals that the particle density distributed stably according to their sizes.The positive correlation between pollutants indicates that PM 2.5 , PM 10 , and NO 2 may be emitted by the same sources, or one may be emitted by the transformation of another through some type of chemical mechanism [38].To determine the specific reasons, a combined physical and chemical analysis of pollutants is desirable [19].

The red circle in Figure 8 marks an outlying record in spring that lies far from the main cluster. This record dates from 9 March 2013. The AQI of PM10 was extremely high at 189.28, whereas that of PM2.5 was low at 66.95. At that time, Beijing was experiencing a sandstorm caused by the northerly wind. A sandstorm carries large particles that may lead to high concentrations of PM10 while PM2.5 remains comparatively unchanged [39], which is why this point differs from the others. Figure 8 also shows that pollutants and wind speed are negatively correlated; that is, higher wind speed is associated with lower pollutant concentration. Strong winds accelerate the spread of pollutants, and mixing with fresh air may decrease pollutant concentration. Moreover, precipitation has a weak negative correlation with the pollutants. Precipitation can wash out PM2.5, but in Beijing it is replenished very quickly. Summer and autumn have strong precipitation, but there is extremely light rainfall in spring and winter; this conforms to the sub-humid, warm, temperate, continental monsoon climate of Beijing [37].
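The Pearson's r values reported in Figure 8 can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' implementation: the function name `pearson_r` and all sample values are hypothetical and are not taken from the U-Air data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily means (made-up numbers, for illustration only).
pm25 = [35.0, 80.0, 150.0, 60.0, 210.0, 45.0]
pm10 = [50.0, 95.0, 170.0, 85.0, 230.0, 70.0]
wind = [5.2, 3.1, 1.4, 3.8, 0.9, 4.6]  # wind speed, arbitrary units

r_pm = pearson_r(pm25, pm10)    # positive: the pollutants rise together
r_wind = pearson_r(pm25, wind)  # negative: PM2.5 falls as wind speed rises
print(f"r(PM2.5, PM10) = {r_pm:.2f}")
print(f"r(PM2.5, wind) = {r_wind:.2f}")
```

With real observations, an outlier such as the sandstorm record of 9 March 2013 would pull r(PM2.5, PM10) down, which is one reason to inspect the scatter plot alongside the coefficient.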

Temporal Characteristics
The USE data, which provide long-term observations, were used in this section to determine the temporal characteristics of the mean AQI values of PM2.5. The average AQI of PM2.5 over five years is visualized through two circular heat maps: one shows monthly variation and the other hourly variation. The daily concentration of PM2.5 is presented as a calendar map, which is useful for an intuitive inspection of pollution severity. As shown in Figure 9, pollutant concentrations are extremely high from 2010 to 2014. Figure 9a shows a circular heat map with the average AQI for every month in the selected years, whereas Figure 9b represents 24-h periods for the selected years. Based on the charts, AQIs of at least 150 occurred in January and February, whereas AQIs of more than 100 occurred in September, October, January, and February. Moreover, AQIs of more than 100 were concentrated from 6:00 p.m. to 4:00 a.m. during the selected years. As shown in the calendar view (Figure 10), days with a mean AQI of PM2.5 greater than 200 (heavily polluted) were concentrated from January to March and from October to December. No significant difference was observed in PM2.5 between weekdays and weekends, as shown in Figure 10. These graphs of temporal characteristics show that the average concentration of PM2.5 in Beijing was greater than the AQI value of 50 over the six-year study period; therefore, residents suffer from long-term moderate pollution. Pollution is more severe from October to March and from 6:00 p.m. to 4:00 a.m. on average. People are advised to stay indoors during those periods.
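The aggregation behind the circular heat maps and the calendar view can be sketched as a grouping of hourly records by different time keys. This is an illustrative sketch only: the helper `mean_by_key` and the sample records are hypothetical, and the threshold of 200 follows the "heavily polluted" level mentioned above.

```python
from collections import defaultdict
from datetime import datetime


def mean_by_key(records, key_func):
    """Average AQI per group; records are (timestamp, aqi) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, aqi in records:
        if aqi is None:  # skip missing observations
            continue
        key = key_func(timestamp)
        sums[key] += aqi
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical hourly records (made-up values, not the USE data).
records = [
    (datetime(2013, 1, 12, 20), 310.0),
    (datetime(2013, 1, 12, 21), 290.0),
    (datetime(2013, 1, 13, 9), 180.0),
    (datetime(2013, 7, 4, 20), 90.0),
    (datetime(2013, 7, 4, 9), None),
    (datetime(2014, 1, 5, 20), 200.0),
]

monthly = mean_by_key(records, lambda t: (t.year, t.month))  # year x month grid
hourly = mean_by_key(records, lambda t: (t.year, t.hour))    # year x hour grid
daily = mean_by_key(records, lambda t: t.date())             # calendar view

# Days whose mean AQI exceeds 200 (heavily polluted).
heavy_days = sorted(day for day, aqi in daily.items() if aqi > 200)
print(monthly[(2013, 1)], hourly[(2013, 20)], heavy_days)
```

Each dictionary maps one cell of the corresponding chart to its mean AQI, so the same function serves all three views by changing only the key.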

Spatial Characteristics
Spatial distribution of the mean AQI of PM2.5 is analyzed in this section. Observations from the 36 stations from November 2013 to February 2014 are used. We first interpolate the AQI surface from the station values by Ordinary Kriging (OK) [40], a geostatistical method in which the interpolation weights are computed from the "semivariances" (γ) of neighboring values. In Equation (2), n is the number of pairs of sample points z separated by distance h, and γ(h) is the semivariogram, which is a function of distance [41].

$$\gamma(h) = \frac{1}{2n} \sum_{i=1}^{n} \left[ z(x_i) - z(x_i + h) \right]^2 \qquad (2)$$

The basic formula for OK is Equation (3), in which λ_i is the kriging weight, Z(x_i) is the observed value at station location x_i, and \hat{Z}(x_0) is the value estimated at point x_0.

$$\hat{Z}(x_0) = \sum_{i} \lambda_i Z(x_i) \qquad (3)$$

With Equations (2) and (3), provided by Clark [42], we created the geovisualization heatmap of Beijing air pollution shown in Figure 11. The interpolated heatmap presents higher concentrations of pollutants in red and lower concentrations in blue. Air pollution increases from north to south. The land use map of 2011 [43] shows that the northern area of Beijing is mainly wood and orchard land, the central area is urban land, and the south is mainly arable land with a few residential areas. This can be inspected further through the satellite map of the area.

A satellite map with the 36 monitoring stations overlaid is shown in Figure 12. Stations 30, 31, and 32 are located in areas covered with vegetation. These stations showed lower concentrations of PM2.5 than the other sites. The north and west of Beijing have higher terrain, are densely covered by vegetation, and are sparsely populated. Station 32 had the lowest pollutant concentration (Figure 6). As shown in Figure 12a, station 32 is surrounded by mountains and a lake and is far from residential areas. Figures 6 and 11 show that the central area (stations 03-04, 06-16, and 20-22) has average air pollution concentrations. The southern areas of Beijing have a mix of arable and residential lands, and the corresponding stations (IDs 05, 17-19) have relatively higher AQIs. Stations 34-36 at the south end of Beijing, a major part of which is arable land, show the highest concentrations in the city. The figures in this section suggest that air pollutants exhibit a strong spatial pattern. Vegetation, terrain, and land use influence air contamination. Vegetation can reduce the diffusion of atmospheric pollutants and absorb some pollutants. Arable lands have considerably higher air pollutant concentrations than vegetation-covered areas.
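The interpolation step can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the procedure used to produce Figure 11: the linear semivariogram model `gamma`, the four station coordinates, and the AQI values are all hypothetical, and the weights are obtained from the standard ordinary kriging system, in which they are constrained to sum to 1.

```python
import math


def empirical_semivariance(points, values, h, tol):
    """Half the mean squared difference over all pairs of points whose
    separation distance lies within tol of h (the semivariance at lag h)."""
    total, n = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(math.dist(points[i], points[j]) - h) <= tol:
                total += (values[i] - values[j]) ** 2
                n += 1
    return total / (2 * n) if n else None


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ordinary_kriging(points, values, target, gamma):
    """Estimate the value at target as a weighted sum of observations,
    with weights from the ordinary kriging system (sum of weights = 1)."""
    n = len(points)
    a = [[gamma(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])  # unbiasedness constraint row
    b = [gamma(math.dist(p, target)) for p in points] + [1.0]
    weights = solve_linear(a, b)[:n]  # last unknown is the Lagrange multiplier
    return sum(w * z for w, z in zip(weights, values)), weights


def gamma(h):
    """Hypothetical linear semivariogram model (an assumption)."""
    return h


# Hypothetical stations on a unit square with made-up mean AQI values.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [120.0, 160.0, 100.0, 140.0]

lag_1 = empirical_semivariance(points, values, h=1.0, tol=0.1)
estimate, weights = ordinary_kriging(points, values, (0.5, 0.5), gamma)
print(lag_1, round(estimate, 6), [round(w, 6) for w in weights])
```

In practice a semivariogram model would be fitted to the empirical semivariances before solving for the weights; the linear model here only keeps the example short.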

Conclusions
A visual exploration method was proposed to analyze air pollution data, which enables rapid processing and multi-perspective exploration of air pollution data to reveal spatio-temporal patterns and basic relationships among multiple variables. The developed method follows the observation-hypothesis-verification process of the scientific method and thus complies with the demands of empirical science. The proposed visual exploration method achieved rapid processing and accurate air pollution results for Beijing to guide the daily lives of residents and government decisions. Based on a series of multi-perspective visualizations of PM2.5 data for Beijing, we conclude that the following propositions are suitable topics for further empirical study:
(1) A strong correlation existed between pollutants and wind speed. The positive correlation between pollutants indicates that PM2.5, PM10, and NO2 may be emitted by the same sources, or one may be produced by the transformation of another through some type of chemical mechanism. Pollutants and wind speed were negatively correlated because wind accelerated the spread of pollutants and mixing with fresh air may reduce pollutant concentration.
(2) Temporal characteristics were found through visual analytics. The average concentration of PM2.5 in Beijing was greater than the AQI value of 50 over the six-year study period; therefore, residents suffer from long-term moderate pollution. The concentrations are more serious from October to March and from 6:00 p.m. to 4:00 a.m. on average. People should stay indoors during these time periods.
(3) Spatial distribution of air pollutants was also determined through geovisualization. Vegetation, terrain, and land use influenced air contamination. Vegetation could reduce the diffusion of atmospheric pollutants and absorb some pollutants. Arable lands had considerably higher concentrations of air pollutants than vegetation-covered areas.
The findings obtained in this study can be used as a reference for further statistical analysis. By incorporating remote-sensing data, AQI data can be used to evaluate air quality at a regional or global extent. With the hot spots identified by our method, severely polluted periods can be easily determined for air composition and transformation analysis from the physical and chemical perspectives; thus, the emission source can be more easily confirmed. Our future work will combine passive remote sensing and Mie/Raman LiDAR and incorporate additional models for more comprehensive analysis based on our previous studies [44]. Moreover, our method can be extended to visual analyses in other domains, such as soil pollution, climate change, and urban sprawl.
Third, PivotTable is an efficient and intuitive basic visual analysis tool that can create heat maps and bar, line, and pie charts using interactive drag-and-drop operations.
Finally, users can easily use PivotTable to perform basic statistical analysis, such as calculating the mean, sum, and median. If necessary, customized TSV data can be exported as input to the next layer for further visualization and analysis.

Figure A1. PivotTable Graphical User Interface (GUI). A to F represent function areas.
Figure A1 is the PivotTable GUI, showing its basic function areas. As seen in the figure, A and B are drop-down boxes that display options for users. A is the renderer, used to select the type of visualization, such as table, heat map, or line chart. B is the aggregator, used to select the quantity to display, such as the mean AQI of PM2.5. C is the container for the attributes, D is the Row area, and E is the Column area; attributes are dragged from C into D and E for visualization. Attribute values can be selected for analysis; for example, we can choose January from the 'month' attribute. F is the Data Visualization area, where the renderer builds an appropriate visualization.
PivotTable has data preprocessing capabilities. A global analysis visualization program can be used to preprocess data. By dragging and dropping attributes to the rows and columns and selecting TSV output in the renderer, the corresponding data are shown in the Data Visualization area. This custom function makes it easy and efficient to export data that can be used directly in further analysis applications.
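The pivot-and-export step described above can be mimicked outside the browser. The following is a rough Python analogue written for illustration; it does not use or reproduce the PivotTable.js API, and the function names `pivot` and `to_tsv`, the attribute names, and the sample records are all hypothetical.

```python
import csv
import io
import statistics
from collections import defaultdict

# Aggregators comparable to the mean, sum, and median options.
AGGREGATORS = {"mean": statistics.mean, "sum": sum, "median": statistics.median}


def pivot(records, row_attr, col_attr, value_attr, aggregator="mean"):
    """Group records by (row, column) and aggregate the value attribute."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec[row_attr], rec[col_attr])].append(rec[value_attr])
    agg = AGGREGATORS[aggregator]
    return {key: agg(vals) for key, vals in cells.items()}


def to_tsv(table, row_attr, col_attr):
    """Render the pivot result as tab-separated text; empty cells stay blank."""
    rows = sorted({r for r, _ in table})
    cols = sorted({c for _, c in table})
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([f"{row_attr}/{col_attr}"] + cols)
    for r in rows:
        writer.writerow([r] + [table.get((r, c), "") for c in cols])
    return out.getvalue()


# Hypothetical records (made-up values, for illustration only).
records = [
    {"month": "January", "weather": "sunny", "aqi": 120.0},
    {"month": "January", "weather": "sunny", "aqi": 180.0},
    {"month": "January", "weather": "foggy", "aqi": 260.0},
    {"month": "July", "weather": "rainy", "aqi": 70.0},
]

table = pivot(records, "month", "weather", "aqi", "mean")
tsv = to_tsv(table, "month", "weather")
print(tsv)
```

Putting 'month' on the rows and 'weather' on the columns gives a table of mean AQI per month and weather type, and the TSV text can be passed to a later visualization step.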

D3.js
Data-Driven Documents (D3) [46] is a JavaScript library that uses HTML, SVG, and CSS to visualize data. Similar to PivotTable, it is a lightweight library with a core file size of less than 150 KB. It is data-driven, inspecting and manipulating data through the Document Object Model (DOM). D3 has detailed API documentation and an abundance of examples, and it is maintained and updated frequently. Thus, D3 is very suitable for statistical visualization.

Appendix B. Data quality check of the U-Air data
It has been noted in the paper that the U-Air data used in the case study has some defects. This appendix supports that point with a closer inspection of the original data. We use another visualization to show the missing data and a table to present the anomalous weather records. In Figure A2, the red dotted rectangle marks the area of missing data. The black line is the regression line of the observations. The scatter plot is another visual method for checking missing data. It confirms the results shown in Figure 3 in Section 3.1, identifying that data are missing from some stations from March to October.
As seen in Figure A3, snowy weather was recorded from April to June, which is implausible for Beijing (39.92 N, 116.44 E) and indicates incorrect records; no sunny weather was recorded from June to October, which also suggests data anomalies. Considering the defects identified in this figure, we decided to replace the weather records in the U-Air data.
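Both checks can also be expressed as simple rules. The sketch below is only illustrative: the record layout, the helper names, and the sample observations are hypothetical, and the rule that snow from April to June is implausible is taken from the discussion of Figure A3 above.

```python
def missing_station_months(observations, stations, months):
    """Return (station, month) pairs that have no observation at all."""
    seen = {(o["station"], o["month"]) for o in observations}
    return sorted((s, m) for s in stations for m in months if (s, m) not in seen)


# Weather labels treated as implausible in the given months (April to June
# for snow, following the text; the mapping itself is an assumption).
IMPLAUSIBLE = {"snowy": {4, 5, 6}}


def implausible_weather(observations):
    """Return records whose weather label is implausible for their month."""
    return [o for o in observations
            if o["month"] in IMPLAUSIBLE.get(o["weather"], set())]


# Hypothetical observations (made-up values, not the U-Air data).
observations = [
    {"station": "01", "month": 1, "weather": "snowy", "aqi": 210.0},
    {"station": "01", "month": 5, "weather": "snowy", "aqi": 95.0},
    {"station": "02", "month": 1, "weather": "sunny", "aqi": 160.0},
]

gaps = missing_station_months(observations, ["01", "02"], [1, 5])
suspect = implausible_weather(observations)
print(gaps)     # station-month cells with no data
print(suspect)  # records flagged for review
```

Rule-based flags of this kind complement the visual check: the table or scatter plot reveals where to look, and the rule lists the affected records.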

Appendix C. Data processing
This part focuses on the design for visualization and data preprocessing. We developed three hypotheses, together with a corresponding visualization design and implementation tools to test these propositions. Data preprocessing is also considered.

Data missing
Figure A3. The average AQI of PM2.5 as shown in PivotTable; rows represent months, and columns contain the weather data. With PivotTable.js, this table can be reproduced by setting the row to "month name" and the column to "weather name".

Figure 1. Visualization workflow for air pollution data.

Figure 2. Distribution of Beijing observation stations. The blue star represents the US Embassy (39.92 N, 116.44 E), located in the city center.

Figure 3. Missing data identified through an organized table.

Figure 4. Heat table of month and week mean air quality index of PM2.5.

Table 1. Visualized design for discovery of defects in the U-Air data.

Figure 5. Monthly mean values of PM2.5. Colored lines represent days from Monday to Sunday.

Figure 6. Area chart displaying the mean AQI of PM2.5 on each observation site (id: 01-36) in January, February, November, and December.

Atmosphere 2016, 7, 35

Figure 7. Correlation matrix of the U-Air data.

Figure 8. Scatter plots showing relationships among PM2.5, PM10, NO2, wind, and precipitation. Different colors represent the four seasons, and r is the correlation coefficient of two factors. The red circle marks a special point that is markedly different from the others.

Figure 9. Circular heat maps for 2010 to 2014: (a) year and month circular heat map showing the monthly air pollution concentration; and (b) year and hour circular heat map showing the daily air pollution concentration.

Figure 10. Calendar view: the relationship between pollutants and days from 2009 to 2014.

Figure 11. Average AQI heatmap of PM2.5 from November 2013 to February 2014 in Beijing.

Figure 12. Distribution of monitoring stations (labels are IDs) overlaid on a satellite map. Three areas are enlarged to show the conditions of the corresponding land surface. (a) The lowest average observation was at station 32; (b) the highest was at station 36; (c) the central area is also enlarged to show the locations of stations and the US Embassy, which is marked in yellow.

Figure A2. A scatterplot for 12 months and 36 stations.

Table A1. Formatting parameters and corresponding return values.