1. Introduction
In recent years, economic growth has made environmental problems increasingly prominent, and air pollution is receiving unprecedented attention [1, 2, 3, 4]. The air-quality index (AQI) is a number used by government agencies to communicate to the public how polluted the air currently is. As the AQI increases, a growing share of the population is likely to experience increasingly severe adverse health effects [5, 6, 7]. Computing the AQI requires air-pollutant concentrations from a monitor or model, such as carbon monoxide (CO), carbon dioxide (CO2), hydrocarbons (HC), nitrogen oxides (NOx), and solid particulate matter (PM2.5 and PM10). To reflect air quality and its development trends in a timely and accurate manner, we need accurate air-quality-monitoring equipment.
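As a rough illustration of how an index is derived from a measured concentration, index schemes of this kind typically map each pollutant concentration to a sub-index by linear interpolation between tabulated breakpoints and report the largest sub-index as the AQI. The sketch below assumes that scheme. The function names and the breakpoint table are illustrative assumptions: the values are modeled on a 24-hour PM2.5 table of the kind used in China's HJ 633-2012 standard and should be checked against the applicable standard before use.

```python
from bisect import bisect_right

# Assumed, illustrative breakpoints for 24-hour PM2.5 (micrograms per cubic meter)
# and the sub-index value attached to each breakpoint. Not taken from this paper.
PM25_CONC_BP = [0.0, 35.0, 75.0, 115.0, 150.0, 250.0, 350.0, 500.0]
PM25_INDEX_BP = [0.0, 50.0, 100.0, 150.0, 200.0, 300.0, 400.0, 500.0]


def sub_index(conc, conc_bp=PM25_CONC_BP, index_bp=PM25_INDEX_BP):
    """Map a concentration to a sub-index by piecewise-linear interpolation."""
    if conc < 0:
        raise ValueError("concentration must be non-negative")
    if conc >= conc_bp[-1]:
        return index_bp[-1]  # cap at the top of the scale
    hi = bisect_right(conc_bp, conc)  # first breakpoint strictly above conc
    lo = hi - 1
    slope = (index_bp[hi] - index_bp[lo]) / (conc_bp[hi] - conc_bp[lo])
    return index_bp[lo] + slope * (conc - conc_bp[lo])


def aqi(sub_indices):
    """Overall index: the worst (largest) pollutant sub-index."""
    return max(sub_indices)


print(sub_index(55.0))  # halfway between the 35 and 75 breakpoints -> 75.0
```

Because the overall index is a maximum over pollutants, a single poorly measured pollutant can dominate the reported value, which is one reason accurate monitoring matters.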
However, it is unrealistic to install large amounts of monitoring equipment across different areas of a city because construction and subsequent maintenance are expensive [8]. With only a limited amount of monitoring equipment available, a well-chosen station layout is the first prerequisite for accurately inferring air quality. At the same time, with rapid urbanization and the continuing growth in the number of motor vehicles, the existing monitoring stations are too few to accurately reflect the air-quality distribution across an entire urban area.
Thus far, there has been little research on methods for optimizing regional air-quality-monitoring networks, and there is no uniform standard for recommending optimal monitoring equipment locations. We therefore need a framework that, given the existing stations, recommends the locations for new monitoring equipment that bring the greatest accuracy improvement to the inference model. Such a framework both meets the accuracy requirements for air-quality inference and reduces economic costs.
Recent research on location selection falls mainly into two categories: knowledge-driven approaches and data-driven approaches.
The knowledge-driven approaches use mathematical models [9, 10] and physics knowledge [11] to solve the location recommendation problem through computational simulation. In addition, the United States, Europe, and Japan have established regional air-quality-monitoring networks focused on photochemical smog pollution and aerosol pollution [12, 13, 14]. These efforts adopted a range of optimization techniques to determine the layout of network sites, including statistical methods such as correlation analysis and cluster analysis [15, 16] and mathematical methods such as multi-objective optimization [17, 18]. However, reaching a stable state in simulation requires complex system programming and consumes substantial computing power. The simplifications and stationarity assumptions used in modeling may also be unrealistic, which further degrades the effectiveness of these models.
Recently, data-driven approaches to location recommendation have been developed based on fixed monitoring stations [19] or taxi GPS trajectories [20, 21], but they do not consider the spatial structure of the road network, such as road segment length, road type (highways, main roads, and streets), and point-of-interest (POI) density. Kang et al. [22] transformed station recommendation into a graph problem that aims to cover the urban area with the fewest monitoring stations, but they did not consider the influence of complex external factors, which can produce geographically non-smooth values in the air-quality distribution. Hsieh et al. [23] studied label propagation [24] on graphs and considered external influential factors but failed to capture the spatiotemporal correlation between nodes in the graph.
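To make the idea of propagating labels over a graph concrete, the following minimal sketch fills in values at unlabeled nodes by repeatedly averaging over their neighbors while holding the observed nodes fixed. It is a generic textbook-style scheme, not the method of Hsieh et al. [23]; the function name, the graph, and the values are hypothetical.

```python
def propagate_labels(neighbors, known, iterations=200, tol=1e-9):
    """Estimate values at unlabeled nodes by iterative neighbor averaging.

    neighbors: dict mapping each node to a list of adjacent nodes.
    known: dict mapping observed nodes to their (fixed) values; must be non-empty.
    """
    start = sum(known.values()) / len(known)  # neutral initial guess
    values = {node: known.get(node, start) for node in neighbors}
    for _ in range(iterations):
        max_change = 0.0
        for node, nbrs in neighbors.items():
            if node in known or not nbrs:
                continue  # observed nodes stay fixed; isolated nodes keep the guess
            new_value = sum(values[m] for m in nbrs) / len(nbrs)
            max_change = max(max_change, abs(new_value - values[node]))
            values[node] = new_value
        if max_change < tol:
            break
    return values


# Hypothetical chain of four regions; only the two ends are monitored.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
estimates = propagate_labels(chain, known={"a": 30.0, "d": 90.0})
print(round(estimates["b"], 3), round(estimates["c"], 3))  # 50.0 70.0
```

The result varies smoothly between the two observed ends, which is exactly the behavior that breaks down when external factors make neighboring regions differ sharply.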
In this paper, we address a practical problem: given the existing stations, how to recommend optimal locations for new air-quality-monitoring equipment so as to maximize the accuracy of the air-quality-inference model and thus reflect the air-quality distribution and its development trend in a timely and accurate manner. This task is challenging for the following reasons:
(1) The air-quality distribution is affected by many complex external factors (such as weather, traffic volume, and land use). Air quality in commercial centers with heavy traffic is often worse than in parks or near lakes. These external factors make the data geographically non-smooth, so it is difficult to obtain an accurate air-quality distribution in unobserved areas through interpolation-based methods. For example, Figure 1 shows a real-time record for a certain day from the Beijing air-quality-monitoring stations. The PM concentration data are not smooth: the monitoring stations in the red circle are geographically very close, but their PM concentrations vary greatly throughout the year. The most likely reason is that the stations with low PM concentrations are close to parks or lakes, whereas the stations with high PM concentrations are located near commercial centers or main roads with heavy traffic.
(2) There is spatiotemporal interaction between nodes in the urban spatiotemporal graph: at any given time, the air-quality value of a node is affected by other nodes, and at any given location, the air-quality values of a node at different times are correlated.
(3) There is a complex correlation between the monitoring-station location recommendation model and the air-quality-inference model. The task of the recommendation model is to select optimal locations for new monitoring stations so as to maximize the accuracy of the air-quality-inference model that infers the air-quality distribution of the entire urban area. The two models are therefore closely related.
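The difficulty that challenge (1) raises for interpolation can be seen in a small sketch. The code below uses inverse distance weighting (IDW), chosen here only as a familiar representative of interpolation-based methods; the station coordinates and readings are hypothetical.

```python
import math


def idw_estimate(target, stations, power=2.0):
    """Inverse-distance-weighted estimate at `target` = (x, y).

    stations: iterable of (x, y, value) tuples for observed locations.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        dist = math.hypot(x - target[0], y - target[1])
        if dist == 0.0:
            return value  # target coincides with a station
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two hypothetical nearby stations with very different readings,
# e.g. one beside a park and one beside a main road.
stations = [(0.0, 0.0, 40.0), (2.0, 0.0, 160.0)]
print(idw_estimate((1.0, 0.0), stations))  # 100.0
```

The estimate at the midpoint depends only on distance, so it is the same whether that location is a park or a busy intersection. Geography alone cannot recover the non-smooth variation driven by external factors.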
Current research methods neglect either the urban spatial structure or the influence of external factors on the air-quality distribution. To solve the main research problem of this paper, the first question is therefore how to accurately infer the air-quality distribution, because the performance of the air-quality-inference model has a crucial impact on the subsequent monitoring-station location recommendation. The next, most direct question is how to associate the monitoring-station location recommendation model with the air-quality-inference model so as to recommend optimal station locations. Existing methods cannot capture the correlation between nodes because of the performance limitations of their models, so the station locations they recommend are not optimal.
To address these challenges, we formulate monitoring-station location selection as a node recommendation problem on an urban spatiotemporal graph (USTG), in which each node represents a region with time-varying air-quality values, and we propose a two-step learning framework.
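The general shape of a node recommendation loop that is tied to an inference model can be sketched as a greedy search: try each unlabeled node, measure the resulting inference error, and keep the node that helps most. This is only a simplified stand-in under stated assumptions, not the framework proposed in this paper; `recommend_nodes`, the `error_of` callback, and the toy error function are all hypothetical.

```python
def recommend_nodes(labeled, candidates, error_of, budget):
    """Greedily pick up to `budget` candidate nodes that most reduce error.

    error_of(frozenset_of_labeled_nodes) -> float stands in for retraining
    and evaluating an inference model; it is a hypothetical callback.
    """
    labeled = set(labeled)
    remaining = set(candidates) - labeled
    chosen = []
    for _ in range(budget):
        if not remaining:
            break
        # Score each candidate by the error after adding it;
        # ties are broken in favor of the smaller node id.
        best = min(remaining, key=lambda n: (error_of(frozenset(labeled | {n})), n))
        chosen.append(best)
        labeled.add(best)
        remaining.remove(best)
    return chosen


# Toy stand-in: ten regions on a line; the "error" is the total distance
# from each region to its nearest monitored region.
REGIONS = range(10)


def toy_error(labeled):
    return float(sum(min(abs(r - s) for s in labeled) for r in REGIONS))


picks = recommend_nodes(labeled={0}, candidates=REGIONS, error_of=toy_error, budget=2)
print(picks)  # [6, 3]
```

Each round re-evaluates the inference error with the previously chosen nodes already included, so later picks account for earlier ones. In practice the callback would be far more expensive than this toy function, which is the cost a learned ranking aims to avoid.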
The main contributions of this research are as follows:
(1) We propose a variant of the graph convolutional network (GCN) called the higher-order graph convolutional network (HGCN). An ordinary GCN captures the spatiotemporal correlation of one-hop neighbor nodes only. Compared with an ordinary (one-hop) GCN, the HGCN has a larger receptive field and can therefore capture information from more, higher-order neighbor nodes.
(2) We design an accurate air-quality-inference model (HGCNInf) based on the proposed HGCN to infer the urban air-quality distribution. We apply the graph convolutional network to the field of air-quality monitoring for the first time. By modeling the urban area as a USTG, HGCNInf can accurately infer the air quality of an entire urban area: the HGCN effectively captures the spatiotemporal interactions of the air-quality distribution, and a fully connected neural network extracts features from the external influential factors.
(3) We analyze and correlate the air-quality-inference model with the monitoring-station location recommendation model. Using the convolution weight parameters of HGCNInf, we design a GMIE based on the degree of correlation between nodes in the USTG, which reflects the spatiotemporal changes in air quality. Through a node incremental learning method, we iteratively mark the recommendation priority of unlabeled nodes according to their ability to improve the inference accuracy of HGCNInf. The recommended optimal node is the one that brings the greatest accuracy improvement to HGCNInf.
(4) We evaluate our model on Beijing air-quality data from 1 January 2015 to 31 December 2015; the experimental results show that our approach substantially outperforms state-of-the-art baseline methods.
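The difference in receptive field between a one-hop and a higher-order graph convolution, mentioned in contribution (1), can be illustrated with a minimal sketch. It assumes one common way of reaching multi-hop neighbors, namely mixing powers of the symmetrically normalized adjacency matrix with self-loops; this is not necessarily the HGCN formulation used in this paper. Scalar node features and scalar per-hop weights replace the learned weight matrices of a real layer, and all names are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a square adjacency matrix."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    degree = [sum(row) for row in with_loops]
    return [[with_loops[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def k_hop_conv(adj, x, hop_weights):
    """Compute sum_k hop_weights[k] * A_norm^k x (scalar features, no nonlinearity)."""
    a_norm = normalized_adjacency(adj)
    out = [0.0] * len(x)
    hop_signal = list(x)  # A_norm^0 x
    for k, weight in enumerate(hop_weights):
        if k > 0:
            hop_signal = matvec(a_norm, hop_signal)  # advance one more hop
        out = [o + weight * h for o, h in zip(out, hop_signal)]
    return out


# Path graph 0 - 1 - 2 - 3 with a signal only at node 0.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
signal = [1.0, 0.0, 0.0, 0.0]

one_hop = k_hop_conv(path, signal, hop_weights=[0.0, 1.0])
two_hop = k_hop_conv(path, signal, hop_weights=[0.0, 1.0, 1.0])
print(one_hop[2], two_hop[2] > 0.0)  # 0.0 True
```

With a single hop, node 2 receives nothing from node 0; adding the second-order term lets the signal reach it in one layer, while node 3 (three hops away) is still unaffected.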