RAQ–A Random Forest Approach for Predicting Air Quality in Urban Sensing Systems

Air quality information such as the concentration of PM2.5 is of great significance for human health and city management. It affects how people travel, urban planning, government policies and so on. However, in major cities there is typically only a limited number of air quality monitoring stations. Meanwhile, air quality varies across urban areas and there can be large differences, even between closely neighboring regions. In this paper, a random forest approach for predicting air quality (RAQ) is proposed for urban sensing systems. The data generated by urban sensing include meteorology data, road information, real-time traffic status and point of interest (POI) distribution. The random forest algorithm is exploited for data training and prediction. The performance of RAQ is evaluated with real city data. Compared with three other algorithms, this approach achieves better prediction precision. The experiments show that air quality can be inferred with high accuracy from the data obtained through urban sensing.


Introduction
As urbanization leads to urban community growth, the transportation infrastructure dependent on fossil fuels expands as well [1]. The growing popularity of vehicle use gives rise to an increase in traffic-related pollutant emissions. Urban air pollution is a major problem in both developed and developing countries, as atmospheric pollutants have a great effect on human health. Numerous illnesses such as lung cancer may be caused by various atmospheric pollutants [2]. In addition, other serious environmental problems can also result from air pollution, such as acid rain and the greenhouse effect. For example, SO2 and NO2 are the main causes of acid rain [3], while CO2 and N2O are the main contributors to the greenhouse effect [3]. Recently, especially in China, environmental problems have become a major concern in big cities such as Beijing and Shanghai. In Beijing, the primary sources of pollutants include exhaust emissions from the city's more than five million motor vehicles, coal burning in neighboring regions, dust storms from the north and local construction dust [4]. A particularly severe smog engulfed Beijing for weeks in early 2013, elevating public awareness to unprecedented levels and prompting the government to roll out emergency measures [4]. Air pollution monitoring is thus becoming more and more significant. Real-time air quality information, such as the concentrations of PM2.5, PM10 and NO2, is important for pollution management and for protecting people from the damage caused by air pollutants.
Considering the significance of air quality, governments monitor it by establishing air quality monitoring stations. However, because of the high cost of setting up and maintaining these facilities, cities do not have enough stations. For example, Figure 1 shows the Google Map of Shenyang City. The red pins represent the 11 air quality monitoring stations. Among them, S1 is located in a college; S2, S3, S4, S6 and S8 are located on the roofs of buildings; S5 and S9 are located along roads; S10 is located in a park; S11 is located near factories. Only these 11 stations cover more than three thousand square kilometers of downtown area in Shenyang. Another example is the comparison between London and Beijing: the area of Beijing is 10 times that of London, but its number of monitoring stations is less than one fourth of London's [5]. One station can only monitor an area of limited size, so precise air quality reports cannot be generated for many areas.
Air quality can also change sharply over time: on two consecutive days, the AQI at S3 rose from 55 to 408 in the morning between 6 May 2015 and 7 May 2015 [6]. Figure 2b shows that air quality changes follow different rules in different locations. For example, whether stations are a short distance apart, like S5 and S6, or a long distance apart, like S5 and S10, they showed different changes between points in time. It is hard to capture these changes in a general function that applies to all locations, so we cannot come up with a general formula to predict the air quality in a certain time slot. Therefore, how to infer the air quality in the areas without stations is a challenging and meaningful topic.
In this paper, we propose an algorithm to infer air quality throughout the city. For an urban sensing system, an algorithm (RAQ) based on the random forest concept is proposed to predict urban air quality from historical air quality data, meteorology data, historical traffic and road status, and POI distribution information. These data are collected from all kinds of urban sensors such as weather monitoring stations. This method absorbs the factors that traditional mathematical models require but that are inaccessible in practice: we cannot take factors such as vehicle emissions and factory emissions into account directly, as accurate data about them are hard to obtain. Replacing them with readily available features benefits both the computation and the prediction accuracy.
At the same time, all the features used in this paper are much cheaper to obtain than accurately measured data from monitoring stations. No equipment cost is required in this approach. As for accuracy, this algorithm performs better than some other classical ones and the overall results can provide meaningful references to citizens. Regarding scalability and extensibility, more potentially related features such as human mobility can be input into this algorithm without significant changes. The algorithm itself is also robust enough for even higher dimensions.
The remainder of this paper is organized as follows: Section 2 presents related work. The problem description and formulation are presented in Section 3. In Section 4, the system framework and the RAQ algorithm are proposed. Extensive experiments are reported in Section 5. We conclude and outline the directions for future work in Section 6.

Related Work
In the past decades, many studies on air quality inference have been done using approaches such as dispersion models, satellite remote sensing and wireless sensor networks. Air pollution dispersion models are tools that use a mathematical model such as the Box model [7], Gaussian model [8], Lagrangian model [9], Eulerian model [10], SLAB model [11] or some mixed model to simulate how air pollution disperses in the atmosphere. The classical dispersion models are mainly functions of meteorology, traffic volumes, building distributions and so on. These models depend mainly on experience and the parameters above to simulate pollution dispersion, but some other potential factors, such as human mobility and concentrations, are not taken into consideration. Meanwhile, dispersion models depend on access to relatively accurate data, such as the strength of pollutant sources, wind speed and traffic emissions, whose accuracy cannot be guaranteed in certain conditions. For example, wind speed may vary a lot between regions because buildings obstruct the wind and modify its circulation between and over structures. Accurate traffic emissions are also hard to obtain; we can only estimate them from fuel consumption and distances travelled.
Satellite remote sensing technology is another possible way to monitor air quality. Research on using satellites to monitor air conditions has developed quickly in the past decades. For example, Liu et al. came up with an approach using satellite remote sensing technology to estimate PM2.5 on the ground [12]. Similarly, Martin et al. came up with a way of using satellite remote sensing technology to estimate some ground air pollutants, including CO, NO, SO2 and so on [13]. Pawan et al. used this technology to evaluate the air conditions of cities [14]. These methods estimate the concentrations of certain air pollutants by analyzing the images obtained by satellites. However, many air quality managers are not yet taking full advantage of satellite data for their applications because of the challenges associated with accessing, processing, and properly interpreting observational data. That is, a certain degree of technical skill is required on the part of the data end-user, which is often problematic for organizations with limited resources [15].
Sensor networks have also been studied extensively because of their broad applicability and enormous application potential in areas such as environmental monitoring. A Wireless Sensor Network Air Pollution Monitor System (WAPMS) was deployed on the island of Mauritius for monitoring air quality [16]; distributed infrastructure-based wireless sensor networks and grid computing are also used for monitoring the air quality of London [17]. Rajasegarar et al. also used wireless sensor networks to monitor air pollutants [18]. However, sensor networks require a large number of sensor devices, and can only be deployed over a small range, such as indoors and in small areas. For a city and other large areas, cheap single-function sensors cannot provide information about all kinds of air pollutants. With multi-function sensors comparable to monitoring stations, infrastructure construction and maintenance costs make it difficult to deploy wireless sensor networks widely. This is the same reason that limits the number of stations in Chinese cities.
Besides all the methods above, participatory sensing is also an important approach for air quality prediction. With the popularity of smart devices, participatory sensing and crowdsourcing have been hot topics of discussion in recent years. A personalized mobile sensing system (MAQS) was proposed for indoor air quality monitoring [19]; a system based on smartphones and monitoring sensors has also been used to monitor outdoor air quality [20]; noise pollution is also monitored using mobile phones [21]. Sivaraman et al. used a participatory sensor system to monitor air pollutants in Sydney (Australia) [22]. However, most current smartphones do not carry air pollutant sensors, so such systems need external sensing modules, which leads to extra costs. Besides the high expense, user participation and the accuracy of the data are problems that remain to be solved.
Recently, urban computing has been one of the ways to solve problems in cities. Yuan Jing et al. proposed an algorithm to infer the functional areas of cities by using trajectories [23]; Zheng et al. made use of the city daily data to infer urban air quality [24,25]. However, similarly, urban computing also requires pre-installed urban sensors such as GPS devices. For instance, when inferring the air quality, Zheng made use of months of data collected from the GPS installed in taxis in Beijing. This is an important limitation that prevents the promotion of this approach because in most cities we cannot access the GPS information of taxis. Spatiotemporal data analysis is also an important aspect for air quality prediction. Chen et al. established a spatiotemporal data framework named BigSmog to provide China smog analysis [26]. Zhu et al. proposed Granger-causality-based air quality estimation with heterogeneous spatiotemporal data [27]. Some other studies [28,29] also analyzed spatiotemporal data to generate air pollutant distributions.

Air Quality Index
An air quality index (AQI) is a number used by government agencies to communicate to the public how polluted the air is currently or how polluted it is forecasted to become [30]. As the AQI increases, an increasingly large percentage of the population is likely to be exposed, and people might experience increasingly severe health effects. Different countries have their own air quality indices, corresponding to different national air quality standards. In this paper, we use the standard of China, where the AQI is based on the levels of six atmospheric pollutants, namely sulfur dioxide (SO2), nitrogen dioxide (NO2), suspended particulates smaller than 10 µm in aerodynamic diameter (PM10), suspended particulates smaller than 2.5 µm in aerodynamic diameter (PM2.5), carbon monoxide (CO), and ozone (O3), measured at the monitoring stations throughout each city [31]. The AQI value is calculated per hour according to a formula published by China's Ministry of Environmental Protection [31]. The AQI is the maximum of the individual indices $I_{AQI_p}$, each of which is a reference value for one air pollutant $p$:

$$AQI = \max\{I_{AQI_1}, I_{AQI_2}, I_{AQI_3}, \ldots, I_{AQI_n}\} \qquad (1)$$

$$I_{AQI_p} = \frac{I_{AQI_{Hi}} - I_{AQI_{Lo}}}{BP_{Hi} - BP_{Lo}} \left(C_p - BP_{Lo}\right) + I_{AQI_{Lo}} \qquad (2)$$

where $C_p$ is the mass concentration of air pollutant $p$, $BP_{Hi}$ and $BP_{Lo}$ are the high and low values of the concentration limit, which can be looked up in the reference table in [31], and $I_{AQI_{Hi}}$ and $I_{AQI_{Lo}}$ are the values corresponding to $BP_{Hi}$ and $BP_{Lo}$ in the same reference table. Table 1 shows the relationship between AQI values and air pollution levels, which are marked by different colors. In this way, air quality prediction can be treated as a classification problem, so that we only need to match the air quality index to one of the six AQI levels in Table 1.
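The index calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the PM2.5 breakpoint list and the level boundaries are assumptions that should be checked against the reference table in [31] and against Table 1, and the function names are hypothetical.

```python
from bisect import bisect_left

# Index values that correspond to the concentration breakpoints.
IAQI_BREAKPOINTS = [0, 50, 100, 150, 200, 300, 400, 500]

# Assumed 24-hour PM2.5 concentration breakpoints in ug/m3; verify against [31].
PM25_BREAKPOINTS = [0, 35, 75, 115, 150, 250, 350, 500]


def iaqi(concentration, bp):
    """Linearly interpolate the individual index of one pollutant."""
    if not bp[0] <= concentration <= bp[-1]:
        raise ValueError("concentration outside the breakpoint table")
    # Locate the segment [BP_Lo, BP_Hi] that contains the concentration.
    hi = max(1, bisect_left(bp, concentration))
    lo = hi - 1
    slope = (IAQI_BREAKPOINTS[hi] - IAQI_BREAKPOINTS[lo]) / (bp[hi] - bp[lo])
    return slope * (concentration - bp[lo]) + IAQI_BREAKPOINTS[lo]


def aqi(iaqi_values):
    """The AQI is the maximum of the individual pollutant indices."""
    return max(iaqi_values)


def aqi_level(aqi_value):
    """Map an AQI value to one of six levels (1 = best, 6 = worst).

    The boundaries 50/100/150/200/300 are assumed here; check Table 1.
    """
    upper_bounds = [50, 100, 150, 200, 300]
    return bisect_left(upper_bounds, aqi_value) + 1
```

For instance, `aqi_level(408)` falls in the sixth level under these assumed boundaries, which is how a regression-style index becomes a class label for the classifier.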

Traffic Congestion Status
Traffic Congestion Status (TCS) describes the traffic conditions on a certain road. Different colors denote different levels of congestion. Figure 3 shows an example of a TCS graph.

Point of Interest
A point of interest, or POI, is a specific location that someone may be interested in. For example, the restaurants and shopping malls around us are POIs. Figure 4 presents the restaurant locations around Sanhao Street of Shenyang on Google Maps.

Problem Formulation
This paper uses urban sensing data to solve the problem of air quality inference, that is, to infer the unknown air quality of areas from several kinds of data. These data relate either to the sources of air pollution, such as traffic emissions and point of interest distribution, or to its results, such as the air quality index, so establishing the relationship between these data and air quality is the key to this kind of approach. The RAQ algorithm collects several kinds of related data including air monitoring station data (AQI), meteorology data (MD), traffic (TCS), road information (RI) and POI data. All these data are fetched at intervals of one hour. We divide the city into grids (G) and each grid is regarded as one unit. The grids with air quality monitoring stations (G1) generate data with an AQI label, while the grids without stations (G2) generate the data used for prediction. Data from G1 are used for training our learning model and data from G2 are input into the model to generate the prediction value. The only difference between the data from G1 and G2 is that the data from G1 are labeled with an AQI value. The results are given as AQI levels. If the actual value from a monitoring station belongs to the predicted AQI level, the prediction is right; otherwise the prediction is wrong.
This problem can be formulated as follows: given a collection of grids $G = G_1 \cup G_2$ ($|G_1| \ll |G_2|$), where $g_1.AQI$ ($g_1 \in G_1$) is known and $g_2.AQI$ ($g_2 \in G_2$) is unknown, and $g.MD$, $g.TCS$, $g.RI$ and $g.POI$ are known ($g \in G$), RAQ aims to predict $g_2.AQI$ at intervals of one hour.
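The formulation above can be pictured as a simple data structure. The following sketch is only an illustration under assumed names (`Grid`, `split_grids` and the field names are hypothetical, not taken from the paper): each grid carries its hourly features, and only grids with a monitoring station carry a label.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Grid:
    """One city grid cell for one hour; field names are illustrative only."""
    row: int
    col: int
    features: List[float]            # MD, TCS, RI and POI features
    aqi_level: Optional[int] = None  # known only where a station exists


def split_grids(grids):
    """Return (G1, G2): labeled grids for training, unlabeled ones for prediction."""
    g1 = [g for g in grids if g.aqi_level is not None]
    g2 = [g for g in grids if g.aqi_level is None]
    return g1, g2
```

Because stations are scarce, G1 is expected to be far smaller than G2, which is why the model is trained on a handful of labeled grids and applied to all the others.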

RAQ Algorithm
In the RAQ algorithm, all data are collected from the urban sensing system, including air monitoring station data, meteorology data, traffic data, road information and POI data, and the necessary features are extracted from these heterogeneous data. These features are among the most common data in city life. Traffic-related sources like vehicle emissions and POIs like factories are the main sources of air pollutants [3]. Meteorological conditions are the main driver of the dispersion of air pollutants [3]. Together, these data represent the air quality situation well. The training dataset includes all the necessary features and is divided into subsets using bootstrap sampling. Figure 5 shows the structure of the dataset. A decision tree is constructed on each subset, and the classification is done by aggregating the results generated by all decision trees. Figure 6 shows the procedure of the RAQ algorithm.

Meteorology Data
Meteorology data such as temperature and humidity are very important factors that strongly affect the concentration and spread of air pollutants. Understanding the behavior of meteorological parameters in the planetary boundary layer is important because the atmosphere is the medium in which air pollutants are transported away from the source, and this transport is governed by meteorological parameters such as atmospheric wind speed, wind direction, and temperature [32]. In this paper, we use weather monitoring stations as one part of the urban sensing system. Considering the accessibility of the data, we use the following meteorology data features: temperature ($F_{mt}$, °C), humidity ($F_{mh}$, %), barometric pressure ($F_{mp}$, mmHg), wind speed ($F_{mw}$, m/s) and visibility ($F_{mv}$, m).

Traffic and Road Data
Traffic is one of the most important factors that affect air quality. Figure 3 is a sample of the original data available from map service providers. In this paper, we rely on two important characteristics of traffic: road length ($F_{rl}$) and traffic congestion status ($F_{tcs}$). If a road is very long and traffic congestion is relatively light, exhaust gas emissions can still be at a high level because of the total number of vehicles on the road. Emissions can be similarly high if a road is short but heavily congested. However, we do not have a method or accurate data to quantify these two characteristics directly. Most map service providers offer online maps and real-time traffic status. They do not publish application programming interfaces (APIs) for third-party developers to access these data, but we can still get some useful hints by analyzing the HTTP requests issued by the web map. Essentially, these data are collected from GPS equipment installed in cars or from speed measurement sensors, and they form another important part of the urban sensing system. Figure 7 shows the HTTP request records of a typical Baidu map when the traffic widget is invoked.

A picture is composed of pixels, so it can be digitized into a matrix. We use the distribution of colored pixels to represent road length and congestion status. For each tile grid, we count the number of pixels that represent roads: the larger the number of pixels, the greater the road length in that tile grid.
As shown in Figure 3, traffic congestion status is denoted by different colors (green, orange and red) in the pixels that represent roads. According to the traffic volume at each congestion level, weights of 1, 2 and 5 are assigned to the numbers of green, orange and red pixels, respectively. In Figure 8, the weighted TCS value is calculated by the formula a + 2b + 5c, where a is the number of green pixels, b is the number of orange pixels and c is the number of red pixels.
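The pixel-counting step can be sketched as follows. This is an illustration only: the exact RGB values of the congestion colors are assumptions (real map tiles may use other shades and would need a tolerance-based match), the tile is represented as nested lists of RGB tuples rather than a decoded image, and treating the count of colored pixels as the road-length proxy is one reading of the description above.

```python
# Assumed RGB values for the three congestion colors (hypothetical).
GREEN = (0, 255, 0)
ORANGE = (255, 165, 0)
RED = (255, 0, 0)

# Weights from the text: 1 for green, 2 for orange, 5 for red.
WEIGHTS = {GREEN: 1, ORANGE: 2, RED: 5}


def road_and_tcs(tile):
    """Return (road pixel count, weighted TCS) for a tile of RGB tuples."""
    counts = {GREEN: 0, ORANGE: 0, RED: 0}
    for row in tile:
        for pixel in row:
            if pixel in counts:
                counts[pixel] += 1
    # Number of colored road pixels, used as a proxy for road length.
    road_length = sum(counts.values())
    # Weighted congestion value a + 2b + 5c.
    weighted_tcs = sum(WEIGHTS[color] * n for color, n in counts.items())
    return road_length, weighted_tcs
```

A tile with two green pixels, one orange pixel and one red pixel therefore yields a road length of 4 and a weighted TCS of 2 + 2 + 5 = 9.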

POI Data
The category of POIs and their density in a region indicate the land use and the function of the region as well as the traffic patterns in the region, therefore contributing to the air quality inference of the region [24]. For example, shopping streets are more likely to gather more people than parks, so there will be more human-related air pollution sources like vehicles. Schools usually have more green areas than factories, so there are more plants to absorb air pollutants. Therefore, POI distribution has a strong effect on air quality. These data also reflect the significance of human activities in urban sensing systems. In this paper, the number of POIs is counted in each tile grid. According to the search results of Baidu Maps and Google Maps, the majority of POIs are divided into ten categories. Table 2 shows the categories and Figure 9 presents the number of POIs ($F_{pn}$) in each category.
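Counting POIs per tile grid and category is a small bookkeeping task, sketched below. The grid identifiers and category names are hypothetical placeholders for illustration; the paper's ten categories are those listed in Table 2.

```python
from collections import Counter, defaultdict


def count_pois(pois):
    """Count POIs per grid and category.

    `pois` is an iterable of (grid_id, category) pairs; both values are
    hypothetical identifiers chosen for this sketch.
    """
    counts = defaultdict(Counter)
    for grid_id, category in pois:
        counts[grid_id][category] += 1
    return counts
```

Each grid's counter can then be flattened into a fixed-length vector, one entry per category, to serve as the POI features of that grid.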

Random Forest Classification
The Random Forest is a general term for ensemble methods using tree-type classifiers $\{h(x, \theta_k),\ k = 1, \ldots\}$, where the $\{\theta_k\}$ are independent identically distributed random vectors, $x$ is an input pattern and $h(x, \theta_k)$ is a generated classifier [33]. It uses recursive partitioning to generate many trees and then aggregates the results. Each tree is independently constructed using a bootstrap sample of the training data; it first subdivides the parameter set into several parts depending on one of the parameters, and subsequently repeats the process for each part.

Bootstrap Aggregating (Bagging)
Usually only a single dataset is available for training. A simple method is to divide the dataset into non-overlapping subsets and construct the trees independently. However, this requires a huge amount of data, which cannot always be guaranteed. A better way is to sample the original dataset with replacement a certain number of times to produce a bootstrap sample. This method ensures that the distribution of each sample is statistically close to that of the original dataset [34]. There are n records in the original dataset, so the probability of selecting each record in one draw is always 1/n. The probability of not selecting a certain record in one draw is $(1 - 1/n)$, which becomes $(1 - 1/n)^n$ after n draws.
As the sample size tends to infinity, this probability can be expressed as $\lim_{n \to \infty} (1 - 1/n)^n$, which is equal to $e^{-1}$. Therefore, the probability of selecting a given record at least once is $1 - e^{-1} \approx 0.632$, or roughly 2/3. Thus, each bootstrap sample contains about 2/3 of the original records for training.
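The limit above is easy to check empirically. The following sketch, which assumes nothing beyond Python's standard library (the function names are illustrative), draws one bootstrap sample from 10,000 records and compares the fraction of distinct records with $1 - e^{-1}$.

```python
import math
import random


def bootstrap_sample(records, rng):
    """Draw len(records) items uniformly with replacement."""
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]


def unique_fraction(n, rng):
    """Fraction of the n original records that appear in one bootstrap sample."""
    sample = bootstrap_sample(list(range(n)), rng)
    return len(set(sample)) / n


rng = random.Random(0)           # fixed seed so the run is repeatable
observed = unique_fraction(10_000, rng)
expected = 1 - math.exp(-1)      # limit of 1 - (1 - 1/n)^n, about 0.632
```

The observed fraction should sit close to 0.632; the records that are left out of a given sample are what random forest implementations commonly use as out-of-bag data for error estimation.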

Tree Growing and Splitting
A decision tree starts with one root node. In the subsequent process, the samples are split into different subspaces using one of the features derived from the monitoring station data (AQI), meteorology data (MD), traffic (TCS), road information (RI) and POI data. Therefore, how to select the feature at each split is of great significance for the performance of a decision tree. Information gain [35] is usually used as the splitting criterion.
The feature selection for each bootstrap sample is randomized. According to bagging theory, a random forest is a strong classifier built from multiple weak classifiers. Therefore, both the number of records and the number of features of each subset are smaller than those of the original dataset. We need T subsets with m features each. According to Breiman's suggestions [33], m is much less than the total number of features M. Breiman suggests three possible values for m: $\frac{1}{2}\sqrt{M}$, $\sqrt{M}$ and $2\sqrt{M}$. In the evaluation section, we show that four features and 400 subsets work best for our model and dataset.

When splitting the dataset, the entropy is calculated for each feature candidate as in Equation (3):

$$H(f) = -\sum_{i=1}^{k} p(c_i) \log p(c_i) \qquad (3)$$

where $c_i$ is the AQI level $i$ specified in Table 1 and $k$ is the number of AQI levels. The probability $p(c_i)$ is calculated through Equation (4), where $N_i$ is the number of records in AQI level $i$:

$$p(c_i) = \frac{N_i}{\sum_{j=1}^{k} N_j} \qquad (4)$$

The information gain is then defined as shown in Equation (5):

$$Gain(f_i) = H(f_i) - \sum_{j=1}^{w} \frac{|f_{ij}|}{|f_i|} H(f_{ij}) \qquad (5)$$

where $f_i$ represents the records of the ith level of the tree, $f_{ij}$ are the records in the jth node of the ith level, and $w$ is the number of nodes in this level.

The splitting process stops when (a) the number of records in a node falls below a user-defined threshold, or (b) the node is pure, meaning that all its records fall into one class. For a terminal node that still contains records of several classes, the percentages of the different classes are calculated and the predicted class is defined as in Equation (6):

$$c = \arg\max_{c_i} p(c_i) \qquad (6)$$
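The splitting criterion can be illustrated with a short, self-contained sketch. It is not the authors' code: it uses the textbook definitions of Shannon entropy and information gain, picks base-2 logarithms as an arbitrary choice, and splits on the distinct values of a feature, which assumes discretized features (continuous features would need threshold search instead). All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of AQI levels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its child nodes."""
    total = len(parent)
    remainder = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder


def best_feature(rows, labels, candidates):
    """Return the candidate feature index whose split yields the largest gain."""
    def gain_for(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        return information_gain(labels, list(groups.values()))
    return max(candidates, key=gain_for)
```

In a random forest, `candidates` would be the m randomly chosen feature indices for the current tree rather than the full feature set, which is what keeps the individual trees decorrelated.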

Prediction
After all the trees are constructed, the unlabeled data are input into all decision trees. For each tree, $p(c_i)$ is the estimated probability of AQI level i. The final probability $p'(c_i)$ of AQI level i in the random forest is defined in Equation (7), where T is the number of decision trees as mentioned before. The final result is determined by Equation (8). The pseudocode of the RAQ algorithm is described in Algorithm 1, which includes the following steps: choose the feature with maximum gain to split the dataset in the node; remove the used feature from the feature candidates; input unlabeled data into the trees; get the predicted AQI level according to Equations (5) and (6).
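The aggregation step can be sketched as follows. This is an illustration under two assumptions that the text does not spell out: the forest probability is the plain average of the per-tree probabilities over the T trees, and the final result is the AQI level with the highest averaged probability. The dictionary representation of a tree's output is hypothetical.

```python
from typing import Dict, List


def forest_probabilities(tree_probs: List[Dict[int, float]]) -> Dict[int, float]:
    """Average the per-tree probabilities p(c_i) over all T trees."""
    if not tree_probs:
        raise ValueError("at least one tree is required")
    num_trees = len(tree_probs)
    levels = set().union(*(p.keys() for p in tree_probs))
    return {
        level: sum(p.get(level, 0.0) for p in tree_probs) / num_trees
        for level in levels
    }


def predict_level(tree_probs: List[Dict[int, float]]) -> int:
    """Return the AQI level with the highest averaged probability."""
    probs = forest_probabilities(tree_probs)
    # Iterating in sorted order makes ties resolve to the lowest level
    return max(sorted(probs), key=probs.get)
```

For example, three trees that output {1: 0.7, 2: 0.3}, {1: 0.4, 2: 0.6} and {1: 0.6, 2: 0.4} give level 1 the higher average, so level 1 is predicted.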

Dataset
In the experiments, one month of data, from 4 May 2015 to 5 June 2015, is collected, and the following four datasets for Shenyang are used, all of which are publicly available. In the testing period, we use a total of 2701 records to test the algorithm, and Shenyang is divided into 1258 grids arranged in 34 rows and 37 columns. Because all the grids belong to the main city area, all data for these grids, including meteorology data, traffic data, road information and POI data, are accessible from our data sources. Air quality data are accessible in the areas covered by air quality monitoring stations.
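As an illustration of the grid layout, the sketch below maps a coordinate to one of the 34 x 37 = 1258 grids. Only the grid dimensions come from the text; the bounding box, the row-major numbering and the function name are assumptions made for the example, and the caller must supply the actual extent of the study area.

```python
from typing import Optional

ROWS, COLS = 34, 37  # grid dimensions stated in the text (34 * 37 = 1258)


def grid_index(lat: float, lon: float,
               south: float, west: float,
               north: float, east: float) -> Optional[int]:
    """Return the row-major grid index of a point, or None if it lies outside the box.

    The bounding box (south, west, north, east) is a hypothetical input;
    the paper does not give the coordinates of the study area.
    """
    if not (south <= lat < north and west <= lon < east):
        return None
    row = min(ROWS - 1, int((lat - south) / (north - south) * ROWS))
    col = min(COLS - 1, int((lon - west) / (east - west) * COLS))
    return row * COLS + col
```

With this numbering the indices run from 0 to 1257, so each record can be keyed by a single integer when the datasets are joined.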

Monitoring Station Data
The air quality information from the Shenyang monitoring stations includes the AQI, the concentrations of CO, NO2, SO2, O3, PM10 and PM2.5, and a timestamp. Table 3 shows the format of the monitoring station data. Table 4 shows the locations of all the monitoring stations. All the data are collected from the public website [36], whose data are produced by the National Department of Environmental Protection. We use the Java programming language to access the API hourly and store all the data in a MySQL database.
We collect meteorological data, including temperature, humidity, barometric pressure, wind speed and visibility, from the public website [37]. As Table 5 illustrates, the data format consists of temperature ($F_{mt}$), humidity ($F_{mh}$), barometric pressure ($F_{mp}$), wind speed ($F_{mw}$) and visibility ($F_{mv}$).
No public websites offer statistical road and traffic data, so we cannot directly obtain formatted data. However, most map service providers offer online maps and real-time traffic status. They do not publish public APIs for third-party developers to access these data, but we can still extract useful information by analyzing the HTTP requests made by the map web pages. From the map service providers [38,39], we collect the traffic map tiles every hour.

POI
Thanks to the Baidu Map and Google Maps services, we can easily get POI data from a public interface. Each POI record contains a name, latitude, longitude, tag and the tile grid in which it is located. Figure 10 shows about 28,000 records in the MySQL database.

Evaluation Method
The most accurate criterion for measuring air quality is the air quality information from monitoring stations. In this experiment, we use the AQI data from monitoring stations as the reference standard. To construct a random forest, we need to determine two parameters: the number of trees and the number of features used to construct each tree. To choose the best parameters, we use the OOB (out-of-bag) error [33] to compare RAQ accuracy across different parameter pairs <#features, #trees>, where #features is the number of features used to construct each tree and #trees is the number of trees constructed in the random forest. In random forests, the error is estimated internally during the construction of the trees. Each tree is constructed using a different bootstrap sample from the original data, and about one-third of the cases are left out of each bootstrap sample. These left-out cases are used as test cases: each is input into the tree to obtain its classification. At the end of the run, take j to be the class that got most of the votes every time case n was OOB [40]. The proportion of times that j is not equal to the true class of n, averaged over all cases, is the OOB error estimate [40]. The smaller the OOB error, the higher the accuracy of the model. For the number of features, we increase the value by one each time from 2 to 8 (the total number of features specified in the algorithm is 8). For the number of trees, we increase the value by 100 each time from 100 to 1000. Because more trees take more time, we do not consider forests with more than 1000 trees, and a step of 100 is a suitable balance between running time and accuracy. To compare this algorithm with others, we use cross-validation to judge the performance.
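The OOB error estimate described above can be sketched as a vote over the trees that did not see a given case. This is a minimal illustration, not the implementation used in the experiments: the inputs (per-tree predictions and in-bag flags) are hypothetical data structures, and cases that were in every bootstrap sample are skipped.

```python
from collections import Counter
from typing import Sequence


def oob_error(true_labels: Sequence[int],
              tree_predictions: Sequence[Sequence[int]],
              in_bag: Sequence[Sequence[bool]]) -> float:
    """OOB error: for each case, majority vote over the trees whose bootstrap
    sample did not contain it, then the share of cases where the vote is wrong.

    tree_predictions[t][n] is tree t's predicted class for case n;
    in_bag[t][n] is True when case n was in tree t's bootstrap sample.
    """
    wrong = 0
    counted = 0
    for n, truth in enumerate(true_labels):
        votes = Counter(
            preds[n]
            for preds, bag in zip(tree_predictions, in_bag)
            if not bag[n]
        )
        if not votes:
            continue  # case n was never out-of-bag, so it has no OOB vote
        counted += 1
        if votes.most_common(1)[0][0] != truth:
            wrong += 1
    return wrong / counted if counted else float("nan")
```

Because every case is scored only by trees that were not trained on it, the estimate needs no separate held-out set.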

Effects of Parameters on Prediction Error Rate
Two important factors affect the performance of a random forest: the number of trees and the number of features. Figure 11 shows how the OOB error changes with the number of features and trees. The X-axis is the number of features and the Y-axis is the number of trees. In our experiment, the number of features takes integer values and the number of trees takes multiples of 100, so only discrete coordinates such as (2, 100) and (3, 200) are meaningful in this graph. Different colors represent different OOB error values: the darker the color, the smaller the OOB error. As the graph shows, the OOB error is lowest for the parameter pairs <4, 400> and <6, 1000>. To reduce time consumption, we choose <4, 400> as the best parameter pair.
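The parameter search can be sketched as a scan over the discrete grid of pairs. The sketch assumes that the OOB error of each pair has already been measured and stored in a dictionary, and that ties are broken in favor of fewer trees, which mirrors the time-consumption argument above. The error values in the usage note are placeholders, not measured results.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (#features, #trees)


def parameter_grid() -> List[Pair]:
    """All pairs with 2..8 features and 100..1000 trees in steps of 100."""
    return [(m, t) for m in range(2, 9) for t in range(100, 1001, 100)]


def best_pair(oob_errors: Dict[Pair, float]) -> Pair:
    """Pair with the lowest OOB error; ties go to fewer trees, then fewer features."""
    return min(oob_errors, key=lambda pair: (oob_errors[pair], pair[1], pair[0]))
```

With placeholder errors such as `{(2, 100): 0.3, (4, 400): 0.2, (6, 1000): 0.2}`, `best_pair` returns `(4, 400)`, because the two lowest entries tie and the smaller forest is preferred.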

Comparison
For the comparison tests, Naïve Bayes, Logistic Regression, Single Decision Tree and ANN are chosen. We use Weka [41] to conduct all the comparison tests. For Naïve Bayes, there are eight features, $F_{mt}$, $F_{mh}$, $F_{mp}$, $F_{mw}$, $F_{mv}$, $F_{ri}$, $F_{tcs}$ and $F_{pn}$, and six classification categories (C), which are specified in Table 1. In Weka, this algorithm is denoted as weka.classifiers.bayes.NaiveBayesMultinomial. For Logistic Regression, we choose Multinomial Logistic Regression because there are multiple AQI levels. In Weka, this algorithm is denoted as weka.classifiers.functions.Logistic. For Single Decision Tree, we use all the features to construct one single tree for classification. In Weka, this algorithm is denoted as weka.classifiers.trees.REPTree. For ANN, we choose a back-propagation neural network with one hidden layer for its simplicity and generality. In Weka, this algorithm is denoted as weka.classifiers.functions.MultilayerPerceptron.
After implementing the different algorithms, the tests are carried out. Table 6 shows the results of the test cases, in which Y is the number of correct predictions and N is the number of incorrect predictions. The precision is calculated as Y/(Y + N). Figure 12 illustrates how the prediction precision changes as the data size changes. The figure shows that RAQ performs steadily, even when the data size is relatively small, while the other algorithms are less accurate at every data size.

Besides the precision measurement, we also refer to other measurements, including recall, F-score, relative absolute error (RAE) and receiver operating characteristic (ROC). Recall (also known as sensitivity) is the number of instances correctly classified as a given class divided by the actual total in that class. F-score is a combined measure of precision and recall, calculated as 2 · Precision · Recall/(Precision + Recall). Relative absolute error is calculated by the following formula:
$$\mathrm{RAE} = \frac{\sum_{i=1}^{N} \left| \theta_i - r_i \right|}{\sum_{i=1}^{N} \left| \bar{\theta} - r_i \right|}$$
where $\theta_i$ is the estimated value, $r_i$ is the real value, $\bar{\theta}$ is the average value and N is the number of test cases. ROC shows how the number of correctly classified positive examples varies with the number of incorrectly classified negative examples [42].
Based on our dataset, these measurements also show that RAQ performs better than the other algorithms on this specific problem. Table 7 shows the original data of the experimental results and Figure 13 illustrates these data in chart form.
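The measurements discussed in this section can be computed as in the sketch below. It is illustrative only: the function names are not from the paper, per-class precision and recall follow the usual true-positive definitions, and the RAE baseline is assumed to be the mean of the real values, since the text does not say which values are averaged.

```python
from typing import Sequence, Tuple


def overall_precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Y / (Y + N): share of test cases that are predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall_f(y_true: Sequence[int], y_pred: Sequence[int],
                       cls: int) -> Tuple[float, float, float]:
    """Per-class precision, recall and F-score for the AQI level `cls`."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def relative_absolute_error(estimates: Sequence[float],
                            actuals: Sequence[float]) -> float:
    """Total absolute error relative to always predicting the mean real value.

    Assumption: the baseline average is taken over the real values.
    """
    mean = sum(actuals) / len(actuals)
    numerator = sum(abs(e - a) for e, a in zip(estimates, actuals))
    denominator = sum(abs(mean - a) for a in actuals)
    return numerator / denominator
```

An RAE below 1 means the estimates are closer to the real values than the constant mean baseline would be.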

Conclusions
In this paper, using public data in the urban sensing system, our model predicts the AQI of all the regions in Shenyang based on the AQI published by 11 air quality monitoring stations, meteorology data reported by weather stations, road information and real-time traffic status collected from Baidu Map and Google Maps, and the POI distributions provided by Baidu Map and Google Maps. We use a random forest algorithm to predict the AQI of all the regions in the downtown area that are not covered by monitoring stations. In Shenyang, this algorithm achieves an overall precision of 81% for AQI prediction, outperforming Naïve Bayes, Logistic Regression, single decision tree and ANN. All of these data are directly or indirectly available on the Internet, which suggests that the algorithm could easily be applied to other cities. RAQ makes use of historical data for model training but ignores real-time data. Our work will be extended to support online learning so that daily data can be used to improve the performance of the air quality prediction algorithm.