Study of the Effects of Air Pollutants on Human Health Based on Baidu Indices of Disease Symptoms and Air Quality Monitoring Data in Beijing, China

There is an increasing body of evidence showing the impact of air pollutants on human health such as on the respiratory, and cardio- and cerebrovascular systems. In China, as people begin to pay more attention to air quality, recent research focused on the quantitative assessment of the effects of air pollutants on human health. To assess the health effects of air pollutants and to construct an indicator placing emphasis on health impact, a generalized additive model was selected to assess the health burden caused by air pollution. We obtained Baidu indices (an evaluation indicator launched by Baidu Corporation to reflect the search popularity of keywords from its search engine) to assess daily query frequencies of 25 keywords considered associated with air pollution-related diseases. Moreover, we also calculated the daily concentrations of major air pollutants (including PM10, PM2.5, SO2, O3, NO2, and CO) and the daily air quality index (AQI) values, and three meteorological factors: daily mean wind level, daily mean air temperature, and daily mean relative humidity. These data cover the area of Beijing from 1 March 2015 to 30 April 2017. Through the analysis, we produced the relative risks (RRs) of the six main air pollutants for respiratory, and cardio- and cerebrovascular diseases. The results showed that O3 and NO2 have the highest health impact, followed by PM10 and PM2.5. The effects of any pollutant on cardiovascular diseases was consistently higher than on respiratory diseases. Furthermore, we evaluated the currently used AQI in China and proposed an RR-based index (health AQI, HAQI) that is intended for better indicating the effects of air pollutants on respiratory, and cardio- and cerebrovascular diseases than AQI. A higher Pearson correlation coefficient between HAQI and RRTotal than that between AQI and RRTotal endorsed our efforts.

An increasing body of evidence shows that air pollutants can significantly reduce lung functions and human body immune function and increase the prevalence of malignant tumors [16][17][18][19][20].
The acute health effects of short-term exposure to air pollutants were mainly studied using time-series studies, case-crossover studies, and panel studies [21]. There are some approaches to the

Study Area
This study was conducted in Beijing, the capital of China, where air pollution is severe. This area covers 16,410 km 2 , with a residential population of 21.7 million in 2016. The Gross Domestic Product (GDP) of the study area amounts to 3.35% of the whole country.
Unlike other cities where PM 2.5 is high only in the winter, in Beijing, air pollution is high in both the autumn and winter [36]. According to the long-term observations performed by the Chinese Academy of Sciences [37], the average annual value of PM 2. 5  This study was conducted in Beijing, the capital of China, where air pollution is severe. This area covers 16,410 km 2 , with a residential population of 21.7 million in 2016. The Gross Domestic Product (GDP) of the study area amounts to 3.35% of the whole country.
Unlike other cities where PM2.5 is high only in the winter, in Beijing, air pollution is high in both the autumn and winter [36]. According to the long-term observations performed by the Chinese Academy of Sciences [37], the average annual value of PM2.5 in Beijing was 92.7 μg/m 3  There are 1436 monitoring stations of air quality in total all over China with data available from the Ministry of Environmental Protection of the People's Republic of China (MEP) website (http://datacenter.mep.gov.cn/). Twelve of these stations are located in Beijing. The study area, its location in China, and the monitoring stations that provided pollutant data for our study are shown in Figure 1.

Data
There were three main types of data collected and preprocessed in our study: the AQI was considered as the causing factor, the Baidu index was taken as the effect factor, and meteorological observations were used as measures of condition factors.

Meteorological Observations
We obtained hourly series of three meteorological factors (mean wind level, mean air temperature, and mean relative humidity) during 1 March 2015-30 April 2017 from meteorological stations located in Beijing City. These meteorological factors are presumed to have control over the effects of air pollution on health. The missing data were filled with interpolated values from their neighbor observations (if any) on the same day through linear interpolation.

Air Quality Data
We downloaded hourly records of AQI and six individual pollutant concentrations: PM10, PM2.5, SO2, O3, NO2, and CO during 1 March 2015-30 April 2017 from the MEP website. The missing data

Data
There were three main types of data collected and preprocessed in our study: the AQI was considered as the causing factor, the Baidu index was taken as the effect factor, and meteorological observations were used as measures of condition factors.

Meteorological Observations
We obtained hourly series of three meteorological factors (mean wind level, mean air temperature, and mean relative humidity) during 1 March 2015-30 April 2017 from meteorological stations located in Beijing City. These meteorological factors are presumed to have control over the effects of air pollution on health. The missing data were filled with interpolated values from their neighbor observations (if any) on the same day through linear interpolation.

Air Quality Data
We downloaded hourly records of AQI and six individual pollutant concentrations: PM 10 , PM 2.5 , SO 2 , O 3 , NO 2 , and CO during 1 March 2015-30 April 2017 from the MEP website. The missing data were filled with interpolated values from their neighbor observations (if any) on the same day through linear interpolation.
According to the algorithm provided by the MEP Data Center, the AQI takes the maximum of the six individual air quality index (IAQI) values (SO 2 , NO 2 , PM 10 , PM 2.5 , O 3 , and CO). The IAQI was calculated as follows [38]: where IAQI i represents the individual air quality index of the i-th pollutant; C i is the concentration of the i-th pollutant; BP Hi and BP Li are the high and low values of the pollutant concentration limits closest to C i ; and IAQI Hi and IAQI Li are the individual air quality indices corresponding to BP Hi and BP Li . Table 1, which is from the Chinese government's Ambient Air Quality Standard (GB3095-2012) [39], can be used to look up the values of IAQI Hi , IAQI Li , BP Hi , and BP Li .  0  0  0  0  0  0  0  50  50  40  50  2  100  35  100  150  80  150  4  160  75  150  475  180  250  14  215  115  200  800  280  350  24  265  150  300  1600  565  420  36  800  250  400  2100  750  500  48  1000  350  500  2620  940  600  60 1200 500

Baidu Indices
Baidu is the largest search engine portal in China with a daily search volume up to six billion. The Baidu index is an evaluation indicator launched by Baidu Corporation (www.baidu.com) to reflect the search popularity of keywords from the search engine. It analyzes and calculates the weighted sum of the search times of keywords of interest by the network users from the Baidu Portal. Many researchers in China used Baidu index data in their studies [28][29][30][31][32][33][34][35]40]. Based on the results of the literature survey, we present 37 keywords that are assumed to be associated with air pollution-related diseases [17][18][19]. We obtained their Baidu indices from https://index.baidu.com/, where the search area was set to "Beijing all cities", the time period was set to 1 March 1 2015-30 April 2017, and the type was the overall trend. As some of the keywords were not included in Baidu index, we ultimately obtained the Baidu indices for 25 keywords ( Table 2). The Baidu index only includes the keywords that have a relatively high search volume, and 12 keywords (the English counterparts of the 12 Chinese keywords not included were ischemic heart disease, hypertensive heart disease, shortness of breath, myocardial disease, pericardial disease, stridor, whistle, acute and chronic rheumatic heart disease, cerebral atherosclerotic infarction, low-birth-weight baby, pulmonary failure, and other types of heart disease) were not included in the Baidu index, indicating that they are rarely entered due to very little use. These keywords are either statistically insignificantly correlated with short-term air pollution, or they have more commonly used alternatives in the 25 keywords. In summary, not including the 12 keywords had little effect on our study and was statistically negligible.
We divided the 25 keywords into two categories: "respiratory system" and "cardio-and cerebrovascular", and then we added all the indices of each category together and obtained the respiratory total index and cardio-and cerebrovascular total index.

Correlation Analysis
Pearson correlation and coplot were used to measure the association between air quality data and the Baidu search indices. Pearson correlation is the most commonly used statistic to depict the linear relationship between two variables [41], while coplot is an exploratory graphical method to investigate the relationship between a pair of variables (Y1 and Y2) conditioned on a third variable (X) [42,43]. Here, it was used in the exploration of how the relationship between AQIs and Baidu index varied across meteorological factors.

Pearson Correlation
For two variables, say, X and Y, which have observations [x n ] and [y n ], respectively, the Pearson correlation coefficient is defined as where Cov(X, Y) represents the covariance of X,Y; µ X and µ Y represent the means of X and Y; E[·] expresses a mathematical expectation; and σ X and σ Y are the standard deviations of X and Y, respectively. In practical use, the formula for calculating the Pearson correlation coefficient using the observations is where x(y) and s x (s y ) are the mean and variance of [x n ]([y n ]), respectively; r xy is in the range of [−1, 1].

Coplot
A coplot is essentially a composition of multiple scatterplots of two variables Y1 and Y2 conditional on a third variable X. The observations of Y1 and Y2 are divided into multiple groups according to the value intervals of X and scatterplotted. The value intervals can overlap. There are identical numbers of observations for each scatterplot. Generally, the scatterplots are arranged in a matrix from left to right and from bottom to top, corresponding to the ordering of the value intervals, and there is an additional component that is called the "Given" panel, which shows the value intervals of X. Since each scatterplot has the same number of observation samples, the sampling errors are homogeneous for each scatterplot. Thus, conditional correlation analysis between two variables can be visualized clearly with a coplot.

Statistical Modeling
The impact of air pollution on health is complex (nonlinear) and there may be cross effects between different pollutants. Pollutant concentration is also closely associated with weather factors and time. Considering these situations, we selected a GAM approach to study the relationship between air pollution and the Baidu index. The GAM is a semiparametric expansion of the generalized linear model (GLM) [44], which assumes that the functions are additive and that the composition of the functions is smooth. The GAM can better analyze this relationship because it can use the nonparametric smooth spline functions to fit the curve flexibly [45][46][47][48]. The basic formula of GAM is as follows: where E(Y) is the expectation of the response variable Y, and G(•) is the link function, the selection of which depends on the probability distribution of the response variable. Gaussian distribution and Poisson distribution are the most commonly used link functions in real-world applications, while f i (x i ), i = 1, 2, 3, . . . , m represents the smooth functions of the m explanatory variables. More complex forms of GAM models also incorporate additional linear variables or dummy variables. Considering the adjustment of meteorological factors for the effects of air pollution on morbidity as done in some studies [3,5,26,48,49], we incorporated wind, temperature, and humidity into the exploratory variables. For time-series observations, it is common practice to extract long-term trends and changes in the cycle of working days. Because daily search counts typically follow a Poisson distribution, a GAM with log link and Poisson error, combined with the basic assumption that an increase in symptoms of the concerned diseases leads to a simultaneous increase in internet searches, is expected to reasonably associate air quality with smooth fluctuations in daily morbidity. This treatment is also consistent with several other time-series studies [50][51][52]. We specified the following GAM model formula: Y t~P oisson(λ t ) logλ t = Intercept + βAQI t + DOW + WIND + S(Time,k 1 ) + S(Temp,k 2 ) + S(Humi,k 3 ) where Y t denotes the individual or total Baidu index; AQI t represents the air quality index and β is the corresponding regression coefficient; S(•) represents the smoothing splines, while k 1 , k 2 , and k 3 are the degrees of freedom of smoothing splines; Time is the calendar time; DOW is the day of the week representing the dummy variable of Monday to Sunday; WIND is the daily mean wind speed level (also a dummy variable); Temp is the daily mean temperature; and Humi is the daily mean relative humidity.
We mainly optimized the model from two aspects: (1) identification of time lags, and (2) removal of the indices with weak correlations. On one hand, although we were examining the short-time effects of air pollutants, the effects may not appear simultaneously, instead showing a lag effect. We used the AQI to represent the total condition of air pollution, and took RTI and CTI as the response variables. We changed the delay of the time series of the two indices, and found time lags with the highest value of β. On the other hand, respiratory, and cardio-and cerebrovascular diseases are generally inextricably linked to air pollution, but each sub-index cannot be significantly associated with air pollution. We took 25 sub-indices as the response variables, and we fit each one to the exploratory variables; then, we eliminated those indices with low correlations (judged by β, R 2 , and deviance explained). As a result, we recalculated RTI and CTI.
In summary, we specified and fitted a GAM to obtain the estimated log-relative βs of AQI following the basic steps of GAM: (1) determining the explanatory variables, (2) determining the link function, (3) optimizing the model, and (4) evaluating the results.
We selected the open-source software R (x64 Ver3.4.0) to carry out the GAM analysis (mainly using the "mgcv" package) [53,54]. To facilitate the comparison with existing studies, the results were presented as the percent change in daily searches per 10 ug/m 3 increase in AQI (or IAQIs).

Relative Risk (RR)
In epidemiology, relative risk (RR) is expressed as the ratio of risk of the outcome in one group compared with another group. It is worth noting that the risk ratio is different from the odds ratio, even though the latter is often interpreted as if it were the risk ratio [55]. In this study, based on the exposure-response coefficient β obtained from the GAM model, we calculated the logarithm of relative risk change (LRR, the natural logarithm of the RR) when the pollutant concentration changed by one unit. The LRR was then used to quantitatively measure the risk. Furthermore, the inter-quartile range (IQR) of the pollutant concentration was defined as the unit concentration. According to the above definition, the calculation formula of RR was RR = exp(β × IQR); correspondingly, the 95% CIs of RR were calculated as exp((β ± 1.96 SE) × IQR) [56]. This implies that the percentage change in the Baidu index was (RR − 1) × 100% for an increase of one IQR unit in pollutant concentration. Therefore, when the pollutant concentration changed by 10%, the percentage of the change in Baidu index was ((10/IQR) × (RR − 1)) × 100%.

Health AQI
We assumed that there was a regressed RR for each pollutant according to the GAM. Referring to the work of References [57][58][59][60], the short-term total exposure risk of the day can be defined as where i = 1, . . . , 6 (the number of pollutants under consideration), RR i and IQR i represent the relative risk and inter-quartile range for pollutant I, respectively, and c i is the corresponding day-averaged concentration.
For convenience, we defined a pollutant sub-index (PSI) to reflect the contribution of individual pollutants to the overall risk.
where the subscript j refers to the j-th pollutant, c j refers to the corresponding day-averaged concentration, and a j is directly proportional to the incremental risk values (RR i − 1). We then defined a new AQI as HAQI = max(PSI j ).
This new AQI focuses on effects of air pollution on health; thus, we called it the health AQI (HAQI).

Data Exploration
The hourly meteorological observations and pollutant concentration records from 1 March 2015, 12:00 a.m. to 30 April 2017, 11:00 p.m. were obtained. The number of missing data points for hourly air quality indices and meteorological observations was 535 (accounting for 2.8%) and 2819 (accounting for 14.8%). Figure 2 shows the numbers of valid observations every day during the study period. It is shown that AQIs had at least six records on any day for all days, and meteorological observations had no records for a few days (four days). Figure 3 shows the daily series of air quality indices and meteorological factors. During the study period, there was a summation of 4,314,272 Baidu indices of the selected 25 keywords. The RTI was 2,781,456 and the CTI was 1,532,816. There were about 5447 Baidu indices per day on average during the 792 days. Approximately, the RTI accounted for 64.5%. Figure 4 shows the daily series of RTI and CTI. As seen, there were obvious outliers in the beginning of June 2016. After checking the original data, we found that the outliers were during 6-11 June 2016. Since we did not know what caused these outliers, we excluded these outliers. We linearly interpolated these hourly data for missing points on the same day, and then integrated them into daily series through averaging the observations and interpolated values of all days. If there were no observations in a day (12:00 a.m. to 11:00 p.m.), the day was marked as having no data.        PM2.5   Table 3 shows the summary statistics of the obtained daily AQIs and meteorological observations. We dropped the data of the four days and finally took the time series with 788 daily data points to build the GAM. The monthly total search index of 25 selected keywords, the monthly AQI and six pollutant concentrations, and the monthly meteorological observations (obtained through averaging the daily data) are plotted in Figure 5. Figure 5a shows that almost all the trends of AQIs were similar except for O3. The curves approximately indicate high values in the winter and low values in the summer (except for O3, which had the inverse change). Figure 5b shows that temperature, relative humidity, and wind level had trends with one-year cycles. Figure 5c shows the respiratory-related search indices and indicates that all curves could be clearly divided into two groups. Among them, "bronchitis", "asthma", "lung cancer", "pneumonia", "rheum", and "cough" had a higher search volume, and almost all of their curves had peaks in the winter and valleys in the summer except 15 Table 3 shows the summary statistics of the obtained daily AQIs and meteorological observations. We dropped the data of the four days and finally took the time series with 788 daily data points to build the GAM. The monthly total search index of 25 selected keywords, the monthly AQI and six pollutant concentrations, and the monthly meteorological observations (obtained through averaging the daily data) are plotted in Figure 5. Figure 5a shows that almost all the trends of AQIs were similar except for O 3 . The curves approximately indicate high values in the winter and low values in the summer (except for O 3 , which had the inverse change). Figure 5b shows that temperature, relative humidity, and wind level had trends with one-year cycles. Figure 5c shows the respiratory-related search indices and indicates that all curves could be clearly divided into two groups. Among them, "bronchitis", "asthma", "lung cancer", "pneumonia", "rheum", and "cough" had a higher search volume, and almost all of their curves had peaks in the winter and valleys in the summer except asthma, which had no clear cycle. Figure 5d shows the cardio-and cerebrovascular-related search indices. "Coronary" and "myocardial" had the highest search volume.
asthma, which had no clear cycle. Figure 5d shows the cardio-and cerebrovascular-related search indices. "Coronary" and "myocardial" had the highest search volume. From the Pearson correlation coefficients shown in Table 4, the air quality indices and RTI had relatively high values of r, and the corresponding p-values were less than 0.01, which indicates significant correlation. In contrast, a correlation between pollution indices and CTI was not obvious in terms of the values of r and p-values. In all air quality indices, O3, NO2, CO, and RTI had the highest correlations. It is worth noting that O3 had a significant negative correlation with RTI.   Table 4, the air quality indices and RTI had relatively high values of r, and the corresponding p-values were less than 0.01, which indicates significant correlation. In contrast, a correlation between pollution indices and CTI was not obvious in terms of the values of r and p-values. In all air quality indices, O 3 , NO 2 , CO, and RTI had the highest correlations. It is worth noting that O 3 had a significant negative correlation with RTI.  Figure 6 shows scatter plots of AQI and RTI conditional on several meteorological factors, showing which curves were fitted to indicate the trends more clearly. The individual panels should be viewed from left to right, and bottom to top. Taking Figure 6c as an example, the lower left is the AQI and RTI scatter plot corresponding to the wind level ranging from 0.5-2.5, while the lower right is that from 1.5-2.5, and the upper left is that from 1.5-5.5. Meteorological factors were segmented based on the same number of samples per segment. Figure 6a shows that, when there is a higher humidity (Humi), RTI increases with AQI increasing. Figure 6b shows that the lower the temperature (Temp) is, the greater the impact is of AQI on the RTI. Figure 6c shows that the smaller the wind level is, the greater the impact is of AQI on the RTI. Synthetically, Humi had the greatest impact on AQI and RTI. be viewed from left to right, and bottom to top. Taking Figure 6c as an example, the lower left is the AQI and RTI scatter plot corresponding to the wind level ranging from 0.5-2.5, while the lower right is that from 1.5-2.5, and the upper left is that from 1.5-5.5. Meteorological factors were segmented based on the same number of samples per segment. Figure 6a shows that, when there is a higher humidity (Humi), RTI increases with AQI increasing. Figure 6b shows that the lower the temperature (Temp) is, the greater the impact is of AQI on the RTI. Figure 6c shows that the smaller the wind level is, the greater the impact is of AQI on the RTI. Synthetically, Humi had the greatest impact on AQI and RTI. According to the coplot graphs, the same air quality index had a much higher correlation with RTI than with CTI given the same meteorological factors. The results were consistent with the results of the Pearson correlation coefficient. In addition, all groups generally showed similar trends: (1) when Humi increased, RTI increased faster with AQI increasing; (2) the lower the temperature was, the clearer the impact was of AQI on RTI; and (3) the smaller the wind level was, the greater the impact was of AQI on RTI. Among the three meteorological factors, the impact of Humi was the greatest.

Health Impact Evaluation
Because time itself has a high correlation with air quality, the time smoothing functions with a high degree of freedom are sensitive to the short-term air quality changes and may lead to overfitting. In order to explore the long-term trend of effects of time on Baidu indices, in this study, we confined the degree of freedom of time smoothing functions to 1-4, as commonly proposed in some studies [26,48,50,52]. Through graphically analyzing the time smoothing functions with those different degrees of freedom, we found that the time smoothing function with three degrees of freedom had the best consistency with the observation series. Furthermore, based on published literature [4,61], three degrees of freedom (whole period of study) for mean air temperature and mean relative humidity could control well for the meteorological effects on mortality and, thus, it was chosen to be used in our models. In summary, we finally took k1 = k2 = k3 = 3 in Equation (3) to obtain the estimated log-relative rate β. According to the coplot graphs, the same air quality index had a much higher correlation with RTI than with CTI given the same meteorological factors. The results were consistent with the results of the Pearson correlation coefficient. In addition, all groups generally showed similar trends: (1) when Humi increased, RTI increased faster with AQI increasing; (2) the lower the temperature was, the clearer the impact was of AQI on RTI; and (3) the smaller the wind level was, the greater the impact was of AQI on RTI. Among the three meteorological factors, the impact of Humi was the greatest.

Health Impact Evaluation
Because time itself has a high correlation with air quality, the time smoothing functions with a high degree of freedom are sensitive to the short-term air quality changes and may lead to overfitting. In order to explore the long-term trend of effects of time on Baidu indices, in this study, we confined the degree of freedom of time smoothing functions to 1-4, as commonly proposed in some studies [26,48,50,52]. Through graphically analyzing the time smoothing functions with those different degrees of freedom, we found that the time smoothing function with three degrees of freedom had the best consistency with the observation series. Furthermore, based on published literature [4,61], three degrees of freedom (whole period of study) for mean air temperature and mean relative humidity could control well for the meteorological effects on mortality and, thus, it was chosen to be used in our models. In summary, we finally took k 1 = k 2 = k 3 = 3 in Equation (3) to obtain the estimated log-relative rate β.
Many researchers found that the health effects of air pollution on respiratory and cardiovascular diseases have a hysteresis of 0-6-day lags [25,48]. Therefore, we delayed the time series of the RTI and CTI and then carried out GAM regression analysis. The results are shown in Table 5. We can see that the βs of AQI varied significantly with the different lag periods. The regression coefficients reached a maximum when the time lag was three days. This lag is also consistent with the results obtained by other studies through hospital cases [25,48]. Therefore, the GAM analysis was performed with a time lag of three days. Comparing the results of GAM regressing the RTI and CTI on the AQI, we can see that the R 2 and the explained level of the RTI were much higher than those of the CTI, indicating that the effects of air pollution on respiratory diseases are more significant than those on cardio-and cerebrovascular diseases. Air pollution, in contrast, can only statistically explain a lesser part of the change in cardioand cerebrovascular incidences.
We also regressed the sub-indices with GAM and the results are shown in Table 6. We observed the regression curves of the AQI, meteorological factors, and time, and compared the contributions between them, so as to determine whether to retain the index. Eventually, 13 indices were retained, including seven respiratory indices, and six cardio-and cerebrovascular indices. Specifically, when β was larger than that of the total index, meaning that the influence of the sub-index was significant, the sub-index was retained. When β was small, and the R 2 and explained deviance was also small, it meant that, although the impact of the sub-index was smaller, it was more significant compared with meteorological factors and time and, thus, the sub-index should also be retained. The sub-index was removed when β was small, and the R 2 and explained deviance were less than 0.1, indicating that AQI, meteorological factors, and time were not significant. The sub-index was removed when β was small, and R 2 and explained deviance were high, indicating that meteorological factors and time were more significant than the sub-index. Table 6. Results with GAM considering a three-day lag, with AQI (and IAQI) as explanatory variables and Baidu indices (BIs) as response variables. The indices marked by "-" indicate insignificance and were removed; the number of asterisks indicates how well the index is explained by AQI (IAQI).

RR of Air Pollutants
Taking RTI and CTI as the response variables, and the concentrations of the six pollutants as explanatory variables, and considering meteorological factors and time, we carried out a GAM Poisson regression analysis with three-day lag. The p-values of the explanatory variables for each model were all less than 0.001, indicating significant contributions of these pollutants to total indices. The results are shown in Table 7, indicating that the RR values of the six pollutants for cardio-and cerebrovascular diseases were higher than those for respiratory diseases. However, the differences were not great; NO 2 and O 3 had the highest RR values, followed by PM 10 and PM 2.5 . Table 7. Exposure-response assessment of the six pollutants. While the unit of CO concentration is mg/m 3 , the units of the other pollutant concentrations are µg/m 3 . The fifth column shows relative risks when pollutant concentrations change by one unit of IQR. The sixth column shows the increases of relative risks of the health outcome per 10 µg/m 3 (per 1 mg/m 3 for CO) increase in pollutant concentrations.

Performance of HAQI
Next, we compared HAQI with AQI in terms of their capability of indicating health effects. We took PM 2.5 as a benchmark and selected the closest limit of 500 µg/m 3 (the maximum daily concentration of PM 2.5 during the study period was 478 µg/m 3 ); then, IAQI = PSI = 500, RR Total = (500/IQR PM2.5 ) × (RR PM2.5 − 1) + 1 = 1.223. Thus, we further established the relationship between RR Total and IAQI and PSI. If pollutants have the same RR total values at different concentrations, they have the same PSI values. The pollutant concentrations corresponding to different PSI values could be calculated through linearly interpolating RR total . The concentration values of the six pollutants corresponding to PSI and RR Total were calculated and they are shown in Table 8. It is worth noting that the breakpoints of PSI and RR Total were somehow a little arbitrary. We calculated pollutant concentrations at evenly spaced breakpoints for PSI and RR Total . Since the concentrations were calculated based on relative risk of individual pollutants, they should be different from the currently used AQI that reflects the comprehensive effects of air pollution on environment, ecology, and buildings, as opposed to health. Table 8. The calculated concentrations for the six pollutants at evenly spaced breakpoints of pollutant sub-index (PSI) ranging from 0 to 500, given that PSI = 500 corresponds to the PM 2.5 concentration of 500. a j was calculated according to Equation (7), which was used to calculate daily health AQI (HAQI) with Equation (8) To evaluate the effects of the currently used AQI and the HAQI in expressing health outcome, we plotted the three curves of AQI, HAQI, and the daily exposure risk RR Total , as shown in Figure 7, showing that the overall changes of the three curves were similar, but there were obvious differences in local places. For quantitative comparison, the Pearson correlations between AQI, HAQI, and RR were calculated. The correlation coefficient between AQI and RR was 0.86, with a p-value < 0.001. The correlation coefficient between HAQI and RR was 0.95, with a p-value <0.001. In this sense, the HAQI was a little better for representing the short-term risk of air pollution. To evaluate the effects of the currently used AQI and the HAQI in expressing health outcome, we plotted the three curves of AQI, HAQI, and the daily exposure risk RRTotal, as shown in Figure 7, showing that the overall changes of the three curves were similar, but there were obvious differences in local places. For quantitative comparison, the Pearson correlations between AQI, HAQI, and RR were calculated. The correlation coefficient between AQI and RR was 0.86, with a p-value <0.001. The correlation coefficient between HAQI and RR was 0.95, with a p-value <0.001. In this sense, the HAQI was a little better for representing the short-term risk of air pollution.

Discussion
In this paper, we focused on the evaluation of the short-term effect of air pollution on human health. An assumption was made that the internet searches of keywords of air pollution-related

Discussion
In this paper, we focused on the evaluation of the short-term effect of air pollution on human health. An assumption was made that the internet searches of keywords of air pollution-related diseases was positively correlated with some symptoms of diseases including respiratory, and cardioand cerebrovascular ones. The search data were used as indicators of the disease outcome instead of hospital outpatient data. These search data are supposed to avoid some disadvantages from hospital outpatient data such as high difficulty in data acquisition, small sampling range, and uneven sampling.
A GAM was employed to model the association between internet search and air pollution. Specifically, a log form of response variable with Poisson distribution was used, and we improved the model mainly in two aspects: (1) the time lag was explored and incorporated into the regression form, and (2) some non-significant indices were eliminated according to several evaluation indices (r, R 2 , p-value, and deviance explained). Through the analysis, we obtained the RRs of the six air pollutants for respiratory, and cardio-and cerebrovascular diseases. The results show that the risk of O 3 and NO 2 for all the concerned diseases in this study was higher than other pollutants. The risk of a certain pollutant was higher for cardio-and cerebrovascular diseases than for respiratory diseases. Furthermore, we proposed a RR-based health air quality index (HAQI), which is intended to provide an indicator for assessing the impact of air pollutants on human health. The comparison between the currently used Chinese AQI and the HAQI was made, showing HAQI to be a little better than AQI.
Pearson correlation coefficients between RTI and every air quality index except O 3 showed significant positive correlations. O 3 had a significant negative correlation with RTI, which can be easily confirmed from the curves of their daily series in Figures 2 and 3, since the curves of RTI and O 3 had approximately opposite changes. However, we found that the RR of O 3 for RTI was still greater than 1 when we examined the RRs of individual air pollutants. Pearson correlation only calculates r between two variables regardless of the effect of other variables; as a result, it cannot accurately depict a multi-variate relationship. In fact, when we regressed RTI on O 3 and took meteorological factors as control variables, the adverse effect of O 3 on health was presented.
The Baidu index only represents samples from the population since it is calculated according to the overall searches from regional netizens. It may have a bias because of some exceptional searching behaviors. It also cannot differentiate according to different age groups and sexes and, therefore, it is difficult to establish a direct link between the health loss and the index. We were, therefore, unable to evaluate the health loss in detail, which is useful for health risk assessment. On the other hand, our study required the keywords used for estimating search volume of the concerned diseases: respiratory and cerebro-and cardiovascular diseases were reasonably selected. We assured this as much as possible by duly and carefully picking the most frequently used Chinese words in depicting symptoms of these diseases through a literature survey. Nonetheless, our study was a statistical analysis of several sets of time series (air pollution, weather, and morbidity represented by internet search frequency); the qualitative and quantitative relationships between them were only data-driven and cannot be pronouncedly confirmed as causal evidences. Some studies reported little or no acute effects of air pollution on cerebrovascular diseases, whereas others showed that the acute effects of air pollution caused myocardial infarction, ischemic stroke, ischemic heart disease, and cardio-and cerebrovascular diseases, resulting in an increase in emergencies, outpatient intake, and death [62]. The reasons for the differences in these findings may be (1) cardio-and cerebrovascular diseases are a large class of diseases and air pollution may be associated with the morbidity of certain subtypes but not with another subclass of diseases, and (2) when these diseases are summed up to the major categories of cardio-and cerebrovascular disease, it may lead to no statistically significant result.
The RR values of respiratory, and cardio-and cerebrovascular diseases associated with NO 2 , SO 2 , and O 3 in our study are both larger than those reported on health impact assessment in the WHO European region [63]. However, the RR values associated with PM 2.5 and PM 10 are smaller than those reported by the WHO (PM 2.5 and PM 10 ) and several studies of air pollution in Europe [3,64] (PM 10 ).
We also compared the results with three studies for cities in China [65][66][67]. The first investigated the associations between ambient air pollution and adult respiratory mortality in 32 major Chinese cities, and the RR values of respiratory diseases associated with PM 10 were greater than ours. The second and third studies performed a nationwide analysis of associations between PM 2.5 and SO 2 concentrations and daily cause-specific mortality in 272 Chinese cities respectively, indicating that the RR values of cardiovascular and respiratory diseases associated with SO 2 and PM 2.5 were also greater than ours. A relative overall survey of effects of air pollution on health in Chinese populations was made by Reference [68], which also showed RR values of respiratory and cardiovascular diseases associated with NO 2 and PM 10 higher than ours, while those associated with O 3 and SO 2 were lower than ours. The RRs in those studies were derived from medical records of mortality or morbidity, whereas our study was based on an internet search of keywords representing disease symptoms, which may be quite different from each other. Next, because of the different physical characteristics of each person, the threshold of response to physical abnormalities is also different, along with their internet habits. Furthermore, people living in different regional environments (e.g., high-pollution environments like some Chinese cities) may have different response characteristics. As a result, a simple comparison between the results of this study and those of other studies is unreasonable. The benefit of our study is its incorporation of those who feel physical discomfort and mild symptoms, but choose not to seek medical advice. Our analysis is not subject to the specific criteria for morbidity or death and, thus, it was intended to provide a broader assessment of the risk of air pollution.
In our next study, we plan to obtain a more complete table of air pollution-related keywords through network opinion analysis and big data mining or to investigate the search intention in case of air pollution through a questionnaire survey. These data will be incorporated into the analysis process (e.g., by weighting the keywords). While it may be true that there are some disadvantages, hospital data do directly measure health, unlike the Baidu index. Internet search data and outpatient data are two important data sources for studying the impact of air pollutants on diseases and we should make comparisons between them and take their advantages together to get insight into the relationship between environment and diseases. For example, we can evaluate the validity of using the Baidu index to indicate the presence of disease through exploring the linear correlation between Baidu index and contemporaneous outpatient data. This will also be included in our future studies on inter-validation of the Baidu index and outpatient data as predictive of actual health outcomes.
Author Contributions: S.Z. principally conceived the idea for the study and wrote the initial draft of the manuscript. Z.Y. was responsible for downloading and preprocessing all the data, and setting up experiments. W.Z. was responsible for project administration and provided financial support. S.Z. and Z.Y. were responsible for revising and improving of the manuscript according to reviewers' comments.