The Influence of South East Asia Forest Fires on Ambient Particulate Matter Concentrations in Singapore: An Ecological Study Using Random Forest and Vector Autoregressive Models

Haze, due to biomass burning, is a recurring problem in Southeast Asia (SEA). Exposure to atmospheric particulate matter (PM) remains an important public health concern. In this paper, we examined the long-term seasonality of PM2.5 and PM10 in Singapore. To study the association between forest fires in SEA and air quality in Singapore, we built two machine learning models, including the random forest (RF) model and the vector autoregressive (VAR) model, using a benchmark air quality dataset containing daily PM2.5 and PM10 from 2009 to 2018. Furthermore, we incorporated weather parameters as independent variables. We observed two annual peaks, one in the middle of the year and one at the end of the year for both PM2.5 and PM10. Singapore was more affected by fires from Kalimantan compared to fires from other SEA countries. VAR models performed better than RF with Mean Absolute Percentage Error (MAPE) values being 0.8% and 6.1% lower for PM2.5 and PM10, respectively. The situation in Singapore can be reasonably anticipated with predictive models that incorporate information on forest fires and weather variations. Public communication of anticipated air quality at the national level benefits those at higher risk of experiencing poorer health due to poorer air quality.


Introduction
Biomass burning is the burning of living and dead vegetation, and it can occur naturally or due to human activities. [1][2][3]. Haze, generated by biomass burning, causes air pollution that affects local air quality as well as the air quality of distant places. Haze can have detrimental impacts on human health [4][5][6][7][8], climate, biodiversity, tourism and agricultural production [9] as well as degrade visibility [10].
In recent decades, biomass burning has become a recurring phenomenon in mainland Southeast Asia (SEA) and the islands of Sumatra and Borneo [10][11][12][13][14]. The majority of biomass burning in Southeast Asia occur due to human initiated activities such as land clearing for oil palm plantations, other causes of deforestation, poor peatland management, and burning of agriculture waste [15,16]. Haze can be felt even in downwind locations such as Singapore [17,18].
Several studies have shown that meteorological conditions have significant influence on the formation of haze [19][20][21][22][23][24]. In 2012, Reid et al. [25] investigated relationships between fire hotspot appearance and various weather phenomena as well as climate variabilities in different timescales and found that the influence of these factors on fire events varies over different parts of the Maritime Continent. Haze was also shown to be worse in El Niňo years [26]. In addition, a study in Singapore demonstrated that haze fluctuates according to localities and seasons and is also influenced by factors such as weather parameters and the extent of burning in the neighboring regions [10].
Studies have also shown that forest fires in one area can affect air quality in surrounding countries. For example, a study on the 1997 Indonesia forest fires reported aerosols being transmitted from Kalimantan to other countries in SEA, including Singapore [27]. Reports have also shown the differences in air quality within a country. For instance, Singapore reported that the concentration of particulate matters in haze measured across the different regions in Singapore varied, according to seasonality as well as relative contribution from various source regions [10]. The significant variations in haze concentration across a small city like Singapore stresses the importance of a need for spatial and temporal modelling. A haze forecast system was established by the Met Office (MO) and the Meteorological Service Singapore (MSS) to predict haze in Singapore [28]. The modelling system developed could accurately reproduce the haze conditions observed in the Maritime Continent and in mainland Southeast Asia in 2013 and 2014. However, to the best of our knowledge, there is no long term study on the seasonality of air quality in Singapore, and no predictive modelling that provides daily air quality predictions. A daily prediction of air quality will be useful for nationwide planning for community activities.
Researchers have used several machine learning techniques to predict air quality. A novel spatiotemporal deep learning based air quality prediction method was proposed by researchers in Beijing, and the study showed that the proposed method outperformed models using the artificial neural network, regression moving average, and support vector regression techniques [29]. Another study explored three methods: (i) laboratory univariate linear regression, (ii) empirical multiple linear regression, and (iii) machine-learning-based calibration models using random forests (RF) and concluded that combining RF models with carefully controlled state-of-the-art multipollutant sensor packages improves the performances of prediction models of air quality sensors [29]. Another study, focusing on forecasting urban air pollution, showed that using different features in multivariate modelling with the M5P algorithm yields the best forecasting performances [30].
In this present study, we examined the association between forest fires in SEA and air quality in Singapore using different statistical models. Daily air quality forecasts will help the community to be better prepared for outdoor activities, and is especially useful for vulnerable individuals.

Study Setting
We conducted our study in Singapore (1 • 17 N 103 • 50 E) (Figure 1), a city state with a land area of 724.2 square kilometer, and a population density of 7804 people per square kilometer, one of the highest population densities in the world [31]. Singapore experiences a tropical climate with abundant rainfall, high and uniform temperatures and high humidity all year round [32].

Climate Data
Daily mean temperature (in degrees Celsius), minimum temperature (in degrees Celsius), maximum temperature (in degrees Celsius), relative humidity (in percentage), mean wind speed (meters per speed), minimum wind speed (in meters per speed), maximum wind speed (meters per speed), wind direction (0 to 360 degrees) and total rainfall (in millimeter) from 2009 to 2018 recorded in Changi weather station in Singapore is obtained from MSS. MSS maintains a comprehensive network of specialized meteorological observing systems. It undertakes weather observation practices in accordance with international standards, and manages the long-term archive and quality control of national climate data [33].

Air Quality Data
Biomass burning contributes mainly to two pollutants; particulate matter 2.5 (PM 2.5 ) which are particles in the air that are 2.5 micrometers or less in aerodynamic diameter, and particulate matter 10 (PM 10 ), which are particles in the air that are 10 micrometers or less in aerodynamic diameter. These two pollutants are chosen for this study. The 24-h average of PM 2.5 and PM 10 for Singapore is recorded daily from 2009 to 2018 ( Figure 2). The units for both PM 2.5 and PM 10 are microgram per cubic meter. Air quality readings are obtained from the USEPA AQI (United States Environmental Protection Agency Air Quality Index) system, which has been supported as an appropriate measurement by the Advisory Committee [34,35].
10 (PM10), which are particles in the air that are 10 micrometers or less in aerodynamic diameter. These two pollutants are chosen for this study. The 24-h average of PM2.5 and PM10 for Singapore is recorded daily from 2009 to 2018 ( Figure 2). The units for both PM2.5 and PM10 are microgram per cubic meter. Air quality readings are obtained from the USEPA AQI (United States Environmental Protection Agency Air Quality Index) system, which has been supported as an appropriate measurement by the Advisory Committee [34,35].

Statistical Analyses
The outcome variables for this study are PM2.5 and PM10. The independent variables are (i) mean temperature, (ii) minimum temperature, (iii) maximum temperature, (iv) relative humidity (v) mean wind speed, (vi) minimum wind speed, (vii) maximum wind speed, (viii) wind direction, (ix) total rainfall, (x) counts of hotspots in Kalimantan, (xi) counts of hotspots in Sumatra, (xii) counts of hotspots in Sabah and Sarawak and (xiii) counts of hotspots in Peninsular Malaysia. Each independent variable has 31 variations, with lags from 0 days to 31 days (Supplementary Table 1).

Statistical Analyses
The outcome variables for this study are PM 2.5 and PM 10 . The independent variables are (i) mean temperature, (ii) minimum temperature, (iii) maximum temperature, (iv) relative humidity (v) mean wind speed, (vi) minimum wind speed, (vii) maximum wind speed, (viii) wind direction, (ix) total rainfall, (x) counts of hotspots in Kalimantan, (xi) counts of hotspots in Sumatra, (xii) counts of hotspots in Sabah and Sarawak and (xiii) counts of hotspots in Peninsular Malaysia. Each independent variable has 31 variations, with lags from 0 days to 31 days (Supplementary  Table S1). Correlation tests are carried out using the "corrr" package in the R statistical language [37] to determine the association between the outcome variables and each of the independent variables. We evaluated the trend and seasonality of the daily values of PM 2.5 and PM 10 in separate time series models using the "ts" and "decompose" package implemented in the R statistical language [37]. The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) was used to test if the time series was stationary. KPSS test for both PM 2.5 and PM 10 showed they were both stationary over time (p-value < 0.05). Therefore, the subsequent models used for prediction in this study are appropriate.

Model Parameters and Evaluation
Several models such as backward stepwise multivariate regression model, Holtwinter's Time Series model, Seasonal Autoregressive Integrated Moving Average model, RF and VAR models were explored for the analyses. We chose RF and VAR model for the following reasons. The RF model was chosen due to the ease of interpreting results; predictors that affect the outcomes most can be easily interpreted based on the importance calculation. Comparing the different time series models, VAR stands out as we can incorporate multiple independent variables into the model, which is relevant for our dataset. All the independent variables were also stationary hence the model was also appropriate for the analyses. Separate statistical models using RF and VAR techniques were built for both PM 2.5 and PM 10 . The independent variables that were incorporated into the models can be found in Supplementary Table S1. All dataset (2009-2018) were randomly split into training (70%) dataset and testing (30%) dataset to evaluate the accuracy of the models. The accuracy of the models was tested by calculating the mean absolute percentage error (MAPE) for each model using the following equation, where n is the total number of fitted points: All data and statistical analyses were performed using R software version 3.6.1 [37]. Statistical significance was assessed at the 5% level. All results, where indicated, are computed for 95% confidence intervals (CI).

RF Model
RF is an ensemble machine learning method that uses an ensemble of decision trees [38]. In RF, several bootstrap samples are drawn from the training set data, and an unpruned decision tree is fitted to each bootstrap sample. At each node of the decision tree, variable selection is carried out on a small random subset of the predictor variables. The best split on these predictors is used to split the node.
To find the best split for the model, we plotted the Out of Bag Error estimates and the error calculated on the test set [39]. We chose the split that gives the lowest error. We also calculated the percentage mean squared error (MSE) for each independent variable to determine the importance of each variable. MSE is calculated by the following equation: Percentage MSE is computed by calculating the percent increase in MSE of the RF model when the data for each variable were randomly permuted. For each tree, the MSE on test is recorded. Then the same is done after permuting each predictor variable. The difference between the MSE on test and the MSE of the new model, from permuting each predictor variable, are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done. The higher the difference is, the more important the variable. We categorized the top-ranked variables with a MSE of >10%. The predicted response is obtained by averaging the predictions of all trees. RF analyses are performed using the "Random Forest" package implemented in the R statistical language [37].

VAR Model
The VAR model extends the idea of univariate autoregression to multi time series regressions, where the lagged values of all series appear as regressors. The model can be seen as a linear prediction model that predicts the current value of a variable based on its own past value on the previous point in time and the past values of the other variables [40]. For example, the VAR model of two variables X t and Y t (k = 2) with the lag order p is defined as The βs and γs can be estimated using ordinary least squared on each equation [41]. Analyses are carried out under the assumption of normality of the data. The function "VARselect" is first used to select the maximum lag which has the lowest Akaike information criterion (AIC). The AIC is an estimator of out-of-sample prediction error and it estimates the quality of each model, relative to each of the other models. VAR analyses are conducted using the "vars" package implemented in the R statistical language [37].

Association between PM 2.5 and PM 10 with Climate and Hotspots Variables
The independent variables had weak correlation with PM 2.5 and PM 10 ; however, we noticed that for both PM 2.5 and PM 10 counts of hotspots in Kalimantan with lags between 1 to 18 days had an average correlation coefficient of 0.2, and p-value < 0.05. The correlation coefficients and corresponding p-values between the outcome variables (PM 2.5 and PM 10 ) and the climate and hotspot variables are listed in Supplementary Table S2.

Time-Series Analyses of Daily 24-h Average of PM 2.5 and PM 10
There are seasonal fluctuations in both PM 2.5 and PM 10 over the study period. We observed two annual peaks, one in the middle of the year and one at the end of the year for both PM 2.5 (y = −2 × 10 −9 x 4 + 3 × 10 −6 x 3 − 0.0013x 2 + 0.2445x − 12.161) and and PM 10 (y = −2 × 10 −9 x 4 + 3 × 10 −6 x 3 − 0.0013x 2 + 0.2316x − 11.364). There was no discerning trend, but we noticed two episodes of extremely poor air quality in mid-2013 and mid-2015, and these appeared to be outliers. Figure 4 shows the breakdown of the seasonality of PM 2.5 and PM 10 .

RF Model
The RF models are built using 500 trees, and the number of variable splits that gives the lowest error for model PM 2.5 and model PM 10 are 193 and 89, respectively. Among the independent variables, relative humidity with lags of 0, 1 and 2 days are top-ranked for PM 2.5 and PM 10 . In addition, counts of hotspots in Kalimantan with lags of 8 and 11 days are top-ranked for PM 2.5 , whilst counts of hotspots in Kalimantan with lags of 1, 8 and 9 days are top-ranked for PM 10 . The MSEs calculated for the rest of the variables are listed in Supplementary Table S3. Figure 5 shows graphical comparison of the predicted and actual values for PM 2.5 and PM 10 .

RF Model
The RF models are built using 500 trees, and the number of variable splits that gives the lowest error for model PM2.5 and model PM10 are 193 and 89, respectively. Among the independent variables, relative humidity with lags of 0, 1 and 2 days are top-ranked for PM2.5 and PM10. In addition, counts of hotspots in Kalimantan with lags of 8 and 11 days are top-ranked for PM2.5, whilst counts of hotspots in Kalimantan with lags of 1, 8 and 9 days are top-ranked for PM10. The MSEs calculated for the rest of the variables are listed in Supplementary Table 3. Figure 5 shows graphical comparison of the predicted and actual values for PM2.5 and PM10.

VAR Model
To get the lowest AIC, the VAR model for PM2.5 and PM10 was built using maximum lags of 8 and 9 respectively. The variables used in the models PM2.5 and PM10 are listed in Supplementary Table  4. Tables 1 and 2 summarize the coefficients of the variables that were significant (p < 0.05) for PM2.5 and PM10, respectively.

VAR Model
To get the lowest AIC, the VAR model for PM 2.5 and PM 10 was built using maximum lags of 8 and 9 respectively. The variables used in the models PM 2.5 and PM 10 are listed in Supplementary Table S4.  Tables 1 and 2 summarize the coefficients of the variables that were significant (p < 0.05) for PM 2.5 and PM 10 , respectively. Table 1. Coefficients for variables associated with PM 2.5 that are significant (p < 0.05) using vector autoregressive (VAR) model.

Variables Estimate (CI)
Mean temp with 2 days lag     Table 3 shows the MAPE values for each of the four models. From Table 3, we can see that VAR models have lower MAPE performance compared to that of the RF models for both PM2.5 and PM10 experiments.

Discussion
In this study, we sought to examine the association between forest fires and air quality in Singapore. We found a positive association between ambient air particulate concentrations in Singapore and counts of instances of forest fires. This association was observed with a 1 to 8 days' lag depending on the location of the forest fires. Our study findings were consistent with other  Table 3 shows the MAPE values for each of the four models. From Table 3, we can see that VAR models have lower MAPE performance compared to that of the RF models for both PM 2.5 and PM 10 experiments.

Discussion
In this study, we sought to examine the association between forest fires and air quality in Singapore. We found a positive association between ambient air particulate concentrations in Singapore and counts of instances of forest fires. This association was observed with a 1 to 8 days' lag depending on the location of the forest fires. Our study findings were consistent with other studies. Significant build-up of aerosol and black carbon concentrations was observed in the Tibetan plateau due to the occurrence of fires and transport of pollution from the nearby regions of Southeast Asia and the northern part of the Indian Peninsula [42]. Similarly, forest fires in Serbia resulted in air pollution through Mongolia, eastern China, down to the Korean peninsula [43]. This finding is not unexpected. Past research has shown that forest fire emissions were the largest contributors to the air pollution problem in regions tens of kilometers away from the fire source [44]. Our RF model picked up counts of hotspots in Kalimantan up to 11 days' lag as significant variable that affects PM 2.5 and PM 10 concentration in the air. A similar study on the 1997 Indonesia forest fires corroborates our results that Singapore was more affected by fires from Kalimantan compared to fires from other countries, due to the shifting of the monsoons [45]. Although Malaysia and Sumatra are closer to Singapore in terms of distance than Kalimantan [46], the models show that climatic factors are important in influencing the impact of forest fires on air quality.
Seasonality shows that the peaks in poor air quality in Singapore occurs twice, once in the middle of the year, and one at the end of the year. This finding corresponds with other studies that show that high values of PM 2.5 and PM 10 are reported in the middle of the year, which corresponds to the burning season [42]. Similarly, it is also reported that the burning season in SEA peaks from July to October [47]. High amounts of PM 2.5 and PM 10 not only aggravate health issues, but they also degrade visibility. Hence, these results can be used to guide tourism as well as large scale community programs.

•
Based on our RF model's importance plot, relative humidity is another significant variable that affects PM 2.5 and PM 10 concentration in the air. Other studies have also concluded that relative humidity is a key factor in influencing the distribution of air quality [48,49].

•
In contrast, the VAR models picked up mean temperature lagging PM 2.5 and PM 10 by one and two days having significant negative effect on the concentration of PM 2.5 and PM 10 in the air. The effect of mean temperature on air quality has, however, been inconsistent, with several studies showing conflicting findings. Some studies have observed a negative correlation between mean temperature and concentrations of PM 2.5 and PM 10 [50,51]. However, there are other studies that have shown that there is a combined effect of climatic factors on the concentration of PM 2.5 and PM 10 . For example, a study in Nagasaki, Japan concluded that temperature is positively correlated with PM 2.5 and PM 10 during monsoons and negatively correlated during other seasons [52].
Another study in Dhaka also showed variable response of relative humidity with air pollutants according to seasonal variation [53]. Hence, machine learning methods are relevant for the predictions of air quality, due to the mixed effects of climatic factors.

•
Comparing RF and VAR models, the VAR models performs slightly better with MAPE values being 0.8% and 6.1% lower for PM 2.5 and PM 10 , respectively. Hence, the VAR model can be reliably used for future predictions of the concentration of PM 2.5 and PM 10 in urban atmosphere in Singapore. To improve the communication of predictions to the community, we can categorize the predicted values according to the Table 4 [54]. Singapore uses this category to show the levels of pollutants in the air. It will be useful to release a daily prediction of PM 2.5 and PM 10 for community preparedness. There are several study limitations. The fire detection algorithm used to identify forest fires hotspots is based on higher emissions of mid-infrared radiation. The fire detection algorithm compares the values of suspected fire pixels against a set of absolute thresholds, and with values of surrounding pixels. We note that hotspot detected does not always correspond to actual land fires. Other than climatic factors, there are other factors that can affect the air quality in Singapore. The models did not account for other anthropogenic sources of PM such as those from industry and shipping. Data on these factors should be collected and included into the models, to see if they can improve the predictions. In addition, currently, the dataset for independent variables were collected from Changi Meteorological Station, which is the eastern meteorological station in Singapore. Daily news reports on pollutants have shown that different parts of Singapore can be affected by the biomass burning at different intensities [55]. It will be useful to provide predictions for the five areas in Singapore (north, south, east, west and central). In order to achieve this, we need to collect climate data in different meteorological stations around the island which is spatially representative, and also obtain the measurements from the hotspots to the stations as one of the variables. The models can be further developed for better spatial resolution. Lastly, analyses were done using average values for a daily prediction. It might be more useful to the community to predict the air quality on an hourly basis. Hence, moving forward we could collect hourly data and run the models.

Conclusions
There was a positive association between ambient air particulate concentrations in Singapore and counts of instances of forest fires, and Singapore was more affected by fires from Kalimantan compared to fires from other SEA countries. In addition, the peaks in poor air quality in Singapore occurs twice, once in the middle of the year, and one at the end of the year. VAR models performed better than RF model in predicting air quality. Our study findings suggest that air quality in Singapore can be reasonably anticipated with predictive models that incorporate information on forest fires and weather variations. The public communication of anticipated air quality at the national level benefit who are at higher risk of experiencing poorer health due to poorer air quality.
Supplementary Materials: The following are available online at http://www.mdpi.com/1660-4601/17/24/9345/s1, Table S1: List of independent variables, Table S2: The correlations coefficients and corresponding p-values between the outcome variables (PM 2.5 and PM 10 ) and the climate and hotspot variables, Table S3: The percentage Mean Squared Errors and the corresponding increase in node purity of the variables, Table S4: The variables used in the models PM 2.5 (Excel Sheet labelled PM 2.5 ) and PM 10  Funding: The study was funded by NEA, Singapore. The funding sources of this study had no role in the study design, data collection, data analysis, data interpretation, writing of the report, or in the decision to submit the paper for publication.