Development of Multiple Linear Regression for Particulate Matter (PM10) Forecasting during Episodic Transboundary Haze Event in Malaysia

Malaysia has been facing transboundary haze events every year in which the air contains particulate matter, particularly PM10, which affects human health and the environment. Therefore, it is crucial to develop a PM10 forecasting model for early information and warning alerts to the responsible parties in order for them to mitigate and plan precautionary measures during such events. Therefore, this study aimed to develop and compare the best-fitted model for PM10 prediction from the first hour until the next three hours during transboundary haze events. The air pollution data acquired from the Malaysian Department of Environment spanned from the years 2005 until 2014 (excluding years 2007–2009), which included particulate matter (PM10), ozone (O3), nitrogen oxide (NO), nitrogen dioxide (NO), carbon monoxide (CO), sulfur dioxide (SO2), wind speed (WS), ambient temperature (T), and relative humidity (RH) on an hourly basis. Three different stepwise Multiple Linear Regression (MLR) models for predicting the PM10 concentration were then developed based on three different prediction hours, namely t+1, t+2, and t+3. The PM10, t+1 model was the best MLR model to predict PM10 during transboundary haze events compared to PM10,.t+2 and PM10,t+3 models, having the lowest percentage of total error (28%) and the highest accuracy of 46%. A better prediction and explanation of PM10 concentration will help the authorities in getting early information for preserving the air quality, especially during transboundary haze episodes.


Introduction
Haze pollution has become an international issue as it affects the local, regional, and global air quality, especially in Southeast Asian (SEA) countries, including Malaysia, Singapore, Indonesia, Brunei Darussalam, and Southern Thailand [1][2][3]. Haze occurs annually due to periodic biomass burning, such as forest fires, peatland combustion, and agricultural burning in Sumatra and Kalimantan, Indonesia. The influence of unpredictable monsoon rain and El-Nino phenomenon both worsens the transboundary haze due to the prolonged and drier weather conditions, especially during the southwest monsoon season [3][4][5][6][7][8]. These conditions increase the severity of air quality due to it containing a noxious mix of air pollutants, such as particulate matter and noxious gases, which are collectively influenced by meteorological factors like ambient temperature, relative humidity, and wind speed [9,10]. Malaysia has been facing this awful phenomenon for almost over a decade, which started in the 1980s up until recent years [5,11,12]. Haze is detected when the atmosphere contains a higher concentration of particulate matter suspended in the air, which either can or cannot be observed by the naked eye when the relative humidity is lower than 80% and the visibility is reduced to 10 km [5,[13][14][15]. Daily particulate matter with 10 micro aerodynamic diameters (PM 10 ) exceeding 150 µg/m 3 for 24 h of permissible limit is thus notified as haze, which is set under the new Malaysian Ambient Air Quality Standard (NAAQS) by the Department of Environment, Malaysia.
Biomass peat soil combustion generally emits the particulate matter and gaseous pollutants, as well as their components such as anions, cations, heavy metals, and levoglucosan [16][17][18][19]. Carbon monoxide (CO) is higher during haze events compared to other gases due to the characteristics of peatland areas, which are rich in carbon materials. It is emitted through the incomplete combustion process of carbonic materials in the presence of a minimum amount of oxygen. Meanwhile, PM 10 contains a high concentration of NO 3 − anion produced through bacterial nitrification and oxidation process during combustion, while SO 4 2− cation is also found in a high amount due to pesticide application in the peat soil, thereby slowing down the accumulation of sulfur [18]. Those ions originate from sulfurous and nitrogenous materials as they will be converted to such particles via gas-to-particle conversion. Similarly, heavy metals from the peatland soil are converted to particulate matter by the combustion process in which the characteristics of heavy metals will be contained in the ash suspended in the air [17,18]. The isomeric anhydrous sugar levoglucosan is found in the fine particulate matter during transboundary haze as it is known as a specific and general indicator of biomass emission composition directly emitted into the atmosphere. It is yielded by some anthropogenic activities such as wood burning and biogenic activities, as well as the by-product of secondary photochemical oxidation of organic precursors [16]. Every time the haze phenomenon hits the country, it will inhibit the economic and social developments; the country will need to bear higher medical costs due to drastically increased outpatient and inpatient numbers. This was estimated to be approximately MYR273,000 (USD91,000) based on the data from 2005 until 2009 for 14 haze-related illnesses suffered by the public [5,20]. Moreover, when the haze event worsens and it is declared unsafe, hundreds of schools in the affected areas are closed, communication becomes limited, and people will need to reduce outdoor activities in order to minimize their exposure to it [21,22]. Therefore, the tourism sector is also affected during these events in terms of the economy, human perception, and the environment. Similarly, huge losses occur when the atmospheric visibility decreases; many flights are cancelled due to unsafe conditions to fly, which decreases the number of tourism activities. Furthermore, the number of extinct plants, animals, and biodiversity will increase due to extreme environmental events, while the circumstances can also change tourism moods due to unsatisfactory environmental conditions and lead to a refusal to revisit [5,10,21,22].
Toxic particles in haze harm human health either in cumulative or acute effects like respiratory mortality, cardiovascular illness, and lung cancer to all groups of people, such as adults and children [9,20]. The small particulate size contained in haze can easily penetrate the lung; at worst, it also goes into the bloodstream [22]. Based on the previous study by [4], long-term exposure to air pollution causes an increase in neurological diseases such as cerebrovascular disease and stroke, along with headache and migraine symptoms. Skin is one of the organs that are directly in contact with air pollution, especially during transboundary haze events. The small particulate matter enters through the skin via the pores, causing allergic symptoms and skin damage, such as skin irritation Atmosphere 2020, 11, 289 3 of 14 and inflammation, damaged skin barrier function, increased rate of ageing, and influences on the composition of skin lipid [23][24][25]. In the worst-case scenario, long-term exposure to air pollution causes an increase in the degree of skin sensitivity and causes skin defect, as well as atopic dermatitis [23].
During transboundary haze events, particulate matter, especially the harmful PM 10 pollutant, needs to be predicted for determining and understanding its dispersion behavior in the atmosphere. This can provide information to the concerned people and their awareness to reduce outdoor activities at the affected areas [26]. The Multiple Linear Regression (MLR) model is globally and widely used over many years as a method for air pollution forecasting, which can help to attempt the uncertainty of the future simply by relying on past and current data for decision-making [27][28][29]. The fundamental basis of this model represents the relationship between the dependent variable and several independent variables, such as meteorological factors and gaseous pollutants, by uncomplicated computation and easy implementation [30,31]. Previous studies in Malaysia have developed the MLR model to predict PM 10 concentration, specifically in the East Coast Peninsular Malaysia, based on different site classifications of rural, suburban, urban, and industrial areas and during different types of monsoon to determine its variation during non-haze periods [27,32,33]. Moreover, [34] used the MLR model for PM 10 forecasting during haze events that hit Malaysia based on the data from the affected Air Quality Monitoring Station (AQMS). Therefore, this study aims to determine the annual variation of PM 10 concentration during transboundary haze events and develop the models to investigate the relationship linking PM 10 with meteorological factors and gaseous pollutants, which are contributing to the high PM 10 concentration. PM 10 is used in this study as the dependent parameter for the prediction purposes as the Malaysian Department of Environment had started to monitor the PM 2.5 concentration in 2018. However, the episodic haze events need a long period of dataset, which is the main reason to use PM 10 in this study. These research findings will help the responsible parties in providing early warning information to come out alongside mitigation and precautionary plans to improve the air quality during haze events, as well as protect the public health.

Study Area and Data Collection
Malaysia faces a periodic transboundary haze event due to the large-scale biomass combustion in Sumatra, Indonesia, which happens almost throughout the year and affects the Peninsular Malaysia, Sabah, and Sarawak regions [6]. The haze event usually occurs during the months of January to February and June to August every year [22]. Figure 1 and Table 1  in Microsoft Excel Spreadsheet ® 2016 and analyzed using Statistical Packages for Social Sciences (SPSS ® ) version 25. The parameters used in this study were particulate matter with an aerodynamic diameter less than 10 µm (PM 10 , µg/m 3 ), ambient temperature (T, • C), relative humidity (RH, %), wind speed (WS, km/hr), ground-level ozone (O 3 , ppm), nitrogen oxide (NO, ppm), nitrogen dioxide (NO 2 , ppm), carbon monoxide (CO, ppm), and sulphur dioxide (SO 2 , ppm) to gain a better picture of PM 10 variability during transboundary haze. The relationship between these parameters was examined using bivariate correlation analysis in which the PM 10 concentration was measured using the β-ray attenuation mass monitor (BAM-1020) as manufactured by Met One Instruments Inc. [27]. Meanwhile, the meteorological parameters were measured using Met One 062 sensor for ambient temperature, Met One 083D sensor for relative humidity, and Met One 010C sensor for wind speed [35]. The Teledyne API Model 400/400E instrument hourly through UV absorption (Beer-Lambert) method with a 0.4ppb detection limit and 0.5% of precision level and Model 200A NO/NO 2 /NO x analyzer, which applies chemiluminescence detection principles was used to detect O 3 , NO 2 , and NO concentrations in ambient air, respectively [36,37]. The SO 2 and CO concentrations were monitored and measured by the Teledyne API Model 100A/100E and Teledyne API Model 300/300E, with the lowest detection level of SO 4 at 0.04 ppb by the UV fluorescence method. Meanwhile, the CO level was measured with 0.5% precision and the lowest detection was found at 0.04 ppm by the non-dispersive and infrared absorption (Beer-Lambert) method [37]. Therefore, all instruments were set to have daily calibration using zero air and standard gas concentration to guarantee the quality control and assurance of data monitoring and validation processes, which were revised before they were transferred to the DOE [38]. Missing data might occur due to equipment malfunction and the calibration program, which were thus removed to reduce the bias in the prediction and conservative results [39].
Atmosphere 2020, 11, x FOR PEER REVIEW 4 of 13 concentrations were monitored and measured by the Teledyne API Model 100A/100E and Teledyne API Model 300/300E, with the lowest detection level of SO4 at 0.04 ppb by the UV fluorescence method. Meanwhile, the CO level was measured with 0.5% precision and the lowest detection was found at 0.04 ppm by the non-dispersive and infrared absorption (Beer-Lambert) method [37]. Therefore, all instruments were set to have daily calibration using zero air and standard gas concentration to guarantee the quality control and assurance of data monitoring and validation processes, which were revised before they were transferred to the DOE [38]. Missing data might occur due to equipment malfunction and the calibration program, which were thus removed to reduce the bias in the prediction and conservative results [39].

Multiple Linear Regression
Stepwise Multiple Linear Regression (MLR) statistical models were developed in this study with a 95% confidence interval and the dataset was separated according to 7:3 ratios for model development and validation, respectively [40,41]. Based on the data tabulated in the Microsoft Excel Spreadsheet ® Atmosphere 2020, 11, 289 5 of 14 2016, 70% of the earlier data in this spreadsheet were used for model development, while the remaining 30% were used for model validation. This statistical model can relate the single dependent variable with two or more independent variables and the equation of the MLR model is stated in Equation (1) [27].
where, b i is the regression coefficient (independent variables), ε is the stochastic error associated with the regression.
In the air pollution modelling, the dataset consisted of different types of units and it required normalizing before the development of models in order to improve the accuracy of the numeric computation, which was scaled within the range of 0 to 1. Hence, these normalization steps can interpret all relationships in the data precisely and reduce bias [27,31]. Equation (2) shows the normalization equation used in this study.
where x = (x 1 , . . . , x n ) and z i is normalised data. Variance Inflation Factor (VIF) is the multicollinearity assumption together with regression output, whereby the average of VIF values must be below ten to indicate no multicollinearity problem between the independent variables [27,42]. The VIF equation is shown in Equation (3).
where, VIF i is the variance inflation factor with ith predictor, R i 2 is the determination in a regression of the ith predictor on all other predictors. The Durbin-Watson (DW) test was used to determine the autocorrelation ability of PM 10 concentration during the current hour to predict PM 10 in the subsequent hour. The range values of the test must be between 0 and 4 with a value of 2, showing that the residuals are uncorrelated [27,42]. The DW equation is shown per Equation (4).
where, n is the number of observations, ei = yi − yi; (yi = observed values and yi is the predicted values).
Coefficient of Determination (R 2 ) was one of the indicators used to determine whether the data provided enough evidence to indicate that the overall models contributed sufficient information for the prediction of PM 10 concentrations. It also acts as an indicator to measure the extent to which the prediction models fit the data [27,42]. The R 2 equation is shown per Equation (5).
where, n = total number measurements at a particular site, P i = predicted values, O i = observed values, P = mean of predicted values, O = mean of observed values, S pred = standard deviation of predicted values, and S obs = standard deviation of the observed values.

Performance Indicator
The models were evaluated based on the model's error and accuracy by using several performance indicators, namely Root Mean Square Error (RMSE), Normalized Absolute Error (NAE), and Prediction of Accuracy (PA). The best-fitted model is chosen when it has high accuracy in which the PA is closer Atmosphere 2020, 11, 289 6 of 14 to 1 while the minimal error (i.e., RMSE and NAE) is close to 0 [32,35,40]. Equations (6)-(8) show the performance indicators' formula used in this study.
(a) Root Mean Square Error (c) Prediction of Accuracy where, n = total number of data; P i = predicted values; O i = observed values; P = mean of predicted values; S pred = standard deviation of predicted values; S obs = standard deviation of observed values

Results and Discussion
The maximum daily average of PM 10 concentration during a transboundary haze event in Malaysia in the years of 2014 and 2013 was 995 µg/m 3 , while the minimum value of PM 10 concentration was 150 µg/m 3 in all years as this was the minimum concentration for notification as haze [5,40,43]. Table 2 [33,44]. Combustion activities due to the agriculture sector such as biomass and peatland combustion in Indonesia for land clearing thus creates haze pollution that has further affected the neighboring countries, especially Malaysia. Altogether, the higher concentration of hazardous PM 10 are transported thousands of kilometers by the wind and monsoon [13,17]. The transboundary haze events frequently happen during the dry season in July to October and are extended to southwest monsoon in February to March, which prolongs the combustion activities due to less amount of rainfall and drier condition of the land [5,13].   The Spearman bivariate analysis was applied as the data were not normally distributed and non-parametric in order to see the relationship linking PM 10 with meteorological factors and other gaseous pollutants during the periodic haze events in Malaysia as tabulated in Table 3. The CO (r = 0.512, p < 0.01) showed a strong and positive correlation by increasing the concentration of PM 10 Atmosphere 2020, 11, 289 7 of 14 at the atmosphere. Meanwhile, T (r = 0.055, p < 0.05), SO 2 (r = 0.131, p < 0.01), NO 2 (r = 0.059, p < 0.01), and O 3 (r = 0.046, p < 0.05) were weakly and positively correlated to PM 10 concentration. Similarly, RH was weakly and negatively correlated with r = −0.076, p < 0.01. Open burning from land clearing activities at Sumatra (Indonesia) has been contributing to a high concentration of carbon dioxide in the atmosphere during such periodic haze events [3]. The incomplete combustion of carbon thus generates CO 2 concentration via some chemical reactions with another atmospheric constituent [3,5]. Hence, high emission of CO also causes a high emission of particulate matter and toxic organic air contaminants, which have an adverse impact on human health [45,46]. A statistical model to precisely forecast the PM 10 concentrations during haze events using MLR was founded using 70% of the dataset, while the remaining 30% were used for model validation. The MLR models were developed based on three different prediction hours, which were to predict the first hour of PM 10 concentration up until the next three hours of PM 10 concentration. A summary of the models is tabulated in Table 4. All models were developed from the PM 10 concentration during the transboundary haze event occurring in Malaysia. Based on the models, the first prediction hour of PM 10 , t+1 had a higher coefficient of determination (R 2 ) at 0.638 compared to the second (PM 10 , t+2 ) and third (PM 10 , t+3 ) prediction hour at 0.452 and 0.353, respectively. Furthermore, the VIF values for all three models (i.e., range between 1.006 and 2.149) were lower than 10, which indicated no multicollinearity problem present between the independent variables. Moreover, the DW test showed that the range values for all models were 2.125, 1.201, and 0.932 for PM 10,t+1 , PM1 0, t+2 , and PM 10,t+3 , respectively. Hence, it indicated that all of the models did not have any first-order autocorrelation problem as the range values were still within 0-4 [27,42]. Thus, the PM 10,t+1 forecasting model, which was considered as the best-fit model in this study, revealed acceptable values of R 2 as discovered in the previous study of PM 10 forecasting models by [27,40,47] and higher R 2 values compared to the other two models. For the first prediction hour of model forecasting, the significant predictors for PM 10 , t+1 was CO and NO. The predicted PM 10 , t+1 concentration increased by 0.673 unit when the PM 10 variables increased by one unit, an increase of 0.230 unit by one unit of CO, and a decrease of 0.057 unit by Atmosphere 2020, 11, 289 8 of 14 one unit of NO. Meanwhile, this was also applicable for PM 10 , t+2 : the same variable influenced it with PM 10 , t+1 in which one unit of PM 10 increased one unit of PM 10 , t+2 . There was an increase of 0.141 and decrease of 0.053 PM 10,t+2 by one unit of CO and NO, respectively. Moreover, PM 10 , t+3 was influenced by three different variables, namely PM 10 , RH, and O 3 . The increase in unit for PM 10 and one unit decrease of RH and O 3 led to the increase of 0.568 unit and decrease of 0.047 and 0.046 units of PM 10 , t+3 , respectively. Generally, the biomass combustion activities emit several compounds such as aerosols in the form of organic carbon, black carbon, and inorganic carbon; greenhouse gases such as carbon dioxide (CO 2 ); and photochemical reactive gases like CO and nitrogen oxide (NO x ) [48]. The O concentration represents the dominant species after carbon dioxide and it is used as a tracer of biomass burning plumes in the remote troposphere in the form of smoke particles [19,48].
The normal distribution of residuals is illustrated with zero mean and constant variances for all three models in Figure 2a-c. Meanwhile, the fitted values plot with PM 10 prediction model's residuals show that the residuals are uncorrelated due to the residuals accumulating around the horizontal band, further indicating that the variance is constant and uncorrelated as depicted in Figure 2d-f. Unfortunately, more residuals dispersed away from the horizontal band as the prediction hour increased to the second to third hour of PM 10 prediction time. Hence, this condition would hinder the best fit of the models, which reduced the precision of PM 10 concentration forecasting. For the first prediction hour of model forecasting, the significant predictors for PM10,t+1 was CO and NO. The predicted PM10,t+1 concentration increased by 0.673 unit when the PM10 variables increased by one unit, an increase of 0.230 unit by one unit of CO, and a decrease of 0.057 unit by one unit of NO. Meanwhile, this was also applicable for PM10,t+2: the same variable influenced it with PM10,t+1 in which one unit of PM10 increased one unit of PM10,t+2. There was an increase of 0.141 and decrease of 0.053 PM10,t+2 by one unit of CO and NO, respectively. Moreover, PM10,t+3 was influenced by three different variables, namely PM10, RH, and O3. The increase in unit for PM10 and one unit decrease of RH and O3 led to the increase of 0.568 unit and decrease of 0.047 and 0.046 units of PM10,t+3, respectively. Generally, the biomass combustion activities emit several compounds such as aerosols in the form of organic carbon, black carbon, and inorganic carbon; greenhouse gases such as carbon dioxide (CO2); and photochemical reactive gases like CO and nitrogen oxide (NOx) [48]. The O concentration represents the dominant species after carbon dioxide and it is used as a tracer of biomass burning plumes in the remote troposphere in the form of smoke particles [19,48].
The normal distribution of residuals is illustrated with zero mean and constant variances for all three models in Figure 2a  The predicted daily PM10 concentration was plotted against the observed PM10 concentration by prediction hour based on the 30% of haze chronology event data. Figure 3a−c illustrates the goodnessof-fit of the PM10 forecasting models in Malaysia. The regression lines drawn showing 95% confidence interval and most of the points were accumulated in between lines A and B, which were the upper and lower ranges of 95% confidence interval for the regression model. The accuracy of the predicted model for the next hour (PM10,t+1) was thus proven by having the highest R 2 values of 0.447 compared to models PM10,t+2 (0.186) and PM10,t+3 (0.129). Hence, the increase of prediction hours of the models reduced their respective performance by increasing some of the errors of the prediction models, rendering it less precise [29,49]. The predicted daily PM 10 concentration was plotted against the observed PM 10 concentration by prediction hour based on the 30% of haze chronology event data. Figure 3a-c illustrates the goodness-of-fit of the PM 10 forecasting models in Malaysia. The regression lines drawn showing 95% confidence interval and most of the points were accumulated in between lines A and B, which were the upper and lower ranges of 95% confidence interval for the regression model. The accuracy of the predicted model for the next hour (PM 10,t+1 ) was thus proven by having the highest R 2 values of 0.447 compared to models PM 10,t+2 (0.186) and PM 10,t+3 (0.129). Hence, the increase of prediction hours of the models reduced their respective performance by increasing some of the errors of the prediction models, rendering it less precise [29,49]. The predicted daily PM10 concentration was plotted against the observed PM10 concentration by prediction hour based on the 30% of haze chronology event data. Figure 3a−c illustrates the goodnessof-fit of the PM10 forecasting models in Malaysia. The regression lines drawn showing 95% confidence interval and most of the points were accumulated in between lines A and B, which were the upper and lower ranges of 95% confidence interval for the regression model. The accuracy of the predicted model for the next hour (PM10,t+1) was thus proven by having the highest R 2 values of 0.447 compared to models PM10,t+2 (0.186) and PM10,t+3 (0.129). Hence, the increase of prediction hours of the models reduced their respective performance by increasing some of the errors of the prediction models, rendering it less precise [29,49]. The RMSE, NAE, and PA components were chosen in this study to measure the performance of the models. RMSE and NAE are considered as suitable indexes for determining model accuracy in which the model is noted as having a high accuracy when their values are close to zero, while the PA value is nearest to 1 [32,50,51]. According to a previous study [32], a comparison of the best statistical PM10 forecasting methods with the lowest values of RMSE and NAE and the highest PA value has been conducted to select the best-fit prediction model.   The RMSE, NAE, and PA components were chosen in this study to measure the performance of the models. RMSE and NAE are considered as suitable indexes for determining model accuracy in which the model is noted as having a high accuracy when their values are close to zero, while the PA value is nearest to 1 [32,50,51]. According to a previous study [32], a comparison of the best statistical PM 10 forecasting methods with the lowest values of RMSE and NAE and the highest PA value has been conducted to select the best-fit prediction model. Table 5 depicts the performance indicator values for all PM 10 forecasting models in this study. It showed that the PM 10,t+1 model yielded the lowest value of RMSE (126.728) and NAE (0.325), while it had the highest PA value (0.668) compared to the PM 10,t+2 , and PM 10,t+3 models. Based on the results from this performance error and accuracy measures, the PM 10,t+1 model provided better PM 10 concentration forecasting capacity compared to model PM 10,t+2 , and model PM 10,t+3 in Malaysia during transboundary haze events.

Conclusions
The maximum PM 10  Next, the Spearman correlation analysis showed that CO was strongly and positively correlated with r = 0.512, p < 0.01. This was due to its higher emission through large-scale biomass combustion from neighboring countries. Furthermore, the best-fitted MLR model for PM 10 prediction during transboundary haze events was PM 10, t+1 , which had a higher R 2 of 0.447 compared to PM 10,t+2 (0.186) and PM 10,t+3 (0.129). On the other hand, PM 10,t+1 was the best-fitted model with a higher accuracy of 0.668 and lower values of RMSE (126.728) and NAE (0.325) compared to models PM 10,t+2 and PM 10,t+3 . The development of this model may thus aid the responsible parties in obtaining early information that can improve or initiate strategies in mitigating for better air quality.