4.1. Ordinary Least Squares (OLS) Regression Analysis
Ordinary Least Squares (OLS) regression analysis has been widely applied in air pollution research [28,29,30,31,32], whether to study the relationship between pollutants and their environment or to obtain estimates of pollutant concentrations.
OLS regression was employed for analyzing the collected data due to its straightforward and effective approach. This statistical technique is used to assess the relationship between a dependent variable and multiple predictor variables. The results of the OLS regression are presented as coefficients, which serve as estimates of the impact each independent variable has on the dependent variable. By utilizing OLS regression, the analysis quantifies the extent to which each predictor variable contributes to variations in the dependent variable, providing a clearer understanding of the underlying relationships within the dataset [33].

$$y = b_0 + b_1 X_1 + \dots + b_n X_n + \varepsilon$$

where
y = The predicted value of the dependent variable.
b0 = The y-intercept, representing the value of y when all independent variables are zero.
b1X1 = The regression coefficient (b1) of the first independent variable (X1), indicating the impact of X1 on the predicted y value.
bnXn = The regression coefficient (bn) of the last independent variable (Xn), showing its influence on y.
ε = The model error, representing the unexplained variation in the predicted y value.
In this study, particulate matter (PM) serves as the dependent variable, while temperature, wind, and PCE count act as the independent variables. The B values represent the regression coefficients for temperature, wind, and PCE count, indicating their respective influences on PM levels. Predictor variables were not standardized prior to analysis. This approach was chosen to maintain the physical interpretability of the OLS regression coefficients (representing the change in PM concentration per unit change in the predictor) and because the primary comparison model, Random Forest, is invariant to monotonic feature scaling.
All pairwise correlations fall well below established thresholds associated with multicollinearity (|r| ≥ 0.7), with the highest coefficient observed between wind speed and traffic volume (r = 0.24), reflecting only a weak association (Table 6). Similarly, temperature and humidity demonstrated only a modest negative correlation (r = −0.20). Collectively, these findings confirm that the degree of shared variance among predictors is minimal and does not compromise the robustness or interpretability of the OLS regression estimates.
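A screening of this kind can be sketched in a few lines. The example below is a minimal illustration and not the study's code: the helper names `pearson_r` and `flag_collinear_pairs` and the small synthetic columns are hypothetical, and only the |r| ≥ 0.7 cutoff is taken from the text.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def flag_collinear_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every predictor pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Synthetic placeholder columns (not the study data): x1 and x2 are nearly
# proportional, x3 is only weakly related to either.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "x3": [3.0, 1.0, 4.0, 1.0, 5.0],
}
flagged = flag_collinear_pairs(columns)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.2f}")
```

An empty list from `flag_collinear_pairs` would correspond to the situation reported here, where no predictor pair reaches the cutoff.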
Certain statistical measures, including the coefficient of determination (R2) and the Root Mean Square Error (RMSE), are employed to assess the performance of the model. The corresponding formulas are presented below:
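R2 represents the overall proportion of variance in the dependent variable that is explained by the combination of all predictor variables [33].

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where
$y_i$ = The observed value of the dependent variable for observation i;
$\hat{y}_i$ = The predicted value for observation i;
$\bar{y}$ = The mean of the observed values.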
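The Root Mean Squared Error (RMSE) represents the standard deviation of residuals.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where
n = The number of observations;
$y_i$ = The observed value for observation i;
$\hat{y}_i$ = The predicted value for observation i.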
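Both metrics can be computed directly from paired observed and predicted values. The following is a minimal sketch using the standard definitions (R2 as one minus the ratio of residual to total sum of squares, RMSE as the root of the mean squared residual); the function names and the four-point example are illustrative assumptions and not values from this study.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error of the residuals."""
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    return math.sqrt(ss_res / len(y_true))


# Illustrative values only (not measurements from the study).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(r_squared(observed, predicted), rmse(observed, predicted))
```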
A series of multiple linear regressions were performed, where PM1, PM2.5, and PM10 were the dependent variables, and temperature, wind speed, humidity, and traffic volume were the independent variables. It is important to note that these models were run without an intercept. While an intercept typically represents background concentrations in source apportionment, its interpretation becomes ambiguous when predictors like Temperature and Humidity are included, as ‘zero’ values for these parameters (0 °C, 0%) are physically impossible in the study context (the minimum observed temperature was 14.4 °C). A calculated intercept would, therefore, represent a mathematical extrapolation to non-existent atmospheric conditions rather than a tangible background level. By suppressing the intercept, we focus the regression strictly on the covariance between the observed predictors and PM levels. This approach was deemed appropriate given that the OLS models serve primarily as a baseline to demonstrate the limitations of linear assumptions compared to the non-linear Random Forest model.
The estimated weights of the independent variables were calculated, and the t-statistics are reported in parentheses in Equations (1)–(3).
The results indicate that temperature and humidity have a strong influence on PM levels, showing high significance across all models. Wind speed is significant for PM1 but only marginally significant (p < 0.10) for PM2.5 and PM10. Traffic volume is statistically significant for PM2.5 and PM10, but its effect on PM1 is marginal (p = 0.058).
The t-statistics and p-values indicate the significance and direction of these relationships. Larger absolute t-values denote stronger evidence of an association, while negative t-values reflect inverse relationships, emphasizing the importance of considering both magnitude and significance in environmental modeling.
In the PM1 model, temperature (t = 10.58, p < 0.001) and humidity (t = 6.82, p < 0.001) exhibit highly significant positive effects, while wind speed (t = −3.46, p = 0.001) significantly reduces PM1 concentrations. Traffic volume (t = 1.90, p = 0.058) shows a weaker association, falling just short of the 0.05 significance level.
In the PM2.5 model, temperature (t = 7.28, p < 0.001), humidity (t = 8.08, p < 0.001), and traffic volume (t = 2.77, p = 0.006) are significant contributors, while wind speed (t = −1.81, p = 0.071) suggests a potential inverse relationship despite not meeting the 0.05 significance level.
For PM10, temperature (t = 6.08, p < 0.001), humidity (t = 9.23, p < 0.001), and traffic volume (t = 3.45, p = 0.001) are significant predictors, whereas wind speed (t = −1.72, p = 0.087) indicates a possible negative effect. P-values below 0.05 provide evidence against the null hypothesis of no association, supporting the significance of these predictors for PM levels.
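The no-intercept fit and the t-statistics discussed above can be sketched with NumPy as follows. This is an illustrative sketch, not the study's implementation: the function name `ols_no_intercept`, the synthetic predictor ranges, and the coefficients in `true_b` are all assumptions made for the example. The t-statistic of each coefficient is its estimate divided by its standard error, with the residual variance computed on n − k degrees of freedom.

```python
import numpy as np


def ols_no_intercept(X, y):
    """Fit y = X b + e with no intercept column; return (coefficients, t-statistics)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    sigma2 = residuals @ residuals / (n - k)       # residual variance on n - k d.o.f.
    cov = sigma2 * np.linalg.inv(X.T @ X)          # covariance matrix of the estimates
    std_err = np.sqrt(np.diag(cov))
    return coef, coef / std_err


# Synthetic data (hypothetical ranges, not the study measurements).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(15.0, 35.0, n),     # temperature, deg C
    rng.uniform(0.0, 6.0, n),       # wind speed, m/s
    rng.uniform(30.0, 90.0, n),     # relative humidity, %
    rng.uniform(500.0, 3000.0, n),  # traffic volume, veh/h
])
true_b = np.array([1.5, -2.0, 0.4, 0.005])
y = X @ true_b + rng.normal(0.0, 5.0, n)

coef, t_stats = ols_no_intercept(X, y)
for name, b, t in zip(["temperature", "wind", "humidity", "traffic"], coef, t_stats):
    print(f"{name:12s} b = {b: .4f}  t = {t: .2f}")
```

Because no column of ones is added to `X`, the fitted surface is forced through the origin, which is the modeling choice described for Equations (1)–(3).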
4.2. Random Forest Regression Analysis
Random Forest is a supervised machine learning algorithm used for classification and regression. It builds multiple decision trees from random subsets of data and features, then combines their results to improve accuracy and reduce overfitting. Based on the bagging (Bootstrap Aggregating) technique, it trains each tree independently on bootstrapped samples. The final prediction is an average (for regression) or majority vote (for classification) of all trees. The flow diagram of the Random Forest algorithm is provided in Figure 5. Random Forest does not assume linear relationships, making it effective for complex, non-linear datasets. Its key steps include bootstrap sampling, random feature selection, tree construction, and aggregation of predictions to produce a robust ensemble model. The key Random Forest parameters employed in this study are presented in Table 7.
Random Forest has proven highly effective in environmental and emissions modeling, particularly for air quality prediction, pollutant dispersion, and traffic emission estimation. It surpasses traditional linear models by capturing complex, non-linear interactions among environmental variables. The model’s performance depends on several key parameters, including the number and depth of trees, the number of features per split, the minimum number of samples for nodes, bootstrap sampling, out-of-bag validation, and random states, all optimized to balance complexity, accuracy, and computational efficiency.
Its application is demonstrated in several studies. Random Forest has been applied to estimate high-resolution PM2.5 levels in the North China Plain with remarkable accuracy, even for historical data [34]. Similarly, it has been utilized to predict daily PM2.5 concentrations in urban areas at a 1 × 1 km resolution, outperforming traditional approaches [35]. In a related study, Random Forest was compared with land use regression models for elemental PM components and was found to capture pollutant–land use relationships more accurately [36]. These studies highlight the algorithm’s robustness and adaptability in air quality modeling.
Recent research has also employed other advanced machine learning methods, particularly XGBoost, Artificial Neural Networks (ANNs), and Long Short-Term Memory (LSTM) networks, to improve air-quality forecasting. A 2023 study from Malaysia demonstrated that multi-layer feedforward neural networks can achieve highly accurate PM2.5 predictions, with the Levenberg–Marquardt–trained FBNN model attaining an R2 of 0.98 and the lowest associated error metrics among the evaluated algorithms [37]. A two-stage feature engineering framework employed in a UK study, combining correlation-inspired feature construction with Variational Mode Decomposition, substantially enhanced the performance of an LSTM-based forecasting model for multiple pollutants (NO2, O3, SO2, PM2.5, and PM10), yielding a 13% improvement in R2 and the lowest RMSE and MAE values among the tested configurations [38].
Further demonstrating the power of these techniques, a spatially local XGBoost (SL-XGB) framework integrating high-resolution SARA AOD with locally optimized machine-learning models achieved markedly improved urban-scale PM2.5 estimation in Beijing (R2 ≈ 0.88) relative to standard XGBoost and GWR, demonstrating enhanced capacity to capture both non-linear relationships and spatial heterogeneity in areas with sparse monitoring coverage [39]. A recent study from the Middle East demonstrated that Multilayer Perceptron (MLP) neural networks outperform multiple linear regression in predicting seasonal and intra-annual PM10 and PM2.5 concentrations using meteorological variables and AOD, achieving correlations up to 0.81 and highlighting the strong seasonal dependence and dominant influence of relative humidity on particulate-matter levels [40]. A recent study for Shanghai developed an enhanced XGBoost-based forecasting framework that integrates empirical mode decomposition, model fusion, and spatial optimization techniques, achieving a 17% improvement in goodness of fit and a 28% reduction in RMSE for PM2.5 prediction, while also revealing strong seasonal patterns and clear urban–rural gradients in particulate-matter concentrations [41].
The partial dependence plots (PDPs) derived from Random Forest meteorological-normalization models provide interpretable insights into the physical and chemical drivers of PM10 variability, revealing distinct regimes associated with poor dispersion and secondary aerosol formation that help explain long-term particulate-matter trends in Switzerland [42]. In the present study, partial dependence analyses were likewise conducted to examine possible non-linear relationships between the predictor variables and particulate matter concentrations. Figure 6 presents the PDPs for PM1 across the meteorological and traffic variables, and similar response patterns were observed for PM2.5 and PM10.
Temperature exhibited a distinctly non-linear pattern across all PM fractions (PM1, PM2.5, and PM10), with concentrations decreasing around 22–25 °C and subsequently stabilizing or increasing at higher temperatures, likely reflecting the combined influence of atmospheric mixing and evaporation. Humidity showed a characteristic U-shaped response, where low humidity levels were associated with reduced PM concentrations due to enhanced atmospheric dispersion, while very high humidity corresponded to elevated PM levels, consistent with hygroscopic particle growth. Wind speed demonstrated a consistently negative non-linear effect on all PM metrics, indicating that higher wind speeds promote pollutant dilution and dispersion. Traffic volume displayed a saturating non-linear pattern: PM levels increased with traffic load up to approximately 2000 veh/h, beyond which concentrations plateaued or slightly decreased, suggesting that emissions dominate under moderate traffic, whereas intensified turbulence under very high traffic density enhances dispersion. These findings underscore the ability of the Random Forest model to capture complex, non-linear atmospheric and emission-driven dynamics that are not adequately represented by linear regression approaches.
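For readers unfamiliar with how such curves are produced, a one-dimensional partial dependence curve can be computed by fixing one predictor at each grid value for every row, predicting, and averaging. The sketch below is model-agnostic and purely illustrative: `partial_dependence_1d` and `toy_predict` are hypothetical names, and the toy response stands in for a fitted model's `predict` method (for instance, that of a fitted Random Forest regressor).

```python
import numpy as np


def partial_dependence_1d(predict, X, feature_index, grid):
    """Average prediction when one feature is forced to each grid value in turn."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value      # overwrite the feature for every row
        averages.append(np.mean(predict(X_mod)))
    return np.array(averages)


def toy_predict(X):
    """Stand-in for a fitted model: non-linear in feature 0, linear in feature 1."""
    return (X[:, 0] - 3.0) ** 2 + 2.0 * X[:, 1]


rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 6.0, size=(100, 2))   # synthetic predictors
grid = np.linspace(0.0, 6.0, 7)
pd_curve = partial_dependence_1d(toy_predict, X_demo, feature_index=0, grid=grid)
print(np.round(pd_curve, 2))
```

Plotting `pd_curve` against `grid` gives the kind of response curve shown in a PDP; the curve's shape, not its absolute level, is what reveals non-linearity.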
Figure 7 shows the R2 values for Random Forest models predicting PM1, PM2.5, and PM10 as the number of trees increases from 50 to 500.
PM10 exhibits the highest R2 values across all tree counts, showing a consistent upward trend and reaching its peak (~0.93) at 500 trees. PM2.5 starts at a relatively high R2 (~0.93) with minor fluctuations but generally improves as more trees are added, also peaking at around 0.93. PM1, which starts with the lowest R2 (~0.917) at 50 trees, displays a steady improvement, reaching approximately 0.926 at 500 trees. The most significant improvement for PM1 was observed between 50 and 150 trees before the values stabilized.
Overall, increasing the number of trees enhances the model’s R2 values for all PM metrics, though the improvements diminish beyond 200–300 trees. Notably, PM10 attains the highest R2 at every tree count, while PM1 shows the most noticeable improvement as trees are added. The results suggest that using around 300–400 trees strikes a balance between computational cost and model accuracy, as the R2 gains become marginal beyond this range.
It is important to note that the R2 values shown in Figure 7 were calculated on the training dataset to illustrate model performance as the number of trees increased.
Using a manually configured high-capacity Random Forest regressor (800 trees, unrestricted depth, and a square-root feature selection strategy), the model achieved high coefficients of determination across all pollutants (R2 = 0.93–0.94) when evaluated on the training dataset. These results indicate that the model can explain over 92% of the variance in PM1, PM2.5, and PM10 concentrations. The low RMSE values (6–7 µg/m3) further reflect a strong in-sample fit. This high-capacity configuration was chosen deliberately to approximate previously observed high training R2 values (~0.93), allowing the model to fully exploit nonlinearities in the data without any hyperparameter tuning for the cross-validation stage.
However, these performance values are inherently optimistic, as the model was evaluated on the same data used for training. The hyperparameter choices—particularly the large number of trees and absence of depth constraints—enabled the model to effectively memorize the training set. Therefore, these R2 values represent the upper bound of model performance rather than a realistic measure of predictive accuracy on unseen data.
Overall, the results show that Random Forest regression provided higher R2 values compared to OLS, indicating better performance in capturing the variability in PM levels. This method is suitable when non-linear interactions and complex dependencies exist in the dataset. The RMSE values confirm that the Random Forest model has significantly lower errors compared to OLS, further demonstrating its superior accuracy and performance. The analysis also includes side-by-side plots (Figure 8) comparing estimated PM levels from OLS and Random Forest models to actual PM levels for each PM type, clearly demonstrating the improved accuracy of Random Forest predictions over OLS (see Table 8).
To obtain an unbiased estimate of the model’s out-of-sample predictive performance, a 5-fold cross-validation strategy was employed for each particulate matter (PM) metric. The entire dataset was randomly partitioned into five mutually exclusive and approximately equal-sized folds using a randomized K-fold procedure (KFold with five splits, shuffling enabled, and a fixed random seed for reproducibility).
For each iteration, the model was trained on four folds (80% of the data) and evaluated on the remaining held-out fold (20%), ensuring no overlap between training and validation samples. Predictions were generated for each held-out fold, and the coefficient of determination (R2) and root mean square error (RMSE) were computed. This procedure was repeated five times so that each observation was used exactly once for validation. The resulting fold-wise R2 and RMSE values were aggregated, and both the mean values and the fold-specific distributions were reported; the distributions were summarized using boxplots (Figure 9) to assess predictive stability and variance across folds. The cross-validated results are reported in Table 9.
This approach provides a more reliable evaluation than a single train–test split, as it mitigates optimistic bias associated with training-set evaluation and reduces variance arising from arbitrary data partitioning. Cross-validation is particularly appropriate for environmental datasets such as PM measurements, where the number of observations may be limited and temporal or meteorological fluctuations can affect model generalization. By applying the same high-capacity Random Forest configuration across multiple folds, the reported R2 values reflect the model’s genuine ability to generalize to unseen subsets of the data distribution rather than its capacity to memorize the training data.
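The contrast between training-set and cross-validated scores can be reproduced in outline with scikit-learn. The sketch below is an illustration under stated assumptions, not the study's script: the function name `train_vs_cv_scores`, the seed value 42, and the synthetic data are hypothetical, while the configuration (unrestricted depth, square-root feature selection, shuffled 5-fold `KFold`) follows the description in the text. The demo call uses 100 trees rather than 800 only to keep the run short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold


def train_vs_cv_scores(X, y, n_estimators=800, n_splits=5, seed=42):
    """Return training-set and fold-wise cross-validated R2 / RMSE for one RF setup."""
    def make_model():
        return RandomForestRegressor(
            n_estimators=n_estimators,
            max_depth=None,          # unrestricted depth (high-capacity setup)
            max_features="sqrt",     # square-root feature selection
            random_state=seed,
        )

    # In-sample fit: trained and scored on the same data (optimistic by design).
    full_model = make_model().fit(X, y)
    train_pred = full_model.predict(X)
    scores = {
        "train_r2": r2_score(y, train_pred),
        "train_rmse": float(np.sqrt(mean_squared_error(y, train_pred))),
        "cv_r2": [],
        "cv_rmse": [],
    }

    # Out-of-sample fit: each fold is held out exactly once.
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores["cv_r2"].append(r2_score(y[test_idx], pred))
        scores["cv_rmse"].append(float(np.sqrt(mean_squared_error(y[test_idx], pred))))
    return scores


# Synthetic, noisy, non-linear data (not the study measurements).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 10.0 * np.sin(2.0 * np.pi * X[:, 0]) + 5.0 * X[:, 1] ** 2 + rng.normal(0.0, 3.0, 300)

scores = train_vs_cv_scores(X, y, n_estimators=100)
print("train R2:", round(scores["train_r2"], 3))
print("mean CV R2:", round(float(np.mean(scores["cv_r2"])), 3))
```

The fold-wise lists in `scores["cv_r2"]` and `scores["cv_rmse"]` are the quantities one would pass to a boxplot to inspect stability across folds.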
Table 9 presents the results of the Random Forest (RF) model under 5-fold cross-validation and 100% training evaluation, while Table 8 provides a comparative summary of the RF and Ordinary Least Squares (OLS) regression models. While hyperparameter tuning (e.g., limiting tree depth) could reduce the gap between training and cross-validated performance, we deliberately retained the high-capacity configuration to explicitly demonstrate the contrast between model fitting potential and true generalization power, thereby highlighting the necessity of cross-validation in environmental modeling.
Together, these results highlight both the superior in-sample fitting ability of RF compared to OLS and the discrepancy between training and cross-validated performance for the RF model.
In Table 8, the RF model achieves high R2 values on the training data for all PM metrics (0.93 for PM1, 0.93 for PM2.5, and 0.94 for PM10) with low RMSE values (6.02–7.03 µg/m3), indicating an excellent fit to the training dataset. However, the corresponding cross-validated R2 values are substantially lower (0.47 for PM1, 0.45 for PM2.5, and 0.51 for PM10), and the RMSE values increase to 16.44–19.51 µg/m3. This gap between training and cross-validation performance reflects the overfitting tendency of the high-capacity RF configuration: while the model is able to explain over 93% of the variance in the training data, its ability to generalize to unseen data is more modest, with R2 values in the 0.45–0.51 range.
The implication for model generalizability is that the reported training metrics (R2 > 0.9) represent an upper theoretical bound of explanatory power, effectively capturing local noise and specific traffic-meteorology interactions. In contrast, the cross-validated metrics (R2 ≈ 0.5) serve as the realistic indicator of operational predictive performance on unseen data. Consequently, while the unconstrained Random Forest model proves superior to OLS in detecting non-linear signals, any practical deployment for future forecasting would strictly require the cross-validated performance estimates to be used as the baseline for accuracy expectations.
When compared with OLS results, the Random Forest clearly outperforms OLS in terms of in-sample predictive accuracy. OLS exhibits low R2 values (0.16 for PM1, 0.11 for PM2.5, and 0.13 for PM10) and high RMSE values (20.61–25.68 µg/m3), indicating a weak ability to model the complex relationships between meteorological, traffic, and pollutant concentration variables. In contrast, the RF model achieves R2 values above 0.93 and reduces the RMSE by approximately 70% across all PM metrics. This performance gap illustrates the strength of nonlinear ensemble methods such as Random Forest in capturing complex, non-additive interactions that linear models cannot accommodate.
Interestingly, even under cross-validation, the RF model maintains substantially higher R2 values than OLS, indicating that despite some overfitting, the RF model generalizes better than OLS to unseen subsets of the data. Among the three pollutants, PM10 shows the highest cross-validated R2 (0.51), suggesting a relatively more structured relationship between the predictors and PM10 levels compared to PM1 and PM2.5.
Overall, these findings underscore two key points:
Random Forest provides significant performance improvements over OLS, both in terms of R2 and RMSE, demonstrating its capacity to model nonlinear environmental processes more effectively.
Cross-validation is essential for obtaining realistic performance estimates, as training-set results alone can be misleadingly optimistic due to overfitting, particularly in flexible models like RF.
4.3. Wind Speed Threshold Analysis
The wind speed threshold was identified using piecewise regression, where the dataset was split at various wind speeds, and the best threshold was chosen based on the lowest mean squared error.
The relationship between wind speed and particulate matter (PM) concentrations was examined using a threshold analysis to identify critical wind speeds beyond which pollutant levels decrease markedly. This analysis was conducted in two stages: an initial exploratory visual inspection, followed by a formal piecewise linear regression to estimate the wind speed threshold in a data-driven manner. Scatter plots of PM1, PM2.5, and PM10 concentrations against wind speed revealed a non-linear pattern, with concentrations remaining relatively stable at low wind speeds and decreasing sharply beyond approximately 3 m/s. To statistically determine this breakpoint, a piecewise linear regression model was fitted iteratively for each unique wind speed value T in the dataset. Specifically, the model assumes that the relationship between wind speed (x) and PM concentration (y) can be represented as two linear segments joined at the threshold T:

$$y_i = \begin{cases} \beta_0 + \beta_1 x_i + \varepsilon_i, & x_i \le T \\ \beta_0 + \beta_1 T + \beta_2 (x_i - T) + \varepsilon_i, & x_i > T \end{cases}$$

where β0 is the intercept, β1 and β2 are the slopes below and above the threshold, respectively, and εi is the error term for observation i. For computational implementation, wind speed was decomposed into two predictor variables (“xbelow” and “xabove”) defined as

$$x_{\text{below},i} = \min(x_i, T), \qquad x_{\text{above},i} = \max(x_i - T, 0)$$

allowing the model to be expressed in a linear form suitable for ordinary least squares estimation:

$$y_i = \beta_0 + \beta_1 x_{\text{below},i} + \beta_2 x_{\text{above},i} + \varepsilon_i$$
For each candidate threshold T, the model was fitted, and its performance was evaluated using the Mean Squared Error (MSE) criterion,

$$\mathrm{MSE}(T) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i(T) \right)^2$$

where $\hat{y}_i(T)$ denotes the predicted value under the model with threshold T and n is the number of observations. The optimal threshold was selected as the value of T that minimized the MSE, thus identifying the breakpoint that best explained the observed non-linear relationship. This procedure was repeated separately for PM1, PM2.5, and PM10, yielding optimal thresholds of 3.0 m/s, 3.2 m/s, and 3.2 m/s, respectively (Table 10). These results indicate that when wind speeds exceed approximately 3 m/s, the concentrations of all measured PM fractions tend to decrease significantly, reflecting the physical process by which increased wind speeds enhance atmospheric mixing and pollutant dispersion. This threshold-based approach provides a robust statistical framework for identifying dispersion thresholds, which can inform both air quality modeling and regulatory strategies by delineating wind conditions under which pollutant accumulation is likely to occur.
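The threshold search can be sketched as a loop over candidate breakpoints. The example below is a minimal illustration and not the study's code: it assumes a continuous hinge coding of the two predictors (x_below = min(x, T), x_above = max(x − T, 0)), which is one common way to join two segments at T, and the function name `fit_threshold` and the noise-free synthetic data are hypothetical.

```python
import numpy as np


def fit_threshold(x, y):
    """Try every unique x as breakpoint T; return (best_T, best_mse, best_beta)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = None
    for T in np.unique(x):
        # Assumed hinge coding: the two segments meet at x = T.
        x_below = np.minimum(x, T)
        x_above = np.maximum(x - T, 0.0)
        design = np.column_stack([np.ones_like(x), x_below, x_above])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        mse = float(np.mean((y - design @ beta) ** 2))
        if best is None or mse < best[1]:
            best = (float(T), mse, beta)
    return best


# Synthetic wind speed / PM data: flat at 50 up to 3 m/s, then falling by 10 per m/s.
wind = np.round(np.linspace(0.0, 6.0, 61), 1)
pm = 50.0 - 10.0 * np.maximum(wind - 3.0, 0.0)

best_T, best_mse, best_beta = fit_threshold(wind, pm)
print("threshold:", best_T, "slopes below/above:", np.round(best_beta[1:], 3))
```

With real, noisy measurements the MSE curve over T is typically shallow near its minimum, so nearby thresholds (such as 3.0 and 3.2 m/s) may fit almost equally well.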
The plots (Figure 10) show the PM levels against wind speed, with a red dashed line indicating the identified threshold wind speeds. PM1 levels tend to stabilize and decrease when the wind speed surpasses 3.0 m/s. PM2.5 levels show a noticeable drop after wind speeds exceed 3.2 m/s. Similarly, PM10 levels significantly decrease once the wind speed goes beyond 3.2 m/s.
When wind speeds exceed these threshold values, the concentration of particulate matter (PM1, PM2.5, or PM10) tends to decrease significantly. This aligns with the physical expectation that higher wind speeds disperse pollutants more effectively. Below the thresholds, PM levels are higher and more variable, indicating that lower wind speeds are insufficient to disperse airborne particles effectively. Above these thresholds, PM levels consistently decrease, suggesting that wind speeds at or above ~3.0–3.2 m/s are effective in clearing particulate matter from the air.
For validation purposes, the threshold analysis was checked visually against the variation in PM levels and wind speed over the course of the day. Figure 11 illustrates the hourly variation in PM1, PM2.5, and PM10 levels alongside wind speed at Shahrah-e-Faisal, revealing an inverse relationship between wind speed and particulate matter concentrations. PM levels peak during the morning hours, particularly around 8:00–9:00 AM, coinciding with high traffic congestion and relatively low wind speeds. Throughout the day, PM levels fluctuate moderately but begin to decline significantly after 1:00 PM, when wind speed increases sharply. Wind speed peaks around 2:20 PM and continues to show periodic surges in the evening, contributing to a noticeable reduction in PM concentrations. This trend supports the threshold analysis, demonstrating that wind speeds above approximately 3–4.5 m/s effectively disperse airborne particulate matter and highlighting the crucial role of wind speed in mitigating air pollution at this arterial road.
In conclusion, these threshold values could serve as benchmarks for urban air quality management, indicating that maintaining wind speeds above these levels (through strategic city planning or monitoring) can significantly improve air quality.