# Conceptual Model for Determining the Statistical Significance of Predictive Indicators for Bus Transit Demand Forecasting

## Abstract

## 1. Introduction

## 2. Literature Review

## 3. Methodology of Research

- Structural analysis of available big data systems: In order to identify relevant statistical indicators that can potentially be used for passenger demand modelling on the Prizren–Zagreb international bus route, a structural analysis of three big data systems was conducted, including the World Bank database [2], the International Monetary Fund’s World Economic Outlook database [3], and the Kosovo Agency of Statistics’ ASKdata platform [1].
- Extraction of relevant statistical indicators from available data sources: Based on the structural analysis of the considered big data systems, 51 relevant statistical indicators and 1 output variable (the annual number of passengers transported on the Prizren–Zagreb international bus line) were identified. The data for these statistical indicators were extracted from the big data systems and stored in an MS Excel data file.
- Creation and preparation of the database with relevant statistical indicators for import into the RStudio programming environment [15]: The data exported from the big data systems were structured into the following five data columns: (1) Predictor ID; (2) Name of predictor; (3) Year; (4) Absolute value of predictor variable; and (5) Relative rate of change of predictor variable. The extracted database (Table S1 in Supplementary Material) contained a total of 928 records with absolute values and annual relative rates of change of observed/predicted statistical indicators for the period between 1953 and 2061.
- Procedure for filtering and aggregating data: The created database was filtered to include only the values of statistical indicators for the period between 2015 and 2021, for which the data on the annual number of passengers transported on the Prizren–Zagreb international bus line were available. Based on the first iteration of the data filtering procedure, an output dataset (Table S2 in Supplementary Material) with 34 data column vectors for potentially independent variables and 1 column vector for the dependent variable was extracted from the original database, containing a total of 201 numerical values. The filtered data columns were then aggregated into 11 statistical groups according to the type (transportation, geospatial, demographic or socioeconomic) of the considered statistical indicators.
- Partial linear regression and correlation analysis between each independent and dependent variable: In order to determine whether each of the statistical indicators considered has a linear correlation with the demand for bus transport, a total of 33 partial linear models were first created in the RStudio programming environment using the lm() function. The relevant results of the partial correlation and regression analysis, including the values of Pearson’s correlation coefficient (r), the coefficient of determination (r^{2}) indicating the strength of the partial correlation, and the p-value indicating the statistical significance of the created linear models, were then extracted separately for each group of models using the summary() function and stored in separate data vectors. Finally, based on the defined logical rules and the min() and max() functions, the partial regression models with the lowest p-values and the highest correlation coefficients were determined. The statistical indicators whose regression models had a high coefficient of determination (r^{2}) and a statistically significant p-value were selected for the further steps of statistical analysis. All other statistical indicators with non-significant or weak linear correlation were removed from the input database at this stage.
- Multicollinearity test: All statistical indicators found to be potentially significant were tested for multicollinearity to check whether any of the independent predictor variables considered were highly correlated with each other. To test all potential predictor variables for multicollinearity, the intercorrelation and variance inflation factor (VIF) matrices were created by calling the functions cor() and vif(), respectively. To identify predictor variables that were highly correlated with each other, threshold values of 0.90 for the intercorrelation matrix and 5 for the variance inflation factor matrix were used. Within each group of multicollinear predictors, the predictor with the highest correlation strength and statistical significance with respect to the dependent variable (bus travel demand) was retained for the further steps of the statistical analysis.
- Homoscedasticity test: The remaining potential predictor variables were tested for homoscedasticity using the studentized Breusch–Pagan statistic, which performs an auxiliary regression analysis between each predictor considered and its squared residuals. Predictor variables can be considered homoscedastic if their variance is equal or similar over the entire range of their possible values. The homoscedasticity test was performed in the RStudio [15] programming environment using the lmtest::bptest() function. The predictor variables for which the p-value of the Breusch–Pagan test was greater than 0.05 were selected for the further steps of the statistical analysis.
- Autocorrelation test: Based on the autocorrelation test, the correlation strength between the individual values of the predictor variables measured at different points in time was determined to identify their degree of periodicity, i.e., the patterns or trends across the time series of the considered statistical indicators. The predictor variables that can be used efficiently in the multiple correlation model should not be highly correlated with their historical values. The autocorrelation test for each of the considered statistical indicators was performed based on the Durbin–Watson statistic by calling the durbinWatsonTest() function in the RStudio programming environment [15]. Based on the performed Durbin–Watson test, only those potential predictor variables for which the test yielded a p-value greater than 0.05 were considered free of significant autocorrelation and included in the further steps of analysis.
- Multivariate normality test: Multivariate normality exists when the residuals determined for linear regression models developed based on individual predictor variables are normally distributed. To test the considered statistical indicators for multivariate normality, the Shapiro–Wilk normality test was performed using the shapiro.test() function. After this test, all potential predictor variables for which the p-value was above the threshold of 0.05 were considered normally distributed and included in the final stage of analysis.
- Stepwise regression procedure: To determine the optimal mathematical formulation of the multiple regression model, a stepwise forward regression procedure was used. This procedure starts with an empty multiple regression model with no predictor variables and then iteratively adds the predictor variables that contribute most to the overall precision and confidence of the predictive model. The iterative addition procedure was terminated when the performance of the multiple regression model could not be improved significantly by adding new predictor variables to the regression equation. The stepwise forward regression was initiated with the ols_step_forward_p() function, with the blank model as the starting point and a p-value threshold of 0.05.
- Establish a prioritized list of statistically significant predictor variables and determine primary (optimal) and alternative mathematical formulations of passenger demand prediction models based on multiple linear regression: All predictor variables found to be statistically significant in the conducted statistical tests were prioritized based on their level of significance and stored in a separate table in the R programming environment. The primary (optimal) and alternative mathematical formulations of the multiple regression model obtained from the stepwise forward regression were also stored in a separate data table and prioritized according to the relevant performance parameters of the model, including coefficients of determination, adjusted coefficients of determination, Akaike information criterion (AIC) values, root mean square error (RMSE) values, and p-values.
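
The partial regression screening described in the list above can be illustrated with a short sketch. The study itself used lm() and summary() in R; the Python below is only a stand-in that fits one simple least-squares model per indicator and picks the indicator with the highest r^{2} within a group. The function name `fit_simple_ols`, the indicator names, and all numbers are hypothetical and are not taken from the study data.

```python
import math


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson's correlation coefficient
    return {"slope": slope, "intercept": intercept, "r": r, "r2": r * r}


# Hypothetical toy data: one demand series and one group of two indicators.
demand = [10.0, 12.0, 14.0, 16.0, 18.0]
group = {
    "indicator_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "indicator_b": [5.0, 1.0, 4.0, 2.0, 3.0],
}

# One partial model per indicator, then the best indicator of the group.
fits = {name: fit_simple_ols(x, demand) for name, x in group.items()}
best = max(fits, key=lambda name: fits[name]["r2"])
```

Selecting by `max()` over r^{2} mirrors the min()/max() rule mentioned in the text; a p-value criterion would be applied alongside it in practice.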

## 4. Discussion of the Results

[…] (r^{2}) and the p-value, respectively. All predictor variables that did not meet these conditions were considered nonlinearly correlated and unsuitable for the development of a linear multiple regression model and were therefore removed from the input data set. The results of the partial correlation and regression analysis are presented both in tabular form and in the form of comparative scatter plots. The coefficient of determination values and p-values obtained for each potential predictor variable were grouped into 11 statistical groups according to the type of statistical indicators (transportation, spatial, demographic and socioeconomic indicators) and stored in separate data vectors. Finally, in order to select the most significant predictor within each of the defined statistical groups, the created data vectors containing r^{2} and p-values were analysed to identify statistical indicators with the highest coefficient of determination and the lowest p-value. Based on this procedure, the five most significant potential predictors were selected for further steps of statistical analysis. Examples of comparative scatter plots with fitted linear regression functions between selected predictor variables and dependent variables are shown in Figure 1.

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Kosovo Agency of Statistics. ASKdata. Available online: https://askdata.rks-gov.net/pxweb/en/ASKdata/ (accessed on 21 December 2022).
2. World Bank. The World Bank DataBank. Available online: https://databank.worldbank.org/ (accessed on 21 December 2022).
3. International Monetary Fund. World Economic Outlook Database. Available online: https://www.imf.org/en/Publications/WEO/weo-database/2022/April/download-entire-database (accessed on 21 December 2022).
4. Lyu, T.; Xu, M.; Zhang, J.; Wang, Y.; Yang, L.; Gao, Y. Influential Factor Analysis and Prediction on Initial Metro Network Ridership in Xi’an, China. J. Adv. Transp. **2022**, 2022, 1–18.
5. Toole, J.L.; Colak, S.; Sturt, B.; Alexander, L.P.; Evsukoff, A.; González, M.C. The Path Most Traveled: Travel Demand Estimation Using Big Data Resources. Transp. Res. Part C Emerg. Technol. **2015**, 58, 162–177.
6. Bernardin, V.L.; Ferdous, N.; Sadrsadat, H.; Trevino, S.; Chen, C.-C. Integration of National Long-Distance Passenger Travel Demand Model with Tennessee Statewide Model and Calibration to Big Data. Transp. Res. Rec. **2017**, 2653, 75–81.
7. Molloy, J.; Moeckel, R. Improving Destination Choice Modeling Using Location-Based Big Data. ISPRS Int. J. Geo-Inf. **2017**, 6, 291.
8. Llorca, C.; Ji, J.; Molloy, J.; Moeckel, R. The Usage of Location Based Big Data and Trip Planning Services for the Estimation of a Long-Distance Travel Demand Model. Predicting the Impacts of a New High Speed Rail Corridor. Res. Transp. Econ. **2018**, 72, 27–36.
9. Xiang, Y.; Xu, C.; Yu, W.; Wang, S.; Hua, X.; Wang, W. Investigating Dominant Trip Distance for Intercity Passenger Transport Mode Using Large-Scale Location-Based Service Data. Sustainability **2019**, 11, 5325.
10. Ye, Y.; Chen, L.; Xue, F. Passenger Flow Prediction in Bus Transportation System Using ARIMA Models with Big Data. In Proceedings of the 2019 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), IEEE, Guilin, China, 17–19 October 2019; pp. 436–443.
11. Cyril, A.; Mulangi, R.H.; George, V. Bus Passenger Demand Modelling Using Time-Series Techniques and Big Data Analytics. Open Transp. J. **2019**, 13, 41–47.
12. Zhao, Y.; Zhang, H.; An, L.; Liu, Q. Improving the Approaches of Traffic Demand Forecasting in the Big Data Era. Cities **2018**, 82, 19–26.
13. Khunsri, K.; Panichpapiboon, S. A Big Data Analysis on Efficiency of Bangkok Taxi System. In Proceedings of the ECTI-CON 2021: 2021 18th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology: Smart Electrical System and Technology, IEEE Proceedings, Chiang Mai, Thailand, 19–22 May 2021; pp. 39–42.
14. Xiong, G.; Li, Z.; Wu, H.; Chen, S.; Dong, X.; Zhu, F.; Lv, Y. Building Urban Public Traffic Dynamic Network Based on CPSS: An Integrated Approach of Big Data and AI. Appl. Sci. **2021**, 11, 1109.
15. RStudio Team. RStudio: Integrated Development Environment for R 2022. Available online: https://www.rstudio.com/categories/integrated-development-environment/ (accessed on 21 December 2022).

**Scheme 1.** Flow chart of the developed conceptual procedural framework for testing the statistical significance of potential predictor variables (statistical indicators) contained in big data systems and for developing a multiple regression prediction model.

**Figure 1.** Example of comparative scatter plots with fitted linear regression functions obtained from the partial correlation and regression analysis performed between potential predictor variables and passenger demand on the observed bus route.

**Figure 2.** Multivariate normality test results: comparative quantile–quantile plots produced for selected potential predictor variables.

**Figure 3.** Comparative plots of actual (with the COVID-19 pandemic) and simulated (without the COVID-19 pandemic) values of the two selected predictors and the output variable.

**Figure 4.** Regression plane between the two most statistically significant predictor variables and passenger demand on the Prizren–Zagreb international bus route.

**Table 1.** Summary of results of partial linear correlation and regression analysis conducted between potential predictor variables (statistical indicators) and travel demand on the observed bus route.

Potential Predictor | R-Square | Adjusted R-Square | Residual Standard Error | Significance p-Value | Status |
---|---|---|---|---|---|
Population (Kosovo) | 0.6418 | 0.5701 | 1158 | 0.03035 | Selected |
Population (Priština) | 0.4714 | 0.2952 | 1371 | 0.2004 | Eliminated |
Population (Prizren) | 0.5432 | 0.391 | 1275 | 0.1553 | Eliminated |
Population (Priština + Prizren) | 0.4967 | 0.329 | 1338 | 0.1838 | Eliminated |
Gross Income (Kosovo) | 0.8049 | 0.7659 | 854.5 | 0.006161 | Selected |
Net Income (Kosovo) | 0.8104 | 0.7724 | 842.4 | 0.005724 | Selected |
Net Income Primary | 0.5606 | 0.4727 | 1282 | 0.05282 | Eliminated |
Net Income Secondary | 0.6223 | 0.5468 | 1189 | 0.03498 | Selected |
Net Income Tertiary | 0.6709 | 0.6051 | 1110 | 0.02419 | Selected |
Gasoline price | 0.3957 | 0.2749 | 1504 | 0.1302 | Eliminated |
Diesel price | 0.4309 | 0.3171 | 1459 | 0.1093 | Eliminated |
Registered vehicles | 0.7586 | 0.7104 | 950.3 | 0.0107 | Selected |
Border crossings | 0.8948 | 0.8597 | 611.9 | 0.01498 | Selected |
Border crossings (cars) | 0.7796 | 0.7062 | 885.4 | 0.04721 | Selected |
Border crossings (bus) | 0.9452 | 0.9269 | 441.7 | 0.005542 | Selected |
Tourists (Kosovo) | 0.4405 | 0.254 | 1173 | 0.2219 | Eliminated |
Tourists (Prizren) | 0.481 | 0.308 | 1129 | 0.194 | Eliminated |
Night stays (Kosovo) | 0.5723 | 0.4297 | 1025 | 0.1389 | Eliminated |
Night stays (Priština) | 0.2574 | 0.00986 | 1129 | 0.3829 | Eliminated |
Night stays (Prizren) | 0.3774 | 0.1699 | 1237 | 0.2703 | Eliminated |
Foreign Tourists (Kosovo) | 0.3765 | 0.1686 | 1238 | 0.271 | Eliminated |
Foreign Tourists (Priština) | 0.5925 | 0.4566 | 1001 | 0.128 | Eliminated |
Foreign Tourists (Prizren) | 0.3883 | 0.1845 | 1226 | 0.2614 | Eliminated |
Night stays foreign (Kosovo) | 0.5183 | 0.3578 | 1088 | 0.1702 | Eliminated |
Night stays foreign (Priština) | 0.7392 | 0.6522 | 800.6 | 0.06171 | Eliminated |
Night stays foreign (Prizren) | 0.3073 | 0.07646 | 1305 | 0.3322 | Eliminated |
Croatian Tourists (Kosovo) | 0.1564 | −0.1249 | 1440 | 0.51 | Eliminated |
Croatian Tourists Night stays | 0.4763 | 0.3018 | 1134 | 0.1971 | Eliminated |
GDP (Kosovo) | 0.852 | 0.8224 | 744.1 | 0.003027 | Selected |
GDP per capita (Kosovo) | 0.8528 | 0.8234 | 742.1 | 0.002985 | Selected |
Consumer Price Index (CPI) | 0.8526 | 0.8232 | 742.6 | 0.002994 | Selected |
Bank Transactions (Kosovo) | 0.9199 | 0.8398 | 702.8 | 0.1827 | Selected |
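
For a simple regression with one predictor, the adjusted R-square and the significance p-value in Table 1 follow from the R-square and the number of observations alone. The Python sketch below (not the authors' R code) shows that arithmetic for the Population (Kosovo) row. The sample size n = 7 is an assumption based on the seven annual values in the 2015–2021 window; other rows appear to rest on fewer observations. The p-value uses the closed-form series for Student's t distribution with integer degrees of freedom, and the function names are hypothetical.

```python
import math


def adjusted_r2(r2, n, k=1):
    """Adjusted R-square for n observations and k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def t_two_sided_p(t, df):
    """Two-sided p-value of Student's t for integer df (closed-form series)."""
    theta = math.atan(abs(t) / math.sqrt(df))
    s, c = math.sin(theta), math.cos(theta)
    total = 0.0
    if df % 2 == 1:
        # Odd df: A = (2/pi) * (theta + sin * [cos + (2/3) cos^3 + ...])
        term = c
        for j in range(1, df, 2):
            total += term
            term *= c * c * (j + 1) / (j + 2)
        inside = (2.0 / math.pi) * (theta + s * total)
    else:
        # Even df: A = sin * [1 + (1/2) cos^2 + (1*3)/(2*4) cos^4 + ...]
        term = 1.0
        for j in range(0, df - 1, 2):
            total += term
            term *= c * c * (j + 1) / (j + 2)
        inside = s * total
    return 1.0 - inside  # probability mass outside [-t, t]


def simple_regression_p(r2, n):
    """p-value of the slope in a one-predictor model, from R-square and n."""
    df = n - 2
    t = math.sqrt(r2 * df / (1.0 - r2))
    return t_two_sided_p(t, df)


# Population (Kosovo) row of Table 1; n = 7 is an assumption (2015-2021).
adj = adjusted_r2(0.6418, n=7)
p = simple_regression_p(0.6418, n=7)
keep = p < 0.05  # significance level used for the screening
```

Under the stated assumption, `adj` comes out at about 0.570 and `p` at about 0.030, in line with the published row.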

**Table 2.** Intercorrelation matrix produced based on the results of the correlation analysis conducted between pairs of selected potential predictor variables (statistical indicators).

Predictor | Population (Kosovo) | Net Income (Kosovo) | Registered Vehicles | Border Crossings (Bus) | GDP Per Capita |
---|---|---|---|---|---|
Population (Kosovo) | 1 | 0.8461 | 0.585 | 0.7707 | 0.8894 |
Net Income (Kosovo) | 0.8461 | 1 | 0.7902 | 0.8663 | 0.9313 |
Registered vehicles | 0.585 | 0.7902 | 1 | 0.8283 | 0.8421 |
Border crossings (bus) | 0.7707 | 0.8663 | 0.8283 | 1 | 0.8874 |
GDP per capita | 0.8894 | 0.9313 | 0.8421 | 0.8874 | 1 |

**Table 3.** Variance Inflation Factor (VIF) matrix produced based on the results of the correlation analysis conducted between pairs of selected potential predictor variables (statistical indicators).

Predictor | Population (Kosovo) | Net Income (Kosovo) | Registered Vehicles | Border Crossings (Bus) | GDP Per Capita |
---|---|---|---|---|---|
Population (Kosovo) | - | 3.5206 | 1.5203 | 2.4629 | 4.7848 |
Net Income (Kosovo) | 3.5206 | - | 2.6624 | 4.0082 | 7.5339 |
Registered vehicles | 1.5203 | 2.6624 | - | 3.1855 | 3.4387 |
Border crossings (bus) | 2.4629 | 4.0082 | 3.1855 | - | 4.704 |
GDP per capita | 4.7848 | 7.5339 | 3.4387 | 4.704 | - |
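
The entries of Table 3 appear to be consistent, up to rounding, with the pairwise relation VIF = 1 / (1 − r^{2}) applied to the correlations in Table 2. This is an inference from the published numbers, not a statement of how vif() was called in the study. The Python sketch below applies that relation together with the two thresholds given in the methodology (0.90 for the correlation and 5 for the VIF); the names `pairwise_vif` and `flagged` are hypothetical.

```python
# Pairwise Pearson correlations transcribed from Table 2 (upper triangle).
pairwise_r = {
    ("Population (Kosovo)", "Net Income (Kosovo)"): 0.8461,
    ("Population (Kosovo)", "Registered vehicles"): 0.585,
    ("Population (Kosovo)", "Border crossings (bus)"): 0.7707,
    ("Population (Kosovo)", "GDP per capita"): 0.8894,
    ("Net Income (Kosovo)", "Registered vehicles"): 0.7902,
    ("Net Income (Kosovo)", "Border crossings (bus)"): 0.8663,
    ("Net Income (Kosovo)", "GDP per capita"): 0.9313,
    ("Registered vehicles", "Border crossings (bus)"): 0.8283,
    ("Registered vehicles", "GDP per capita"): 0.8421,
    ("Border crossings (bus)", "GDP per capita"): 0.8874,
}

R_LIMIT = 0.90   # intercorrelation threshold from the methodology
VIF_LIMIT = 5.0  # variance inflation factor threshold from the methodology


def pairwise_vif(r):
    """VIF of a two-variable pair, assuming VIF = 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r * r)


# Pairs that exceed either threshold are treated as multicollinear.
flagged = [
    pair
    for pair, r in pairwise_r.items()
    if abs(r) > R_LIMIT or pairwise_vif(r) > VIF_LIMIT
]
```

With these inputs only the Net Income (Kosovo) and GDP per capita pair is flagged, which matches the absence of Net Income (Kosovo) from the later test tables.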

Potential Predictor | BP Parameter | Degrees of Freedom df | Significance p-Value | Status |
---|---|---|---|---|
Population (Kosovo) | 3.726 | 1 | 0.05357 | Homoscedastic variable |
Registered vehicles | 0.036639 | 1 | 0.8482 | Homoscedastic variable |
Border crossings (bus) | 0.066562 | 1 | 0.7964 | Homoscedastic variable |
GDP per capita | 1.1798 | 1 | 0.2774 | Homoscedastic variable |
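
The p-values in the Breusch–Pagan table above can be recovered from the BP statistic, since the statistic is compared against a chi-square distribution with the listed degrees of freedom (df = 1). The Python sketch below is an illustrative check rather than the lmtest::bptest() implementation. It uses the identity that a chi-square variable with one degree of freedom is a squared standard normal variable; the function name `chi2_sf_df1` is hypothetical.

```python
import math


def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # If Z is standard normal, P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# BP statistics transcribed from the table above.
bp_stats = {
    "Population (Kosovo)": 3.726,
    "Registered vehicles": 0.036639,
    "Border crossings (bus)": 0.066562,
    "GDP per capita": 1.1798,
}

p_values = {name: chi2_sf_df1(stat) for name, stat in bp_stats.items()}

# A p-value above 0.05 means homoscedasticity is not rejected.
homoscedastic = {name: p > 0.05 for name, p in p_values.items()}
```

Population (Kosovo) passes only narrowly (p of about 0.054), so its classification is sensitive to the chosen 0.05 threshold.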

Potential Predictor | DW Parameter | Autocorrelation | Significance p-Value | Status |
---|---|---|---|---|
Population (Kosovo) | 1.62382 | −0.01776807 | 0.374 | Non-autocorrelated variable |
Registered vehicles | 2.239692 | −0.3358286 | 0.976 | Non-autocorrelated variable |
Border crossings (bus) | 1.802849 | −0.05925684 | 0.984 | Non-autocorrelated variable |
GDP per capita | 3.295577 | −0.714349 | 0.114 | Non-autocorrelated variable |
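
The DW parameter in the table above is the Durbin–Watson statistic, defined as the sum of squared differences between successive residuals divided by the sum of squared residuals. Values near 2 indicate no first-order autocorrelation, values towards 0 indicate positive autocorrelation, and values towards 4 indicate negative autocorrelation. The Python sketch below computes the statistic only; the p-values reported by the R function are not reproduced, and both residual series are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of a residual series ordered in time."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals, not taken from the study.
alternating = [1.0, -1.0, 1.0, -1.0]     # sign flips every year
trending = [-2.0, -1.0, 0.0, 1.0, 2.0]   # slow drift in one direction

dw_alternating = durbin_watson(alternating)  # towards 4: negative autocorrelation
dw_trending = durbin_watson(trending)        # towards 0: positive autocorrelation
```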

Potential Predictor | W Parameter | Significance p-Value | Status |
---|---|---|---|
Population (Kosovo) | 0.92769 | 0.5314 | Normally distributed variable |
Registered vehicles | 0.95276 | 0.7547 | Normally distributed variable |
Border crossings (bus) | 0.85237 | 0.2021 | Normally distributed variable |
GDP per capita | 0.93679 | 0.61 | Normally distributed variable |

Iteration | Selected Predictor | R-Square | Adjusted R-Square | AIC | RMSE |
---|---|---|---|---|---|
I. variable entered | Border crossings (bus) | 0.9452 | 0.9269 | 78.5406 | 441.6576 |
II. variable entered | Population (Kosovo) | 0.9867 | 0.9735 | 73.4471 | 266.1115 |

**Table 8.** Summary of results obtained for the initial version of the multiple regression model (with COVID-19 scenario), selected by the forward stepwise regression procedure.

Coefficients: | Estimate | Standard Error | t-Value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 114,800 | 45,620 | 2.516 | 0.1283 |
Border crossings (bus) (with COVID-19) | 0.0298 | 0.003126 | 9.534 | 0.0108 |
Population (Kosovo) (with COVID-19) | −0.06449 | 0.02577 | −2.503 | 0.1294 |

R-squared | Adjusted R-squared | Residual standard error | F-statistic | Significance p-value |
---|---|---|---|---|
0.9867 | 0.9735 | 266.1 | 74.35 | 0.01327 |
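
The coefficient estimates in Table 8 define a linear prediction equation: demand = intercept + b1 × border crossings (bus) + b2 × population. The Python sketch below shows how such an equation would be evaluated. The input values are hypothetical and are not observations from the study, and because the published coefficients are rounded, the result is only indicative.

```python
# Coefficient estimates transcribed from Table 8 (rounded as published).
INTERCEPT = 114_800.0
B_BORDER_BUS = 0.0298
B_POPULATION = -0.06449


def predict_demand(border_crossings_bus, population):
    """Annual passenger demand predicted by the with-COVID-19 model."""
    return (
        INTERCEPT
        + B_BORDER_BUS * border_crossings_bus
        + B_POPULATION * population
    )


# Hypothetical inputs chosen only to show the arithmetic:
# 114800 + 0.0298 * 100000 - 0.06449 * 1800000
example = predict_demand(border_crossings_bus=100_000, population=1_800_000)
```

The intercept and the population term are both large and nearly cancel, so small rounding differences in the coefficients translate into noticeable shifts in the predicted demand.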

**Table 9.** Summary of results obtained for the calibrated version of the multiple regression model (without COVID-19 scenario), obtained by combining the actual values of the selected predictors and output variable measured between 2015 and 2018 with the simulated values of these variables for the period between 2019 and 2021 (during the COVID-19 pandemic).

Coefficients: | Estimate | Standard Error | t-Value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | −5755 | 90,850 | −0.063 | 0.955 |
Border crossings (bus) (without COVID-19) | 0.0165 | 0.006381 | 2.586 | 0.123 |
Population (Kosovo) (without COVID-19) | 0.004338 | 0.005132 | 0.085 | 0.940 |

R-squared | Adjusted R-squared | Residual standard error | F-statistic | Significance p-value |
---|---|---|---|---|
0.9786 | 0.9571 | 466.1 | 45.66 | 0.02143 |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Jovanović, B.; Shabanaj, K.; Ševrović, M. Conceptual Model for Determining the Statistical Significance of Predictive Indicators for Bus Transit Demand Forecasting. *Sustainability* **2023**, *15*, 749.
https://doi.org/10.3390/su15010749
