Article

A Data-Driven Approach to Estimating Passenger Boarding in Bus Networks

by Gustavo Bongiovi 1,†, Teresa Galvão Dias 2,†, Jose Nauri Junior 3,† and Marta Campos Ferreira 2,*,†

1 Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
2 Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência (INESC TEC), Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
3 Agência Reguladora de Serviços Públicos Delegados do Estado do Ceará (Arce), Av. Gen. Afonso Albuquerque Lima, Cambeba, Fortaleza 60822-325, Brazil
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2026, 16(3), 1384; https://doi.org/10.3390/app16031384
Submission received: 19 December 2025 / Revised: 9 January 2026 / Accepted: 27 January 2026 / Published: 29 January 2026

Abstract

This study explores the application of multiple predictive algorithms under general versus route-specialized modeling strategies to estimate passenger boarding demand in public bus transportation systems. Accurate estimation of boarding patterns is essential for optimizing service planning, improving passenger comfort, and enhancing operational efficiency. This research evaluates a range of predictive models to identify the most effective techniques for forecasting demand across different routes and times. Two modeling strategies were implemented: a generalistic approach and a specialized one. The latter was designed to capture route-specific characteristics and variability. A real-world case study from a medium-sized metropolitan region in Brazil was used to assess model performance. Results indicate that ensemble-tree-based models, particularly XGBoost, achieved the highest accuracy and robustness in handling nonlinear relationships and complex interactions within the data. Compared to the generalistic approach, the specialized approach demonstrated superior adaptability and precision, making it especially suitable for long-term and strategic planning applications. It reduced the average RMSE by 19.46% (from 13.84 to 11.15) and the MAE by 17.36% (from 9.60 to 7.93), while increasing the average R2 from 0.289 to 0.344. However, these gains came with higher computational demands and a larger mean Forecast Bias (from 0.002 to 0.560), indicating a need for bias correction before operational deployment. The findings highlight the practical value of predictive modeling for transit authorities, enabling data-driven decision making in fleet allocation, route planning, and service frequency adjustment. Moreover, accurate demand forecasting contributes to cost reduction, improved passenger satisfaction, and environmental sustainability through optimized operations.

1. Introduction

Urban mobility is a critical factor in the development of modern cities. As urban populations expand, the challenges of ensuring efficient, reliable, and accessible public transportation become increasingly significant. Public transportation plays a critical role in promoting sustainability in urban environments. It aids in reducing traffic congestion, lowering greenhouse gas emissions, and improving air quality, thereby contributing towards the United Nations’ Sustainable Development Goal (SDG) 13: Climate Action [1,2].
Moreover, public transportation provides equitable access to mobility, allowing individuals from various socioeconomic backgrounds to commute conveniently [3]. This access is crucial for fostering inclusive urban development and supporting economic growth by connecting people to jobs, education, and other essential services [4].
The accurate estimation of passenger boarding patterns in public transportation systems is essential for optimizing service operations, enhancing passenger comfort, reducing operational costs, and improving the daily lives of millions of commuters [2,5]. By optimizing routes and schedules based on actual demand, this research helps ensure that everyone, including those with lower incomes, has reliable access to essential services, shorter commute times, and better access to education, promoting social equity and contributing towards SDG 10: Reduced Inequality.
Highly populated cities exemplify the need for a detailed analysis of passenger boarding demand. As urban populations grow, the complexity of public transport networks increases, requiring careful planning to ensure that operations effectively meet passengers’ mobility needs. Understanding and addressing the unique challenges faced by densely populated cities is therefore crucial for developing data-driven strategies that enhance the efficiency, equity, and sustainability of public transportation systems worldwide.
This study focuses on developing and comparing different predictive methods, including those proposed in the literature, for estimating passenger boarding demand in public bus transportation. The primary goal is to identify the most accurate and efficient modeling approaches to support demand forecasting. Such forecasts can optimize operations by reducing overcrowding, improving passenger comfort, and increasing service efficiency. By developing reliable models for predicting where and when passengers board, transport authorities can improve route planning and allocate resources more effectively. The research is applied to a case study in a medium-sized metropolitan region in Brazil, providing practical insights into the applicability of these methods in real-world contexts.
The contributions and originality of this research lie not only in the application and comparative evaluation of multiple predictive algorithms to estimate passenger boarding demand in public bus transportation, but also in the employment of two distinct modeling approaches to examine the trade-offs between estimation accuracy and computational efficiency.
The first is a generalistic strategy, where a single predictive model is trained on data from a representative, high-coverage route and then deployed across the entire network for prediction. This approach offers operational simplicity, low computational cost, and ease of maintenance, as it requires developing and updating only one model. However, it may fail to capture the unique flow patterns and operational singularity of individual routes. The second is a specialized strategy, which involves developing and training a dedicated model for each route and travel direction. This approach is designed to capture route-specific characteristics and variability, potentially yielding higher accuracy, but at the cost of increased computational demands for training and maintaining multiple models.
By employing different methodological frameworks alongside a broad suite of algorithms, from linear regression to Deep Learning, this study seeks to identify the most effective techniques for accurately estimating boarding demand while balancing model performance and resource requirements.
The remainder of the article is structured as follows: the next section explores the state of the art regarding Automated Fare Collection systems and forecasting methods. Section 3 details the methodological approach followed to conduct this study. Section 4 presents the results of the study and Section 5 presents the main conclusions.

2. State of the Art

This section introduces Automated Fare Collection systems and presents the state of the art of the most common methods for forecasting passenger demand in public transport.

2.1. Automated Fare Collection Systems

Automated Fare Collection (AFC) systems have emerged as pivotal components in modernizing public transportation networks. Their integration within Intelligent Transportation Systems (ITSs) has transformed the landscape of urban mobility by enhancing data accuracy and service efficiency, as they automatically detect the number of passengers boarding through the use of intelligent cards [6].
The advancement from traditional manual counting methods, such as travel surveys, offers significant benefits, not only in terms of data collection efficiency but also in the accuracy and reliability of the data captured. These methods were labor-intensive and prone to human error, and are now being replaced by automated systems that ensure consistent data collection across extensive networks. This shift is crucial in urban centers where managing peak-time congestion and optimizing fleet allocation can dramatically improve service quality and passenger satisfaction [7].
Furthermore, AFC systems synergize well with Automated Passenger Counting (APC) systems, facilitating a more integrated approach to transportation management; while AFC systems streamline fare collection and improve financial accountability, APC systems enrich the dataset with passenger flow dynamics. This combined data resource is invaluable for conducting comprehensive transportation studies, developing predictive models, and refining service routes based on actual usage patterns [8,9].
Current research continues to explore the potential of APC systems in reducing operational costs and enhancing passenger experience. For instance, studies have shown that real-time passenger count data can be used to dynamically adjust service frequency, allocate resources more efficiently during peak hours, and support targeted marketing strategies by identifying high-usage segments [10].

2.2. Forecasting Methods

In this subsection, a comprehensive analysis of some of the most common methods for forecasting passenger demand in public transport is presented. Table 1 provides an overview of the methods’ applications, advantages, and limitations.

2.2.1. Linear Regression

Linear regression remains a foundational statistical tool for predicting passenger boarding due to its straightforward implementation and interpretability. Its utility in quantifying relationships between passenger demand and influencing factors, such as the last observed passenger count, headway deviation, and environmental conditions like temperature and precipitation, has been demonstrated in numerous public transport studies. For instance, Sun et al. [11] highlighted the model’s adaptability to the complexities of transit data, while ref. [12] expanded its application to real-time APC systems, using direct and interactive variables to forecast short-term passenger trends and enhance operational planning.
Despite its simplicity, its predictive performance is highly competitive, often rivaling or even surpassing sophisticated machine learning algorithms in certain contexts [13]. However, it can produce unrealistic outputs, such as negative boarding values, and its efficacy diminishes in environments characterized by high variability and inconsistent predictor influences. Nevertheless, comparative analyses have shown that standard linear regression can be more suitable for boarding data than specialized count regression models, solidifying its status as a reliable baseline [12].

2.2.2. Elastic Net

Elastic Net (EN) combines Lasso (L1) and Ridge (L2) regularization, addressing the issues of multicollinearity and overfitting commonly found in transport demand data [14]. By balancing variable selection and coefficient shrinkage, it improves model stability and interpretability when predictors are highly correlated.
Geçici and Gürkaş-Aydin [15] applied the EN model to forecast hourly passenger numbers in Istanbul’s multimodal network, demonstrating its robustness against feature correlation and its ability to provide accurate and interpretable predictions. Likewise, Porat et al. [16] incorporated EN within a micromobility demand framework, confirming its reliability for handling mixed spatial–temporal data and maintaining accuracy competitive with more complex algorithms.
Despite its linear structure, Elastic Net remains a strong baseline model for transport forecasting, particularly when feature relationships are numerous and moderately correlated.

2.2.3. Random Forest

Random Forest (RF) is a technique that builds multiple decision trees and merges them to obtain more accurate and stable predictions. In the context of passenger demand, this model addresses the limitations often found in traditional regression models by effectively processing large datasets and accommodating the random variability in bus data. Its ensemble approach, which constructs numerous decision trees and aggregates their predictions, ensures robust performance against overfitting. This methodology minimizes the influence of outliers and noise, enhancing the predictive accuracy of demand estimates [17].
The effectiveness of the RF model in forecasting passenger numbers on buses was explored [12], using real-time data from bus operations, such as APC systems and weather information. The RF approach excels in managing complex, nonlinear data interactions that frequently characterize transit data, making it suitable for capturing intricate patterns of passenger flow.
Wood et al. [12] demonstrated the application of the Random Forest model in the transportation sector through comprehensive analysis. The algorithm not only predicts future bus boarding but also adapts to new patterns as data become available. This adaptability is crucial in the dynamic environment of public transportation, where passenger behaviors and external conditions like weather can swiftly alter demand dynamics.

2.2.4. XGBoost

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient-boosting framework, designed to optimize both computational speed and model performance [18]. XGBoost has been employed to classify the demand levels of trains based on query logs and external data sources, as demonstrated by [19]. This approach leverages the nonlinear and complex relationships inherent in the varied datasets typically used in transport studies, including timestamps, station codes, and external conditions such as weather.
The algorithm can handle large datasets and its robustness against overfitting makes it particularly suited for the dynamic and often-unpredictable patterns observed in public transit usage data.
Vandewiele et al. [19] trained the XGBoost model on historical data to predict demand levels, processing features extracted from query logs and incorporating additional data sources such as time of day, routes, and passenger feedback. This prediction model aids in anticipating demand rates, enhancing the management of train capacities, and potentially leading to improved service quality and reduced overcrowding.

2.2.5. LightGBM

LightGBM is an advanced gradient-boosting framework that stands out for its efficiency in processing large data volumes. It builds on decision tree algorithms by optimizing traditional gradient-boosting methods, making it particularly suitable for scenarios where the dataset features nonlinear relationships and complex dependencies [20].
One study used LightGBM to analyze various data inputs [21], including real-time vehicle locations, passenger counts from APC systems, and external factors such as time and weather conditions. The algorithm’s robustness against overfitting and its capability to handle large datasets and complex feature interactions efficiently offer a sophisticated approach to predicting vehicle demand.
Furthermore, Gallo et al. [21] illustrated the application of LightGBM within a network-wide public transport occupancy prediction framework, focusing on the Zurich transport system. The proposed framework integrates LightGBM to predict demand across multiple lines, accounting for interactions between them, which enhances prediction accuracy significantly. The success of this method in Zurich’s complex urban transport network highlights LightGBM’s potential to enhance operational intelligence and support the development of more responsive public transportation systems.

2.2.6. LSTM

Long Short-Term Memory (LSTM) models are a type of Recurrent Neural Network, a Deep Learning approach that is particularly well-suited for time series prediction [22]. They are very effective in capturing long-term dependencies and patterns in time series data due to their ability to retain information over extended periods [23], offering robust performance in scenarios involving nonlinear and long-range temporal dependencies [24,25].
LSTM models excel at handling temporal variability, which is crucial for analyzing public transit arrival and departure events, and are generally quite accurate and can predict occupancy across multiple steps in a time series. However, their main disadvantages include a complicated training procedure and a significant lack of interpretability, often requiring surrogate models to explain their “black box” results [26].

3. Methodological Approach

The methodology used in this study was adapted from the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology, a structured, cyclical approach for planning, organizing, and implementing data mining projects proposed by [27]. The process ensures a comprehensive and systematic workflow, enhancing the reliability and validity of the research outcomes.

3.1. Data Understanding

The data used in this study are derived from the AFC system of the bus network of a medium-sized metropolitan region in Brazil: Fortaleza. With a population of over 3 million inhabitants, its six-zone network structure supports the rigorous evaluation of the proposed modeling strategies. The dataset integrates tap-on validations from the metropolitan fare system (“Bilhete Único”) with General Transit Feed Specification (GTFS) schedules and route geometry, which enables mapping boardings to fare segments, separating trips by direction, and testing models on routes with different footprints (single-zone and multi-zone). These characteristics are particularly important for the specialized approach adopted in this work: multi-zone lines test the models’ ability to capture spatial heterogeneity across zones, while bidirectional trip data allow the assessment of direction-specific demand dynamics.
The GTFS data of the selected bus lines were composed of four datasets, providing a standardized structure for schedules, routes, trips, and stop and zone locations, used to estimate the zone where each boarding took place. The combination of AFC and GTFS data provides both the spatial granularity and temporal volume needed to evaluate whether specialized models materially outperform the generalistic model in contexts with route heterogeneity and strong directional effects.
Initially, the data comprised 15 AFC datasets, each corresponding to one high-demand bus line, totaling 11,141,341 raw validation records (approximately 1.11 GB). Each validation corresponds to a single boarding event recorded when a passenger (or fare collector) validated access to the bus. Each dataset contains transactional and operational variables, including the following: (i) temporal information, the most relevant group (transaction timestamp; trip opening and closing times; service date); (ii) operational identifiers (trip direction; route id); (iii) boarding counters (trip initial and final turnstile readings); and (iv) fare and passenger attributes (fare; type of passenger; passenger ID).
From this initial universe, a strategic sample of six bus lines was selected. Table 2 presents their corresponding summary statistics. The selection criteria were designed to ensure diversity in operational characteristics and robust coverage of the network, based on three key metrics: (1) passenger demand volume, (2) number of scheduled trips, and (3) number of zones covered. This sample represents 21.3% of the entire network’s passenger demand and 17.8% of all bus trips within the metropolitan area, and includes lines that collectively cover all zones, ensuring broad representativeness. Of the six lines, five operate in multiple zones while one operates in a single zone: the center of the metropolitan area.
Since each line operates in two travel directions, each direction was treated as an independent dataset, resulting in 12 datasets. Each line–direction combination was treated as a “route”. For instance, Route 10 corresponds to line 1 with travel direction 0, while Route 11 corresponds to line 1 with travel direction 1; line 1, in contrast, encompasses both directions of the same line.
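The route identifiers above follow a simple concatenation of line number and travel direction. A minimal sketch of this encoding (the function name is illustrative, not from the study):

```python
def route_id(line: int, direction: int) -> int:
    """Encode a (line, direction) pair as a route identifier.

    For example, line 1 with direction 0 becomes Route 10,
    and line 1 with direction 1 becomes Route 11.
    """
    return line * 10 + direction
```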
After the data preparation, described in the following subsections, the final aggregated datasets comprised 291,460 records, each corresponding to the number of boarding events in a given zone for a specific trip, which constitute an effective sample size used for model training and evaluation.
Figure 1 shows that, as expected, the busiest days are regular workdays, Monday–Friday, with a noticeable drop during the weekend, especially on Sunday. Lines 5 and 6 exhibit the highest number of boarding instances, peaking around 30; however, it is important to note that line 5 operates in only three zones and line 6 in only one. In contrast, line 2, which operates in six zones, has the lowest and relatively stable boarding numbers throughout the week. The data highlight passenger trends across different lines, indicating higher demand on weekdays and a substantial decline during weekends. This consistent pattern of fluctuation across all bus lines suggests a seasonality that spans the entire analyzed network, legitimizing the use of a generalistic model.
Figure 2 reveals the peak periods of demand across routes. During the morning peak (04:30–07:00), lines 1, 5 and 6 display the highest boarding levels, with sharp increases relative to off-peak periods, indicating strong commuter-oriented demand. Line 5 shows a particularly sharp morning ramp-up, consistent with inter-zonal work-related travel, while line 6, despite operating within a single zone, also concentrates substantial demand during this period. In the evening peak (15:00–18:30), lines 1, 5 and 6 again display an absolute increase in passenger volume, suggesting a return-flow symmetry. Line 2, by contrast, maintains relatively even boarding levels across both peak and non-peak periods, as the only line that covers all zones. Each line shows unique overall fluctuations, with different tendencies in the deviations. These distinct patterns justify the use of a specialized approach to capture the unique characteristics of each line.

3.2. Data Preparation

The data preparation phase involved tasks such as data integration, data cleaning, data transformation, and feature engineering, which are detailed in the following subsections.

3.2.1. Data Integration

Since the datasets lacked boarding-zone information, it had to be estimated from the GTFS data, which provide the expected time each bus passes through stops within designated zones. The route data were merged with the GTFS based on common identifiers to construct a robust dataset containing the boarding zone inferred from each transaction timestamp.

3.2.2. Data Cleaning

Missing values were checked for each bus route and each variable individually by applying the dropna function, which drops rows with any null values. Missing values were detected only for the variable cartao_xml, in all datasets. This column refers to the card used to validate entry to the bus: some passengers pay with cash instead of using a card, in which case the fare collector uses their general card to allow entry, and no value appears in this column.
Illogical values were identified only on the total_turns variable. This column refers to the total number of boardings in the trip, and negative values were detected as a result of the turnstile system resetting after reaching 99,999. These values were corrected by adding 99,999 to the negative counts, thereby restoring an accurate number of boardings.
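The wrap-around correction described above can be sketched as follows (the function name is illustrative; it follows the text’s rule of adding 99,999 back to negative counts):

```python
COUNTER_MODULUS = 99_999  # the turnstile counter resets after reaching 99,999

def correct_turn_count(total_turns: int) -> int:
    """Restore a trip's boarding count when the turnstile counter wrapped.

    A wrap-around shows up as a negative total_turns value; per the text,
    it is corrected by adding the counter modulus back.
    """
    return total_turns + COUNTER_MODULUS if total_turns < 0 else total_turns
```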
To ensure numerical stability during model training and to limit the influence of extreme sparsity and unusually large aggregate values, a heuristic filtering step was applied to the trip_time variable, retaining rows with values between 0.25 and 2 times the mean. For line 4, which serves the most urbanized corridor and is heavily influenced by traffic effects, the limits were relaxed to between 0.15 and 2.5 times the mean. These thresholds were not derived from a theoretical distributional model but were adopted as practical bounds to reduce the impact of highly atypical observations that may arise from data noise, reporting irregularities, or rare operational conditions. The purpose of this filtering was to improve model robustness rather than to enforce strict statistical assumptions. Consequently, results should be interpreted within the context of these preprocessing choices, and future work should explore data-driven or adaptive thresholding strategies.
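A minimal sketch of this mean-based heuristic filter (the function name is illustrative; the default bounds match the main configuration, while line 4 would use 0.15 and 2.5):

```python
def within_bounds(values, lo_factor=0.25, hi_factor=2.0):
    """Keep observations between lo_factor*mean and hi_factor*mean.

    Defaults correspond to the main configuration in the text;
    line 4 would use lo_factor=0.15 and hi_factor=2.5.
    """
    mean = sum(values) / len(values)
    return [v for v in values if lo_factor * mean <= v <= hi_factor * mean]
```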
Furthermore, occasional bus trips outside the programmed schedules with a very low number of boardings were detected, suggesting technical or logistical operations. Thus, a threshold was defined: trips without a programmed schedule and with fewer than 50 occurrences (fewer than approximately one trip per week over the year) were excluded from the dataset. On average, this process removed 1.57% of rows from the datasets.
It is important to emphasize that these preprocessing thresholds are heuristic in nature and were selected to support model stability rather than to represent universal or optimal values.

3.2.3. Data Transformation

The data were converted to the correct types, categorical variables were encoded, and all integer variables were cast to the int32 format, ensuring the datasets were in an optimal format for processing. To streamline the dataset and reduce its dimensionality, redundant variables were eliminated by selecting only the relevant columns, i.e., those that contribute significantly to the model’s predictive power.

3.2.4. Feature Engineering

The target variable total_boardings was created by grouping by trip, zone, and date–hour of departure and counting the number of passengers boarding in each zone. This grouping yields the number of passengers per zone for each trip, establishing the target variable used for prediction.
Furthermore, period_bus_time was created to simplify the analysis of bus time data. The bus times were categorized into half-hour intervals instead of continuous-time data. This process converts each bus time entry into a categorical variable representing a specific interval, such as 00:00–00:30, 00:30–01:00, and so on, up to 23:30–00:00.
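The half-hour binning can be sketched as follows (the function name mirrors the feature name; the label format is an assumption):

```python
def period_bus_time(hour: int, minute: int) -> str:
    """Label a departure time with its half-hour interval.

    For example, 07:40 maps to '07:30-08:00' and 23:45 wraps
    around midnight to '23:30-00:00'.
    """
    start = 0 if minute < 30 else 30
    end_hour, end_min = (hour, 30) if start == 0 else ((hour + 1) % 24, 0)
    return f"{hour:02d}:{start:02d}-{end_hour:02d}:{end_min:02d}"
```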
The granular components of the date were extracted: the day of the month, the month, the quarter of the year, the weekday, and a weekend indicator. These components helped to capture different seasonal patterns in bus demand. To capture other temporal dependencies through intra-month variations, lag features were created for the number of passengers per zone: based on total_boardings, the number of passengers boarding 7 and 14 days prior to the current date was calculated. This step produced the features boardings_zone_lag_7 and boardings_zone_lag_14.
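A pandas sketch of these feature-engineering steps on a toy frame (column names follow the text but the data are illustrative; the lag shift assumes one row per zone and day, so sparse real data may need reindexing first):

```python
import pandas as pd

# Toy boarding-event records; one row per validation.
events = pd.DataFrame({
    "trip_id": [1, 1, 1, 2, 2],
    "zone": ["A", "A", "B", "A", "A"],
    "date": pd.to_datetime(["2024-01-01"] * 3 + ["2024-01-02"] * 2),
})

# Target: count boarding events per trip, zone and departure date.
target = (events.groupby(["trip_id", "zone", "date"])
                .size().rename("total_boardings").reset_index())

# Lag features: boardings in the same zone 7 and 14 days earlier.
daily = (target.groupby(["zone", "date"])["total_boardings"]
               .sum().reset_index().sort_values(["zone", "date"]))
for lag in (7, 14):
    daily[f"boardings_zone_lag_{lag}"] = (
        daily.groupby("zone")["total_boardings"].shift(lag)
    )
```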
Finally, the selection included the variables shown in Table 3.

3.3. Modeling

For this study, six predictive models commonly used in the literature, as discussed in Section 2, were adopted based on their effectiveness in handling time series data in public transportation systems.
The time series cross-validation (TSCV) technique was adopted to avoid overfitting and enhance the model’s ability to handle unseen data. Specifically, an expanding window strategy with five splits (K = 5) was used. Respecting the temporal chronology, TSCV divides the data into K folds, each representing a different time segment. In each iteration, the model is trained on past data and tested on future data.
To ensure the best model configuration and optimal performance, TSCV was combined with hyperparameter tuning. Tuning optimizes the hyperparameters that define the predictive model; the search for the best configuration was conducted over the predefined distributions of hyperparameters stated in Table 4, using RandomizedSearchCV from the sklearn library with a total of 50 iterations [28].
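A minimal sketch of combining an expanding-window time series split with randomized search, on synthetic data and with a reduced search budget (the study used 50 iterations and the distributions in Table 4; the features, targets, and parameter values here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # stand-in features
y = 3 * X[:, 0] + rng.normal(size=200)   # stand-in boarding counts

# K = 5 expanding-window folds: each fold trains on all past data
# and tests on the next time segment, respecting chronology.
tscv = TimeSeriesSplit(n_splits=5)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100],
                         "max_depth": [3, 5, None]},
    n_iter=4,                            # reduced here; the study used 50
    cv=tscv,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
```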
Two approaches to developing the models were adopted, one generalistic approach and one specialized. The first uses data from Route 20 (line 2, direction 0), the largest route that operates in all six zones, to create a model that can be deployed to any bus route. This approach simplifies model deployment and maintenance, enhances efficiency, and requires fewer computational resources.
The generalistic model was trained using data from Route 20 (line 2, direction 0), which was selected as a representative reference route based on operational considerations. Line 2 is the only one that operates across all six spatial zones of the network, ensuring exposure to the full range of spatial demand patterns present in the area. Furthermore, exploratory data analysis revealed that line 2 exhibits low variability in boardings across time and zones, indicating more stable demand patterns compared to other routes. This reduced dispersion minimizes the influence of route-specific anomalies. Route 20 was selected over Route 21 due to its higher number of validations.
On the other hand, the specialized approach involves developing individual models for each route. This allows us to tailor the model to the specific usage patterns of each route, potentially leading to more accurate and context-specific predictions.
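The contrast between the two deployment strategies can be sketched as follows (the train function is a placeholder for the full TSCV and tuning pipeline; route labels are illustrative):

```python
# Placeholder for the full training pipeline (TSCV + hyperparameter tuning).
def train(route: str) -> dict:
    return {"trained_on": route}

routes = ["10", "11", "20", "21"]

# Generalistic: one model fitted on the reference route, reused network-wide.
general_model = train("20")
generalistic = {route: general_model for route in routes}

# Specialized: a dedicated model per route (line + direction).
specialized = {route: train(route) for route in routes}
```

The generalistic dictionary maps every route to the same object (one model to maintain), while the specialized dictionary holds one independent model per route, trading computational cost for route-specific fit.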
The LSTM network architecture comprises a single LSTM layer followed by a fully connected output layer. The LSTM layer uses the ReLU activation function and includes a tunable number of hidden units and batch size, as described in Table 4. A dense layer with one neuron produces the final boarding demand prediction. No dropout or additional recurrent layers were employed, limiting model complexity and reducing the risk of overfitting. The model was trained using the MSE loss function with the Adam or RMSprop optimizer, selected during hyperparameter tuning.
The LSTM model was not used in the generalistic approach: because of its ability to capture detailed temporal patterns, it tends to fit closely to the specific characteristics of the data it is trained on, limiting its transferability to other routes.
The algorithm utilized to train, test, evaluate and deploy the models for each approach followed a structured workflow to ensure the optimal configuration and performance of the predictive models.

3.4. Evaluation

To evaluate the performance of the applied models, four metrics were adopted: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared (R2), and Forecast Bias. Each provides distinct insights into the model’s accuracy.
RMSE is the square root of the average squared discrepancy between predicted and actual values, giving more weight to larger errors. According to [29], this metric is particularly important in contexts where large errors may have severe implications, such as public transportation planning.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $n$ is the number of observations, $y_i$ are the observed values, and $\hat{y}_i$ are the predicted values.
MAE estimates the average absolute differences between the predicted and actual values. It is robust and provides a straightforward average magnitude of errors, while being less sensitive to higher deviations.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
where $n$ is the number of observations, $y_i$ are the observed values, and $\hat{y}_i$ are the predicted values.
R2, or the coefficient of determination, measures the proportion of variance in the dependent variable that is explained by the model. It ideally ranges from 0 to 1, where values closer to 1 indicate that the model explains a high proportion of the variability, suggesting better performance. However, values below 0 are possible when the model performs worse than a simple mean model, indicating very poor predictive power.
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $y_i$ are the observed values, $\hat{y}_i$ are the predicted values, and $\bar{y}$ is the mean of the observed values.
Forecast Bias measures the average difference between predicted and actual values, providing insights into the overall tendency of the model to over-predict or under-predict. A Forecast Bias close to 0 indicates unbiased predictions, meaning the model does not systematically deviate from the true values.
$$\mathrm{Forecast\ Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)$$
where $\hat{y}_i$ are the predicted values, $y_i$ the observed values, and $n$ the total number of observations.
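The four metrics follow directly from their definitions; a minimal NumPy implementation is:

```python
# Direct implementations of the four evaluation metrics.
import numpy as np


def rmse(y, y_hat):
    """Root Mean Squared Error: penalizes large errors more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))


def mae(y, y_hat):
    """Mean Absolute Error: average error magnitude, less sensitive to outliers."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))


def r2(y, y_hat):
    """Coefficient of determination; can be negative when worse than the mean model."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def forecast_bias(y, y_hat):
    """Positive values indicate systematic over-prediction, negative under-prediction."""
    return float(np.mean(np.asarray(y_hat) - np.asarray(y)))
```

Note that a model can have zero Forecast Bias (over- and under-predictions cancel out) while still having a large RMSE, which is why the metrics are reported together.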

4. Results

This section presents the results obtained through the application of the proposed methodology for the generalistic and specialized approaches, to be further compared in the next section. These indicators allow assessing not only the overall accuracy of the models but also their ability to generalize across different operational contexts.

4.1. Generalistic Performance

Table 5 summarizes the results obtained from the generalistic approach, in which a single model was trained on the largest route and then deployed to all routes. After a comprehensive hyperparameter tuning process, the best configuration for each model was selected according to validation performance.
ElasticNet and XGBoost achieved the most consistent results across different metrics. Although the R2 scores remained negative, the lower RMSE and MAE values of ElasticNet (7.76 and 5.83, respectively) highlight its superior ability to minimize prediction errors on average. This suggests that the regularization effect introduced by the ElasticNet penalty effectively balanced bias and variance, enhancing the model’s generalization capacity in a heterogeneous dataset.
The results also indicate that more complex ensemble models, such as Random Forest, did not outperform linear methods. This may be attributed to the limited availability of explanatory features or the presence of high noise levels in the input data, which constrain the capacity of tree-based methods to identify meaningful nonlinear relationships. The small absolute bias values observed for the best models (below ±1) confirm that systematic over- or under-prediction tendencies were minimal, demonstrating a relatively balanced forecast behavior.
In Table 5, the R2 scores indicate that, on the test set of Route 20 itself, the predictions of each tuned model are less accurate than using the simple mean of Route 20’s training data as a constant forecast. This phenomenon occurs when a model is too regularized or when the time series has very low explainable variance relative to its mean. Route 20, as the most stable and multi-zonal line, presents a challenging forecasting scenario in which the global mean is a strong, hard-to-beat baseline. However, this result is specific to the model selection process on this particular route. Crucially, when the selected best model (ElasticNet) is subsequently deployed as the generalistic approach to other routes, as shown in Table 6, it yields positive R2 values for all 12 routes. This demonstrates that the patterns learned from Route 20, while not surpassing the mean on its own test set, contain transferable knowledge that provides better-than-mean predictions for the other routes.
The best model, ElasticNet, was then deployed to the individual routes; Table 6 summarizes the key metrics. The R2 results show low to moderate explanatory performance across the different routes, with Route 11 exhibiting the highest score of 0.41, meaning the model explains 41% of the variance in boardings for this route, the best fit among all routes. Additionally, the bias values across the routes are close to zero.

4.2. Specialized Performance

In this approach, twelve independent models were created, one for each bus route, allowing for greater adaptation to the specific operational and behavioral patterns of each line. The performance results obtained through the execution of Algorithm 1 and the manual selection of the best-performing models are summarized in Table 7.
Algorithm 1 Predictive models development
1: results ← [ ]
2: for each model do
3:    define all possible hyperparameter combinations H
4:    model_results ← [ ]
5:    for i ← 1 to 50 do // hyperparameter tuning
6:       randomly select a hyperparameter combination h ∈ H
7:       create five folds from the trainset and testset
8:       fold_metrics ← [ ]
9:       for each fold do // time-series cross-validation
10:         fit the trainset with hyperparameters h
11:         make predictions on the testset
12:         fold_metrics ← fold_metrics ∪ {performance metrics}
13:      end for
14:      model_results ← model_results ∪ {average(fold_metrics)}
15:   end for
16:   results ← results ∪ {model_results}
17: end for
18: return results
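Algorithm 1 can be sketched in runnable form for a single model family as follows. This is an illustrative sketch, not the study's exact code: the use of scikit-learn's `TimeSeriesSplit` for the time-series cross-validation, `RandomForestRegressor` as the example model, and RMSE as the sole fold metric are assumptions of the example.

```python
# Illustrative sketch of Algorithm 1 for one model family: random hyperparameter
# search with time-series cross-validation, scored here by average fold RMSE.
import random

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit


def random_search(X, y, param_space, n_iter=50, n_splits=5, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(n_iter):                              # hyperparameter tuning loop
        h = {k: rng.choice(v) for k, v in param_space.items()}
        fold_metrics = []
        for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
            model = RandomForestRegressor(**h, random_state=0)
            model.fit(X[train_idx], y[train_idx])        # fit on earlier data only
            pred = model.predict(X[test_idx])            # predict on later data
            fold_metrics.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
        results.append((h, float(np.mean(fold_metrics))))
    return min(results, key=lambda r: r[1])              # best (h, avg RMSE) pair
```

`TimeSeriesSplit` keeps the chronological order of observations, so each fold trains on the past and tests on the future, which is the property the pseudocode's "time-series cross validation" requires.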
Overall, the ensemble-tree-based models frequently emerged as the stronger performers across the different lines, with XGBoost being the best model in 8 out of the 12 routes. The variability in model performance across lines highlights the importance of tailoring the choice of model to the specific characteristics and challenges of each dataset. These results are in line with the findings reported in [19].
Model effectiveness varies with the specific characteristics of each route: for instance, the LSTM was the best model for Route 20, while the regression models consistently under-performed on the remaining routes, indicating that they are not well suited to predicting passenger numbers on the analyzed bus network.
Compared to the generalistic approach (Table 6), the specialized models generally achieved a substantial improvement in the key metrics. For instance, Routes 10, 11, 30, and 60 showed gains of more than 0.25 in R2, confirming that route-specific training enhances model adaptability and precision while reducing the error metrics.
However, it is important to highlight that in this approach the bias increased considerably on some routes, particularly Routes 40, 41, and 50, which exhibit high spatial heterogeneity in demand across zones and strong temporal asymmetry between peak and off-peak periods. This effect is amplified in routes with uneven zone-level boarding distributions and limited representation of low-demand intervals in the training data.
Although the selection procedure identifies the best-performing model based on aggregated error metrics, the results for Route 50 indicate that a bias-aware model selection may be beneficial. For instance, in this case, the linear regression model achieved a substantially lower bias (bias: 2.17) while maintaining competitive accuracy (RMSE: 17.79, MAE: 13.76), suggesting that constraining model selection by acceptable bias thresholds can effectively mitigate systematic Forecast Bias without significantly degrading overall predictive performance.
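Such a bias-aware selection rule could look like the following sketch: among candidate models whose absolute Forecast Bias stays under a threshold, pick the one with the lowest RMSE. The candidate records are illustrative; only the linear-regression figures (RMSE 17.79, MAE 13.76, bias 2.17) and the 6.69 bias come from the text, while the remaining XGBoost values are hypothetical placeholders.

```python
# Sketch of bias-aware model selection: constrain by an acceptable bias
# threshold, then minimize RMSE within the admissible set.

def select_bias_aware(candidates, bias_threshold):
    """candidates: list of dicts with 'name', 'rmse', and 'bias' keys."""
    admissible = [c for c in candidates if abs(c["bias"]) <= bias_threshold]
    pool = admissible or candidates  # fall back to all models if none qualify
    return min(pool, key=lambda c: c["rmse"])


# Route 50 illustration; the XGBoost RMSE below is a hypothetical placeholder.
candidates = [
    {"name": "xgboost", "rmse": 16.90, "bias": 6.69},            # accurate but biased
    {"name": "linear_regression", "rmse": 17.79, "bias": 2.17},  # slightly worse, far less biased
]
```

With a bias threshold of, say, 3 boardings, the rule would prefer the linear regression despite its marginally higher RMSE, which is exactly the trade-off discussed above.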
Based on the results, Route 10 demonstrated one of the best performances. As shown in Figure 3, the predictions tend to match the actual values, highlighting the model’s ability to estimate the number of boardings during both peak and off-peak periods. There is a noticeable pattern of high demand on regular weekdays, especially on the first trips of the day, which the model captures well. During non-peak times, the model also maintains a reasonable level of accuracy, though with slight deviations from the actual values.
Figure 4 exhibits the aggregated actual values and predictions for each day of the month. It shows a consistent cyclical pattern, indicating regular fluctuations tied to weekly cycles, such as higher values on weekdays and lower values on weekends. The alignment of peaks and troughs suggests that the model captures the timing of these fluctuations, with predictions occasionally slightly over- or underestimating the actual values.

5. Discussion

ElasticNet had the best overall results for the generalistic approach, indicating that it not only minimizes the error metrics but also maintains a relatively low bias in its forecasts. However, R-squared was poor for all models, suggesting that they struggled to explain the variance in the data. The results indicate that predictions vary significantly between different lines and directions, likely due to differences in passenger patterns and demand trends. These performances suggest that, while the model is somewhat robust, there is room for adjustments to further enhance its reliability.
In methodological terms, the relatively strong performance of ElasticNet compared to more complex ensemble methods suggests that linear relationships, when properly regularized, can still yield competitive predictive accuracy [12]. This finding supports the notion that lightweight, transparent models can serve as effective baselines in transportation analytics, as they offer a balance of reasonable accuracy, computational efficiency, and interpretability.
As supported by Table 8, the specialized route-level approach successfully captured local trends that might be lost in the generalistic framework. This improvement is especially relevant for practical implementation in small- and medium-scale transportation systems, where operational heterogeneity across lines can undermine the reliability of a single unified model [30].
The performance variability across lines highlights the importance of tailoring the modeling approach to the unique characteristics of each route and direction, as the literature suggests that modeling each travel direction separately provides higher accuracy [31]. For instance, while Route 10 achieved an R2 of 0.733, indicating strong explanatory power, other routes such as Routes 41 and 50 presented low or even negative R2 values, suggesting that additional contextual or temporal features may be required to improve predictive accuracy. Such discrepancies may also stem from route-specific factors such as irregular schedules, inconsistent passenger behaviors, or seasonal anomalies.
Furthermore, the bias values in Table 7 reveal interesting patterns: while some routes maintain near-zero bias, indicating well-balanced forecasts, others exhibit significant positive bias, reflecting a consistent overestimation of demand. This suggests that, although the specialized models capture general trends effectively, they may still benefit from a bias correction or calibration stage before real-world deployment [32]. As mentioned in Section 4.2, a bias-aware model selection would be beneficial to mitigate bias while maintaining prediction accuracy. Alternatively, a quantile-regression-based calibration could be employed to align predicted and observed demand distributions [33]; while these techniques were not implemented in the present study, they represent promising extensions for improving the robustness of specialized approach.
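One simple form such a calibration stage could take is an additive bias correction: estimate the mean residual on a held-out calibration set and subtract it from future predictions. This is a minimal sketch of the idea, not a technique implemented in the study.

```python
# Minimal additive bias-correction sketch: learn the constant offset that
# zeroes the Forecast Bias on a calibration set, then subtract it at
# prediction time.
import numpy as np


def fit_bias_correction(y_true, y_pred):
    """Return the mean residual (predicted minus actual) on calibration data."""
    return float(np.mean(np.asarray(y_pred) - np.asarray(y_true)))


def apply_bias_correction(y_pred, offset):
    """Shift raw predictions so their average deviation from truth is zero."""
    return np.asarray(y_pred) - offset
```

This only removes the systematic over- or under-prediction; it leaves RMSE-relevant scatter untouched, which is why richer alternatives such as quantile-regression-based calibration are worth exploring.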
Error metrics and R2 reveal the superior performance of the specialized approach over the generalistic one, with better R2 values for eight bus routes, an RMSE reduction of 19.46%, and an MAE reduction of 17.36%. However, this approach exhibited more bias, particularly for Routes 40, 41, and 50. For instance, Route 50 had a Forecast Bias of 6.69, compared to only 0.0015 with the generalistic model, highlighting a tendency to systematically overestimate demand. Despite the improved accuracy in error metrics and R2, the specialized models thus showed significant bias issues.
Specialized models are fine-tuned for the specific characteristics of each route, offering more accurate predictions of passenger flow compared to generalized models [34]. However, it is important to highlight that the development of specialized models requires a significantly longer computation time. In a practical scenario where this specialized approach would need to be applied to numerous bus lines and directions, the cumulative computation time would increase substantially and, therefore, so would costs [12]. This extended processing period is critical, especially when scaling the model to cover the entire public bus transportation network.
A potential strategic application is a hybrid framework: employing specialized, high-accuracy models for core, high-demand routes where optimization yields the greatest operational benefit, while using the efficient generalistic approach for peripheral or lower-volume services.
The findings from the evaluation of various predictive models for estimating passenger boarding demand in public bus transportation have several significant implications for operations and strategic planning. Accurate forecasts, derived from choosing the model that best fits each route, enable transport operators to optimize resource allocation and enhance service planning. By accurately estimating the expected number of boardings per route, direction, and time period, operators can adjust service levels to better match demand, reducing the overcrowding and underutilization that currently affect public transport [35,36]. This improved alignment between supply and passenger needs enhances the efficiency and responsiveness of the transportation system.
Furthermore, precise demand forecasting contributes to cost optimization. With a clearer understanding of boarding patterns, transport managers can make informed decisions on fleet deployment, vehicle scheduling, and resource distribution, reducing operational expenses such as fuel, maintenance, and idle time while improving asset utilization. These efficiencies translate directly into an enhanced passenger experience, including shorter wait times, reduced crowding, and more reliable service, which are key factors in increasing the attractiveness of public transport for daily commuters.
Beyond operational improvements, predictive modeling supports strategic and long-term decision making. Insights into evolving demand patterns enable proactive planning for route expansion, service redesign, or the introduction of new lines. Forecasting also provides a foundation for managing fluctuations in passenger volumes, allowing authorities to implement adaptive measures such as dynamic capacity allocation, special-event scheduling, or variable pricing strategies.
A key advantage of this approach is its adaptability and scalability. Developing tailored, route-specific models significantly enhances prediction accuracy, while maintaining the potential to scale across an entire transport network. This adaptability allows agencies to deploy data-driven solutions that address local variations in demand while contributing to overall network efficiency.
Finally, integrating predictive modeling into public transport operations promotes environmental sustainability [30,37]. Optimized service alignment reduces unnecessary trips, fuel consumption, and emissions, supporting broader urban sustainability goals. Additionally, adopting predictive analytics fosters a data-informed organizational culture, driving further innovation in service management, passenger engagement, and operational excellence.
Within this context, the predictive performance obtained in this study is consistent with values reported in the literature. Several studies report strong predictive performance for passenger demand forecasting tasks. For instance, Ref. [38] reported forecasting results for Thane and Mumbai, with MAE values between 4.338 and 5.561 and RMSE between 8.752 and 11.267 using LightGBM and XGBoost models on large metropolitan datasets. Similarly, Ref. [39] achieved MAE and RMSE values of 3.13 and 4.78, respectively, for station-level predictions in Salamanca. Comparable results have been reported for large Chinese cities, such as Guangzhou and Dalian, where station-level models achieved RMSE values between 3.58 and 4.76 [40,41]. Spanos et al. [42] reported RMSE values ranging from approximately 8.6 to 29.8, depending on network complexity and city size (Tampere, Frankfurt, Carinthia, and Trikala). Similarly, a study on the large network of Qingdao reported MAE and RMSE values of 14.91 and 19.80, respectively [43]. The results obtained with the generalistic approach in this study fall within this range, with average RMSE and MAE values of 13.84 and 9.60 across all routes. More importantly, the specialized approach reduced these errors to average values of 11.15 (RMSE) and 7.93 (MAE), corresponding to reductions of approximately 19.5% and 17.4%, respectively.
However, these results are obtained under substantially different conditions, including different demand volumes, spatial conditions, travel patterns and other characteristics. As the present study focuses on a medium-sized metropolitan region characterized by heterogeneous routes, multi-zone operations, and pronounced variability in passenger behavior, these contextual differences directly affect the scale and distribution of the target variable, making the direct numerical comparisons of MAE and RMSE values across studies with different localities and data inherently problematic.
The main contribution of this work lies in demonstrating, through quantitative evidence under identical data and evaluation conditions, that a specialized approach yields measurable and systematic performance gains over a transferable generalistic approach. This controlled comparison shows that specialized models systematically improve explanatory power, with R2 values increasing for eight out of twelve routes, and reduce prediction error. The results also reveal that this improved accuracy may come at the cost of increased Forecast Bias and longer computing times for specific routes.
In summary, incorporating data-driven approaches to estimate bus passenger boarding in public transport systems strengthens operational performance, passenger satisfaction, and sustainability. By embedding these models into both day-to-day management and strategic planning, transport authorities can build more efficient, equitable, and resilient mobility systems that better serve the evolving needs of urban populations.

6. Conclusions

This study demonstrates the potential of predictive modeling to enhance the understanding and management of passenger boarding demand in public bus systems. Among the evaluated approaches, specialized models proved especially valuable despite their higher computational cost, as they capture route-specific dynamics more effectively and weigh the most relevant variables for each case. For long-term applications and strategic planning aimed at improving passenger satisfaction, these tailored models offer greater precision and reliability. Their success highlights the importance of adopting context-sensitive approaches in public transportation, where recognizing route-level variability is essential to efficient service delivery.
The comparative analysis of predictive algorithms revealed that ensemble-tree-based methods, particularly XGBoost, consistently outperformed other models across the evaluated routes. Their ability to model nonlinear relationships and complex interactions makes them well suited for routes with highly variable passenger demand.
The implications of these findings extend beyond technical performance. Accurate passenger demand estimation enables transit authorities to adjust service frequency, select appropriate vehicle types, and allocate resources based on actual ridership patterns and seasonal trends. Such precision planning can reduce overcrowding during peak hours, minimize operational costs during off-peak periods, and improve both passenger comfort and overall service efficiency. Ultimately, this contributes to a more adaptive and sustainable public transport system.
While this work suffices to demonstrate that the specialized approach outperforms the generalistic one, the feasibility of scaling the specialized approach to a full network requires careful consideration. Developing and maintaining hundreds of individual route models entails significant computational and administrative overhead, as noted in Section 5. This challenge highlights the practical value of the generalistic approach as a scalable solution.
Future work should explicitly test, on a larger and more representative sample of routes, the performance degradation of the generalistic approach and possible computational-efficiency gains for the specialized approach. A direct comparison under these scaled conditions would provide clearer, actionable guidance for transit agencies.
Future research should also address the absence of alighting and stop-by-stop boarding data. This gap prevents the construction of full Origin–Destination (OD) matrices and the direct measurement of in-vehicle occupancy levels. Without this information, models may overestimate demand in zones that are actually alighting-dominated and underestimate it where passengers concentrate transfers. Prediction accuracy is also reduced, since trip-chaining effects cannot be modeled, and systematic bias can be induced in zones that behave primarily as origins or sinks (e.g., terminals with high alighting or residential zones with high boardings).
Therefore, future refinements should prioritize the integration of detailed stop-level boarding data and contextual variables, such as weather conditions, traffic states, special events, holidays, and school schedules, to better capture demand drivers and reduce unexplained variability. In addition, studying and testing trip-chaining assumptions and probabilistic OD estimation techniques, supported by auxiliary data sources, would allow partial reconstruction of alighting flows and in-vehicle occupancy patterns, mitigating spatial bias. Finally, extending the proposed framework to multimodal transport networks would enable the analysis of intermodal transfers and network-wide demand propagation, offering a more comprehensive representation of urban mobility dynamics.
In summary, this work highlights the value of data-driven, route-specific modeling for optimizing public transport operations. By leveraging predictive analytics, authorities can move towards more efficient, equitable, and sustainable urban mobility systems.

Author Contributions

G.B.: investigation, formal analysis, methodology, writing—original draft. J.N.J.: data curation, validation and writing—review and editing. T.G.D.: conceptualization, methodology, validation, supervision, writing—review and editing. M.C.F.: conceptualization, methodology, validation, supervision, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed during the current study are not publicly available due to institutional restrictions.

Conflicts of Interest

The authors declare that they have no competing interests.

References

  1. Garus, A.; Mourtzouchou, A.; Suarez, J.; Fontaras, G.; Ciuffo, B. Exploring Sustainable Urban Transportation: Insights from Shared Mobility Services and Their Environmental Impact. Smart Cities 2024, 7, 1199–1220.
  2. Karjalainen, L.E.; Juhola, S. Framework for Assessing Public Transportation Sustainability in Planning and Policy-Making. Sustainability 2019, 11, 1028.
  3. Litman, T. Evaluating Public Transportation Health Benefits; Victoria Transport Policy Institute for the American Public Transportation Association: Victoria, BC, Canada, 2015.
  4. Glaeser, E.L.; Kahn, M.E. The greenness of cities: Carbon dioxide emissions and urban development. J. Urban Econ. 2010, 67, 404–418.
  5. Cervero, R. Transport infrastructure and Global Competitiveness: Balancing mobility and livability. Ann. Am. Acad. Political Soc. Sci. 2009, 626, 210–225.
  6. Pelletier, M.P.; Trépanier, M.; Morency, C. Smart card data use in public transit: A literature review. Transp. Res. Part C Emerg. Technol. 2011, 19, 557–568.
  7. Barry, J.J.; Newhouser, R.; Rahbee, A.; Sayeda, S. Origin and Destination Estimation in New York City with Automated Fare System Data. Transp. Res. Rec. 2002, 1817, 183–187.
  8. Wang, W.; Attanucci, J.P.; Wilson, N.H. Bus passenger Origin-Destination estimation and related analyses using automated data collection systems. J. Public Transp. 2011, 14, 131–150.
  9. Zhao, J.; Rahbee, A.; Wilson, N.H.M. Estimating a rail passenger trip Origin-Destination Matrix using automatic data collection systems. Comput.-Aided Civ. Infrastruct. Eng. 2007, 22, 376–387.
  10. Baratti, L. Automated Passenger Counting (APC) Systems and Their Use in Transport Companies. Benchmarking Among Different APC Systems: Technical, Functional and Economic Analysis. Ph.D. Thesis, Politecnico di Torino, Turin, Italy, 2021.
  11. Sun, W.X.; Song, T.; Zhong, H. Study on Bus Passenger Capacity Forecast Based on Regression Analysis including Time Series. In Proceedings of the 2009 International Conference on Measuring Technology and Mechatronics Automation, Zhangjiajie, China, 11–12 April 2009; pp. 381–384.
  12. Wood, J.; Yu, Z.; Gayah, V.V. Development and evaluation of frameworks for real-time bus passenger occupancy prediction. Int. J. Transp. Sci. Technol. 2023, 12, 399–413.
  13. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013.
  14. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320.
  15. Geçici, E.; Aydin, Z.G. Estimation of future number of passes and optimization of number of trips based on Istanbul hourly public transportation data. Istanb. Univ.—J. Electr. Electron. Eng. 2024, 24, 238–246.
  16. Porat, O.; Fire, M.; Ben-Elia, E. A Comprehensive Machine Learning Framework for Micromobility Demand Prediction. arXiv 2025, arXiv:2507.02715.
  17. Biau, G.; Scornet, E. A random forest guided tour. TEST 2016, 25, 197–227.
  18. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  19. Vandewiele, G.; Colpaert, P.; Janssens, O.; Van Herwegen, J.; Verborgh, R.; Mannens, E.; Ongenae, F.; De Turck, F. Predicting Train Occupancies based on Query Logs and External Data Sources. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW ’17 Companion), Perth, Australia, 3–7 April 2017; pp. 1469–1474.
  20. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  21. Gallo, F.; Sacco, N.; Corman, F. Network-Wide Public Transport Occupancy Prediction Framework with multiple line interactions. IEEE Open J. Intell. Transp. Syst. 2023, 4, 815–832.
  22. Pasini, K.; Khouadjia, M.; Same, A.; Ganansia, F.; Oukhellou, L. LSTM Encoder-Predictor for Short-Term Train Load Forecasting; Springer: Cham, Switzerland, 2020; pp. 535–551.
  23. Hochreiter, S.; Schmidhuber, J. Long Short-Term memory. Neural Comput. 1997, 9, 1735–1780.
  24. Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to Forget: Continual Prediction with LSTM. Neural Comput. 2000, 12, 2451–2471.
  25. Graves, A. Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012.
  26. Monje, L.; Carrasco, R.A.; Rosado, C.; Sánchez-Montañés, M. Deep Learning XAI for bus passenger forecasting: A use case in Spain. Mathematics 2022, 10, 1428.
  27. Shearer, C. The CRISP-DM model: The new blueprint for data mining. J. Data Warehous. 2000, 5, 13–22.
  28. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  29. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250.
  30. Hoppe, J.; Schwinger, F.; Haeger, H.; Wernz, J.; Jarke, M. Improving the prediction of passenger numbers in public transit networks by combining Short-Term forecasts with Real-Time occupancy data. IEEE Open J. Intell. Transp. Syst. 2023, 4, 153–174.
  31. Baro, J.; Khouadjia, M. Passenger flow forecasting on transportation network: Sensitivity analysis of the spatiotemporal features. In Proceedings of the 2021 International Conference on Data Mining Workshops (ICDMW), Virtual, 7–10 December 2021; pp. 734–741.
  32. Mikkelsen, L.M.; Schwefel, H.P.; Madsen, T.K.; Burggraf, A. Comparison of WLAN Probe and Light Sensor-Based Estimators of bus occupancy using live deployment data. Sensors 2022, 22, 4111.
  33. Feldman, S.; Bates, S.; Romano, Y. Calibrated multiple-output quantile regression with representation learning. J. Mach. Learn. Res. 2023, 24, 1–48.
  34. Lv, W.; Lv, Y.; Ouyang, Q.; Ren, Y. A Bus Passenger Flow Prediction Model Fused with Point-of-Interest Data Based on Extreme Gradient Boosting. Appl. Sci. 2022, 12, 940.
  35. Cantwell, M.; Caulfield, B.; O’Mahony, M. Examining the Factors that Impact Public Transport Commuting Satisfaction. J. Public Transp. 2009, 12, 1–21.
  36. Patra, S.S.; Vanajakshi, L. Application of Low-Cost IoT sensors for smart public transportation. Transp. Dev. Econ. 2024, 10, 25.
  37. Gupta, S.; Khanna, A.; Talusan, J.P.; Said, A.; Freudberg, D.; Mukhopadhyay, A.; Dubey, A. A Graph Neural Network Framework for Imbalanced Bus Ridership Forecasting. In Proceedings of the 2024 IEEE International Conference on Smart Computing (SMARTCOMP), Osaka, Japan, 29 June–2 July 2024; pp. 14–21.
  38. Patel, M.; Patel, S.B.; Swain, D.; Shah, S. Unleashing the potential of boosting techniques to optimize Station-Pairs passenger flow forecasting. Procedia Comput. Sci. 2024, 235, 32–44.
  39. Mariñas-Collado, I.; Sipols, A.E.; Santos-Martín, M.T.; Frutos-Bernal, E. Clustering and Forecasting Urban Bus Passenger Demand with a Combination of Time Series Models. Mathematics 2022, 10, 2670.
  40. Zhai, H.; Tian, R.; Cui, L.; Xu, X.; Zhang, W. A novel hierarchical hybrid model for Short-Term bus passenger flow Forecasting. J. Adv. Transp. 2020, 2020, 7917353.
  41. Zou, L.; Shu, S.; Lin, X.; Lin, K.; Zhu, J.; Li, L. Passenger Flow Prediction Using Smart Card Data from Connected Bus System Based on Interpretable XGBoost. Wirel. Commun. Mob. Comput. 2022, 2022, 5872225.
  42. Spanos, G.; Lalas, A.; Votis, K.; Tzovaras, D. Principal component Random forest for passenger demand forecasting in cooperative, connected, and automated mobility. Sustainability 2025, 17, 2632.
  43. Han, Y.; Wang, C.; Ren, Y.; Wang, S.; Zheng, H.; Chen, G. Short-Term prediction of bus passenger flow based on a hybrid optimized LSTM network. ISPRS Int. J. Geo-Inf. 2019, 8, 366.
Figure 1. Average boardings by day of the week across lines in 2023.
Figure 2. Average boardings by hour across lines in 2023.
Figure 3. Route 10 actual vs. predicted boardings by trip for the second week of May 2023.
Figure 4. Route 10 actual vs. predicted boardings by day for May 2023.
Table 1. Comparison of predictive models.
| Method | Applications | Key Limitations |
| --- | --- | --- |
| Linear Regression | Short-term forecasts; impact analysis of operational factors; computationally efficient; highly interpretable. | Assumes linearity; sensitive to outliers/collinearity; can yield illogical predictions. |
| Elastic Net | Forecasting with many correlated features (e.g., spatial–temporal data); automatic variable selection and regularization. | Inherits the linearity constraint; may not capture complex nonlinear passenger flow patterns. |
| Random Forest | Real-time occupancy prediction; modeling nonlinear interactions; robust to outliers and overfitting. | Low interpretability; computationally intensive with large forests. |
| XGBoost | High-accuracy forecasting on complex, mixed data types (including native handling of missing data). | Requires extensive tuning; computationally heavy; low interpretability. |
| LightGBM | Large-scale, network-wide forecasting; applications requiring fast training and low memory use. | Prone to overfitting on small datasets; sensitive to hyperparameter configuration. |
| LSTM | Capturing long-term, complex temporal dependencies. | Very low interpretability; data-hungry; computationally heavy. |
Table 2. Bus lines’ summary statistics.
| Line | Zones | Boardings (Average) | Boardings (Annual) | Trip Time (Average) | Trip Time (Median) |
| --- | --- | --- | --- | --- | --- |
| 1 | 4 | 17.32 | 855 k | 1.20 | 1.20 |
| 2 | 6 | 10.05 | 913 k | 1.97 | 1.95 |
| 3 | 3 | 14.96 | 871 k | 1.19 | 1.18 |
| 4 | 3 | 26.03 | 1121 k | 1.43 | 1.47 |
| 5 | 4 | 13.90 | 742 k | 1.19 | 1.17 |
| 6 | 1 | 27.19 | 597 k | 0.90 | 0.90 |
Table 3. Variables selected for training.
| Variable | Description | Type |
| --- | --- | --- |
| period_bus_time | Period of bus operation time | int * |
| day | Day of the month (1 to 31) | int |
| month | Month of the year (1 to 12) | int |
| quarter | Quarter of the year (1 to 4) | int |
| weekday | Weekday (1 to 7) | int * |
| weekend | Whether the day is a weekend | boolean |
| zone_id | ID of the zone | int |
| boardings_zone_lag_7 | Boardings in the zone lagged 7 days | int |
| boardings_zone_lag_14 | Boardings in the zone lagged 14 days | int |
| total_boardings | Boardings in the zone | int |
| direction ** | Direction of trip (0 = Outward; 1 = Back) | boolean |

* Categorical variable. ** Variable used only to split the data frame in two by travel direction.
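The calendar and lag features listed in Table 3 can be derived with pandas. A minimal sketch, under the assumption that boardings are first aggregated per zone per day (toy data; column values are illustrative only):

```python
import pandas as pd

# Toy daily boardings for a single zone; in the real data, records
# cover every zone_id and the lags are computed within each zone.
idx = pd.date_range("2023-05-01", periods=21, freq="D")  # 2023-05-01 is a Monday
df = pd.DataFrame({"date": idx, "zone_id": 1, "total_boardings": range(21)})

# Calendar features from Table 3.
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["quarter"] = df["date"].dt.quarter
df["weekday"] = df["date"].dt.dayofweek + 1  # 1 = Monday ... 7 = Sunday
df["weekend"] = df["weekday"] >= 6

# 7- and 14-day lagged boardings, shifted within each zone so that
# one zone's history never leaks into another's.
df = df.sort_values(["zone_id", "date"])
grouped = df.groupby("zone_id")["total_boardings"]
df["boardings_zone_lag_7"] = grouped.shift(7)
df["boardings_zone_lag_14"] = grouped.shift(14)
```

Note that the first 7 (respectively 14) days of each zone's series receive missing lag values, which must be dropped or imputed before training.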
Table 4. Hyperparameter distributions for tuning.
| Model | Hyperparameter | Distribution |
| --- | --- | --- |
| ElasticNet | regressor_alpha | uniform(0.001, 100) |
| | regressor_l1_ratio | uniform(0.1, 0.9) |
| | regressor_fit_intercept | [True, False] |
| | regressor_normalize | [True, False] |
| | regressor_max_iter | randint(1000, 5000) |
| | regressor_tol | uniform(1 × 10⁻⁴, 1 × 10⁻²) |
| | regressor_warm_start | [True, False] |
| Random Forest | n_estimators | [100, 200] |
| | max_depth | [10, 20, None] |
| XGBoost | n_estimators | randint(100, 500) |
| | learning_rate | uniform(0.001, 0.2) |
| | max_depth | randint(3, 13) |
| | min_child_weight | randint(1, 11) |
| | gamma | uniform(0, 0.2) |
| | subsample | uniform(0.6, 0.4) |
| | colsample_bytree | uniform(0.6, 0.4) |
| | reg_alpha | uniform(0, 1) |
| | reg_lambda | uniform(0, 1) |
| LightGBM | n_estimators | randint(100, 500) |
| | learning_rate | uniform(0.001, 0.2) |
| | num_leaves | randint(31, 128) |
| | max_depth | [−1, 10, 20, 30] |
| | min_data_in_leaf | randint(20, 201) |
| | feature_fraction | uniform(0.6, 0.4) |
| | bagging_fraction | uniform(0.6, 0.4) |
| | bagging_freq | randint(0, 11) |
| LSTM | units | [10, 50, 100] |
| | optimizer | [‘adam’, ‘rmsprop’] |
| | batch_size | [10, 50, 100] |
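Distributions of this form map directly onto scikit-learn's RandomizedSearchCV, where list entries are sampled uniformly and scipy.stats objects are sampled from their distribution. Note that scipy's uniform(loc, scale) draws from [loc, loc + scale], so an entry such as uniform(0.6, 0.4) covers the range [0.6, 1.0]. A minimal sketch for the Random Forest row of Table 4 (synthetic data; the iteration budget and fold count are assumptions, as the paper's values are not reproduced here):

```python
import numpy as np
from scipy.stats import uniform
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the boarding features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(size=200)

# Random Forest row of Table 4: both entries are lists, sampled uniformly.
param_distributions = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                 # assumption: iteration budget not stated here
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)

# Continuous rows, e.g., XGBoost's learning_rate uniform(0.001, 0.2),
# sample from [0.001, 0.201] under scipy's (loc, scale) convention.
lr_dist = uniform(0.001, 0.2)
sample = lr_dist.rvs(random_state=0)
```

The same pattern applies to the XGBoost and LightGBM rows by swapping in the corresponding estimator and dictionary.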
Table 5. Generalistic approach model selection.
| Model | RMSE | MAE | R² | Bias |
| --- | --- | --- | --- | --- |
| Linear Regression | 8.065 | **5.949** | −0.266 | **−0.837** |
| **ElasticNet** | **7.761** | **5.825** | **−0.188** | **0.076** |
| Random Forest | 8.801 | 6.451 | −0.716 | 1.021 |
| XGBoost | **7.733** | 6.182 | **−0.232** | 1.936 |
| LightGBM | 7.833 | 6.174 | −0.274 | 1.698 |

Bold values indicate the two best results for each metric, while the bold model name denotes the overall best-performing model.
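For reference, the four metrics reported in Tables 5–8 can be computed as follows. Treating Bias as the mean forecast error (mean of predicted minus actual) is an assumption here, since the paper's exact sign convention is not reproduced in this excerpt:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([12.0, 8.0, 15.0, 10.0])   # observed boardings (toy values)
y_pred = np.array([10.0, 9.0, 13.0, 12.0])   # model predictions (toy values)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
bias = float(np.mean(y_pred - y_true))  # assumption: Bias = mean forecast error

print(rmse, mae, r2, bias)
```

A bias near zero with a large RMSE indicates that over- and under-predictions cancel out on average, which is why both metrics are reported side by side.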
Table 6. Performance metrics for generalistic approach.
| Route | RMSE | MAE | R² | Bias |
| --- | --- | --- | --- | --- |
| 10 | 13.023 | 8.808 | 0.393 | 0.001 |
| 11 | 12.954 | 9.252 | 0.410 | 0.000 |
| 20 | 8.198 | 5.883 | 0.132 | 0.000 |
| 21 | 9.022 | 5.831 | 0.237 | −0.001 |
| 30 | 14.909 | 9.495 | 0.345 | 0.021 |
| 31 | 12.594 | 8.763 | 0.166 | 0.001 |
| 40 | 11.731 | 8.275 | 0.351 | −0.003 |
| 41 | 12.683 | 8.951 | 0.204 | 0.001 |
| 50 | 19.308 | 13.472 | 0.359 | 0.002 |
| 51 | 16.806 | 12.492 | 0.239 | −0.003 |
| 60 | 18.057 | 11.636 | 0.358 | 0.000 |
| 61 | 16.802 | 12.351 | 0.272 | −0.001 |
Table 7. Performance metrics for specialized approach.
| Route | Best Model | RMSE | MAE | R² | Bias |
| --- | --- | --- | --- | --- | --- |
| 10 | XGBoost | 8.553 | 6.173 | 0.733 | 0.618 |
| 11 | XGBoost | 9.774 | 7.275 | 0.577 | 0.899 |
| 20 | LSTM | 7.684 | 5.774 | −0.184 | 0.965 |
| 21 | XGBoost | 7.587 | 4.989 | 0.345 | −0.328 |
| 30 | XGBoost | 11.227 | 7.046 | 0.621 | −0.931 |
| 31 | Random Forest | 8.899 | 6.162 | 0.397 | 0.134 |
| 40 | XGBoost | 8.926 | 6.745 | 0.107 | 1.996 |
| 41 | XGBoost | 12.462 | 8.459 | 0.027 | −2.439 |
| 50 | XGBoost | 16.289 | 12.936 | −0.032 | 6.689 |
| 51 | LightGBM | 15.198 | 11.042 | 0.346 | 0.031 |
| 60 | XGBoost | 13.864 | 9.089 | 0.631 | −0.023 |
| 61 | LightGBM | 13.309 | 9.516 | 0.555 | −0.886 |
Table 8. Generalistic and specialized results comparison.
| Statistic | RMSE (Gen.) | MAE (Gen.) | R² (Gen.) | Bias (Gen.) | RMSE (Spec.) | MAE (Spec.) | R² (Spec.) | Bias (Spec.) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sum | 166.086 | 115.208 | 3.467 | 0.019 | 133.772 | 95.206 | 4.123 | 6.723 |
| Mean | 13.841 | 9.601 | 0.289 | 0.002 | 11.148 | 7.934 | 0.344 | 0.560 |
| Median | 12.989 | 9.101 | 0.308 | 0.000 | 10.500 | 7.160 | 0.371 | 0.083 |

Gen. = generalistic model; Spec. = specialized model.
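The Sum, Mean, and Median rows simply aggregate the per-route columns of Tables 6 and 7. A sketch for the generalistic RMSE column (small last-digit differences from Table 8 are expected, since the per-route values are themselves rounded):

```python
import pandas as pd

# Per-route RMSE of the generalistic model, as reported in Table 6.
rmse = pd.Series([13.023, 12.954, 8.198, 9.022, 14.909, 12.594,
                  11.731, 12.683, 19.308, 16.806, 18.057, 16.802])

summary = rmse.agg(["sum", "mean", "median"]).round(3)
print(summary)  # mean ≈ 13.841, median ≈ 12.989, matching Table 8
```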