1. Introduction
Among the strategies developed to implement mass transportation (understood as moving large numbers of citizens) are means of transport such as trams, subways, and Bus Rapid Transit (BRT). BRT has been widely implemented in the main cities of Colombia and other Latin American countries. A notable example is the BRT system implemented in Bogotá (Colombia), known as Transmilenio [
1].
Transmilenio is an important transportation system for the country’s capital (Bogotá, Colombia), which provides an outstanding service to citizens, reducing travel times at favorable prices; however, it still has aspects to improve. The main complaints of Transmilenio users concern high congestion at stations, long waiting times, and a perception of insecurity [
2]. The improvement of Transmilenio’s infrastructure in recent years has not been sufficient to keep pace with Bogotá’s population growth, which went from 6,840,116 inhabitants in 2005 to a projected 7,930,000 inhabitants in 2022 [
3]. Additionally, Bogotá is one of the most visited cities in Colombia, which increases the demand for public transit in the city.
Most BRT systems in Latin America share similar drawbacks related to high demand during critical hours of the day (peak hours), for which the current infrastructure is insufficient, even where it has been improved. Passenger load is an important input for the adequate planning of BRT routes: calculating expected demand, both during peak hours and at other times of the day, seeks to improve service quality and user perception.
Passenger load, in a BRT system, is defined as the number of passengers who use the service at a specific time, in one direction of the road, at a particular station [
4]. Automatically predicting the passenger load for the following days of operation is usually difficult. Although some BRT systems have online service operation data, predicting passenger flow requires other types of technological tools. BRT systems commonly use an origin–destination (OD) matrix to calculate passenger flow at stations and plan their service appropriately (routes, vehicles, frequencies); however, these tools require information on each passenger’s entry and exit stations. Typically, a BRT management system detects the entry of a user into a station but determines neither the direction of that user (toward the next or previous station) nor the station at which the user exits the system. The direction of passengers and their exit station can be obtained through sensors and smart cards in the system infrastructure, but this increases congestion at the stations (due to the time each user takes to do so), so it is not commonly done.
BRT service routes can be programmed using specialized software and passenger load data, specifying the number of vehicles per route, route frequencies, and other parameters. Calculating the passenger load at all stations of a BRT system is not a quick or efficient process, since this excessively increases the amount of data to be evaluated and therefore the time to do so. Additionally, the greatest congestion problems occur at key stations of the BRT service [
4]. Therefore, a selection of key stations for the required calculation can be representative of the service in general. A limited number of stations could be representative of a system like Transmilenio (10, for example, of the 138 currently existing), considering its size and number of users per day.
Machine learning (ML) has been one of the fastest-growing technologies in recent years. ML has been used to analyze large volumes of data and to predict variables of interest. An ML model could predict the passenger load per station in a BRT system if a dataset with the necessary information about the transportation service is available [
5]. Such data have been collected and are available for some BRT systems, depending on their level of technological development and their progress in Intelligent Transportation Systems (ITSs). Transmilenio (Bogotá) has made a significant investment in ITSs, which has allowed it to collect several datasets in recent years. These datasets have been used to manage and continuously improve the transportation system. Some of them have been made available to the general public as open data. The open data provided by Transmilenio could be useful for generating an ML model that predicts passenger load at certain stations. Currently, Transmilenio predicts passenger load at certain stations using an OD matrix, which takes considerable time (several months) to analyze. This prediction analysis assumes certain values, for example, the number of exits at a station, because exact detection of passenger exits is not possible due to logistical constraints.
Although there are different studies on the operational improvement of BRT systems using ML, these have been mainly focused on predicting the time of arrival of vehicles at stations or continuous monitoring throughout routes [
6,
7]. Although passenger load data is used in some research to predict other service-related variables in BRT, it is not used as a target variable in the machine learning works consulted.
The problem we are trying to solve is to calculate the passenger load in a BRT system in a more automated and efficient way, avoiding calculations such as those performed with an OD matrix. Considering the above, we proposed the implementation of an ML model that predicts the passenger load at key stations in a BRT system, using and pre-processing the open data available. The most representative contributions of this article are the following:
The implemented ML model, which predicts the passenger load at each BRT system station. By predicting the passenger load for the following days, BRT system operators can adjust the number of vehicles per route and the frequency of those routes to improve user service.
The generated datasets obtained after a rigorous data preparation process, which could be used by similar projects in BRT systems in other cities. These datasets could be used either as input to developed models or as templates to generate a similar dataset.
The model evaluation using the obtained datasets and certain ML algorithms. For this model, the best results in terms of metrics were sought, adjusting certain parameters and the amount of data used.
The proposals for improving the developed model are options that can be considered in other similar ML models, to seek better results in the selected metrics.
The novelty of our work is related to the specific prediction of passenger load for the improvement of BRT systems’ service. Although service improvement in BRT systems has been investigated for several years, and machine learning work has been conducted on the topic in the last decade, most of the related articles focus on other types of predictions, such as travel time or waiting time. These works do use passenger load as input data, but their predictions target other variables. In a literature review conducted on this subject (
Section 2.1), we identified that some works focus on the design and planning of bus networks; others focus on travel and waiting times; and others focus specifically on passenger load or demand. Additionally, the dataset generated from the daily data of the BRT system used as a use case is an important tool for future studies related to the improvement of this means of transport. The fields ultimately used in the dataset, its format, the processing performed, and the ideal number of days of data are relevant aspects for such future studies.
One of the benefits of the work performed is precisely that BRT operators can review the information obtained from the passenger load predictions, compare it with current data, and subsequently adjust the route planning; however, these adjustments are not analyzed in this work. Modifications in route planning would be actions subsequent to the results of this research.
The next sections of this article are organized as follows:
Section 2 presents the materials and methods of this research.
Section 3 presents the results obtained. The discussion of the research results is presented in
Section 4. Finally,
Section 5 presents the conclusions.
2. Materials and Methods
The selected methodology for implementation and evaluation of the ML model was based on the Cross-Industry Standard Process for Data Mining, CRISP-DM [
8]. CRISP-DM is a methodology for developing data mining processes that provides a lifecycle model for data analysis projects. This model includes six interrelated phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment [
8].
A systematic literature review (
Section 2.1) was conducted in the business understanding phase.
Section 2.2,
Section 2.3,
Section 2.4 and
Section 2.5 present the data understanding phase of CRISP-DM (with a review of Intelligent Transportation System architectures and their services, BRT systems, available BRT open datasets, and algorithms used in the related works reviewed). Next,
Section 2.6 presents the activities performed in the data preparation phase (of CRISP-DM). Finally,
Section 2.7 and
Section 2.8 present the modeling and evaluation phases, respectively.
2.1. Systematic Review
The systematic review conducted in this paper used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology, 2020 version [
9]. This methodology was designed to assist the literature review process through three stages: identification, screening, and inclusion. Below, the most relevant information about each of these three stages is presented.
In the first stage (identification), the databases for the document search were selected: ScienceDirect, Scopus, Web of Science, and the Unicauca repository. The search on the specific topics of this work (BRT, passenger load, control and prediction, and ML) yielded 771 documents, which were evaluated in the following phase. Other transportation modes related to BRT were not considered in these search topics. Our work focused specifically on BRT systems, as this was the transportation mode of interest (because it is widely used in developing countries, such as Colombia). We did not consider other modes of transportation, such as buses or subways, in the research process (including the systematic review). Although they commonly operate stations, they have operational characteristics that differ from those of BRT systems. A relevant characteristic of BRT systems compared to subways is that, in some sections, BRT systems may not have dedicated lanes and may share streets with other vehicles (transit vehicles and private vehicles), which affects travel times. A relevant characteristic of buses compared to BRT systems is that some bus systems (at least in developing countries) do not have adequate stop stations for monitoring and data collection.
In the screening stage, documents were filtered based on their relevance to the main topic. In addition, redundant documents (documents that appeared in more than one database) were removed. Finally, 94 documents were obtained, which were evaluated in the last phase.
In the last stage (inclusion), documents were filtered by reviewing other sections of the documents, such as the Conclusions, Results, and Methods. Upon completion, a total of 31 documents were obtained, of which only 13 were selected due to their strong relevance to the research objective. These 13 documents are presented in
Table 1, where four relevant criteria are considered regarding the related works and our proposal. Only our proposal met all four criteria, and only 5 of the 13 works took three of the four into account.
2.2. Intelligent Transportation System (ITS) Architectures and Their Services
An ITS architecture provides a common basis for planners and engineers with differing concerns to conceive, design, and implement systems, using a common language for delivering an ITS, but it does not mandate any particular implementation [
23].
Our work focuses on mass transit services, so it is essential to consider the recommendations for this type of service using existing ITS architectures (as a global reference). The most used ITS architectures worldwide are the American architecture, named Architecture Reference for Cooperative and Intelligent Transportation, or its acronym, ARC-IT [
23], and the European one, named FRamework Architecture Made for Europe, FRAME [
24].
The ARC-IT architecture was considered for our work due to the high level of detail of its services and available tools. ARC-IT divides its functionality into 10 service areas. Each service area has a certain number of services (service packages), and each service is adequately detailed with physical diagrams (containing physical and functional objects and information flows). The architecture website provides fairly detailed information on any service required, and each of its components is explained [
23].
The most relevant ARC-IT service area for this work is the public transportation area, which includes services identified as relevant, such as “Transit Vehicle Tracking”, “Transit Fixed-Route Operations”, and “Transit Fleet Management”. The last service focuses on the automated scheduling and monitoring of vehicle maintenance (
Figure 1 shows the physical diagram of this service, with each of its components, aspects, and roles). This figure shows the “Transit Management Center” component, which is responsible for processing the collected information and scheduling preventive and corrective maintenance for the system. This service also facilitates the daily maintenance of the transportation fleet inventory, as well as the assignment of specific schedules, routes, and operators to the system [
23].
We reviewed the recommended operating procedures of ITS architectures, such as ARC-IT, for the Transit Fleet Management service. With the help of ARC-IT, we also sought to understand the related business processes, including how fleet management is performed, with demand estimation and route planning. When reviewing this particular service (
Figure 1), we considered it essential to identify the components and information flows that are key to the proper functioning of fleets and to efficient route and frequency planning. We identified, for example, the importance of transit management data related to route scheduling and the continuous reporting of these data to the management center.
2.3. BRT Data Review
BRT systems have been introduced in many cities around the world as efficient solutions for transit service. Global BRT Data collects and shares data on BRT transport systems worldwide [
25]. This site shows the number of passengers per day using BRT services, the cities where they are located, and the total distance traveled.
According to these statistics, around 34,870,954 passengers are transported worldwide per day, with the largest share in Latin America (59.6%), followed by Asia (26.49%); the rest is distributed among Europe, North America, Africa, and Oceania. Approximately 20,785,206 people are transported per day in Latin America, and the countries with the highest ridership are Brazil, Colombia, Mexico, Argentina, and Ecuador [
25].
Some authors (authors of [
12,
13,
19,
21]) believe that BRT systems should explore various alternatives to optimize their operation. These authors consider it appropriate to conduct studies focused on solving the most common issues related to high congestion at stations and excessive user wait times. These aspects could be improved without expanding the infrastructure by properly planning vehicle routes, thus improving users’ quality of life and optimizing BRT systems.
A prediction model is quite useful for urban development planning and traffic control, although factors such as road incidents and climate change, among others, can complicate it [
21]. Prediction models have been used in different studies, focused on diverse factors of the systems. For example, some concentrate on the design and planning of bus networks, that is, on the coverage and frequency of routes per station [
13]; others concentrate on travel times and waiting times [
19]; and others concentrate, specifically, on passenger load or demand [
17]. The latter is quite important because it can improve future operations and satisfy the user needs already mentioned above.
BRT systems have implemented ML models to try to improve their operation, since these models are economical and efficient. Such models use specific data collected in different ways, depending on the study to be performed. Typically, the data are obtained after investments in ITSs implemented with IoT systems, so it is essential that a BRT system has made a significant investment in technology for its improvement.
Although ML requires high computational resources due to the high amount of data to process, the benefits are great when it is used, for example, in passenger load prediction models, facilitating the corresponding analyses by delivering more accurate results.
Regarding ITS architectures, mainly ARC-IT (which was selected for this work), and the services related to BRT operation and planning, the ITS point of view reveals data and processes relevant to achieving adequate operation and planning. These data and processes must be considered when proposing a system that calculates the passenger load in a BRT system. However, we considered it necessary to evaluate whether the selected BRT system had collected data and made it available for analysis by third parties.
When we explored the data needed for this study, we initially analyzed information on BRT operation, as recommended by ITS architectures. Subsequently, we reviewed several BRT systems (national and international). The most relevant were the following:
Orange Line BRT, located in Los Angeles, the United States.
Ahmedabad BRTS, the public transport system of Ahmedabad Janmarg Ltd. in India.
The BRT system in the Republic of Malta.
Dar es Salaam Bus Rapid Transit (DART), implemented in Dar es Salaam (Tanzania).
The BRT public transportation system of Curitiba (Brazil).
Transmetro in Guatemala City.
Metrobus in Mexico City.
The Massive Integrated West (named MIO), the public transportation system in Cali, Colombia.
Metroplús, from the city of Medellín, Colombia, integrated with the Metro.
Transmilenio, the BRT system of the capital city of Colombia, Bogotá.
After reviewing the information available on the different BRT systems, we considered it appropriate to use Bogotá’s Transmilenio BRT system, as it offers the most useful and easily accessible open data for this work. Transmilenio’s information includes data on the number of passengers, bus frequencies, routes, stations, and fares. This information is essential for conducting a BRT analysis and modeling passenger flow predictions [
26]. The other BRT systems we found either lacked the necessary information or required additional permissions to access it.
2.4. Open Datasets Available
The files offered by Transmilenio’s open data are divided into two different types of data. The first corresponds to the data grouped into tables, in JSON and PBF format, that represent geospatial information; the second type corresponds to those in CSV format [
26], related to passenger validations (entries and exits) at trunk stations (with the datasets named “Exits” and “Trunk Validations”). The validations were captured using an IoT system; users register (only upon entering the system, at the stations) using turnstiles and smart cards provided by Transmilenio.
2.5. Algorithms Used in Related Works
We identified that some supervised ML algorithms could be used independently to create a model that predicts the required data using the existing datasets, based on the references highlighted in the review of the available literature. The algorithms and the works that used them are presented below.
2.5.1. Random Forest (RF)
A model with this algorithm was applied in [
22]. This work handled multiple variables and captured nonlinear interactions between them. The authors of [
22] identified that there was no linear relationship between the variables used and passenger load. Variables such as rush hours, alterations on public roads, and others showed deviations in the prediction analysis. The model with the RF algorithm was less prone to overfitting than individual decision trees. RF can handle multiple variables and is robust to noise and values outside the study range. Pre-processing data, transforming variables, and splitting training and test data are commonly necessary.
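For illustration, the following is a minimal sketch of an RF regression model for passenger load (not the exact pipeline of [22]); the file name and the use of “Inputs” as the target column are assumptions based on the data preparation described in Section 2.6.

```python
# Minimal RF sketch for passenger load regression; the file name and
# target column ("Inputs") are assumptions, not the pipeline of [22].
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("dataset1_prepared.csv")       # hypothetical prepared file
X = df.drop(columns=["Inputs"])                 # predictors (time, station, ...)
y = df["Inputs"]                                # passengers per 15 min interval

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)      # 80-20 split (Section 2.7.1)

model = RandomForestRegressor(random_state=42)  # default hyperparameters
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```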
2.5.2. Extreme Gradient Boosting (XGBoost)
This algorithm can be used in combination with multiple weak learning models to create a stronger and more accurate model. It is based on decision trees, which are efficient in terms of performance and accuracy. XGBoost, like RF, can handle multiple variables and capture nonlinear interactions between them. It might be suitable for this problem due to its efficiency and ability to handle multiple variables [
27]. According to [
18], a process could be followed using XGBoost to predict passenger loading in a BRT system. The authors of [
18] mention that the location of a station is a key point (a Point of Interest, POI), carrying more informational weight for model training. The advantages of XGBoost are training speed, scalability, and performance. Being tree-based, XGBoost handles data outside the study range well, and it could be ensembled with the LSTM and RF algorithms [
27].
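A comparable sketch with XGBoost, under the same assumptions about the prepared data as the RF example above, is shown below.

```python
# XGBoost regression sketch under the same data assumptions as above.
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset1_prepared.csv")       # hypothetical prepared file
X, y = df.drop(columns=["Inputs"]), df["Inputs"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)      # 70-30 split (Section 2.7.1)

model = XGBRegressor(objective="reg:squarederror")  # default tree booster
model.fit(X_train, y_train)
preds = model.predict(X_test)
```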
2.5.3. Long Short-Term Memory (LSTM)
This algorithm, used in [
12,
15,
17,
18], is designed to address sequence and time series problems. In our work, the prediction of passenger loading in a BRT system could be defined as a time series problem. LSTM is best suited if high-frequency sequential data is available and long-term temporal dependencies need to be captured, so variables available in the datasets could be highly useful as they meet these requirements.
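A Keras sketch of this idea follows; the window size and layer sizes echo the tuned values reported in Section 3.3, and the feature matrix (with the target in column 0) and its file name are assumptions.

```python
# LSTM sketch for the 15-minute passenger series; window/layer sizes echo
# Section 3.3, and the feature matrix (target in column 0) is assumed.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

features = np.load("dataset1_features.npy")       # hypothetical file, rows in time order

def make_windows(values, time_steps=9):
    """Build (samples, time_steps, features) windows and next-step targets."""
    X, y = [], []
    for i in range(len(values) - time_steps):
        X.append(values[i:i + time_steps, :])     # past window of features
        y.append(values[i + time_steps, 0])       # next "Inputs" value
    return np.array(X), np.array(y)

X_seq, y_seq = make_windows(features)
model = Sequential([
    LSTM(50, return_sequences=True, input_shape=X_seq.shape[1:]),
    LSTM(100),
    Dense(1),                                     # regression output
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_seq, y_seq, epochs=20, batch_size=32, validation_split=0.2)
```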
2.5.4. Support Vector Machine (SVM)
A model with this algorithm was used in [
12,
19]. SVM can handle nonlinear relationships between variables and is robust to outliers or extreme values in training data. However, SVM can be computationally expensive for large datasets and requires adjustment of the dataset to fit the prediction model. It could be suitable for this work, as it makes use of time windows that are apt for prediction from historical data.
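For completeness, a minimal SVR sketch is shown below; as Section 2.8 explains, this algorithm was ultimately discarded in our work because its training time was excessive on the datasets used.

```python
# Minimal SVR sketch; scaling matters for SVMs. Training is expensive on
# large datasets (the reason SVM was later discarded in Section 2.8).
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
# svr.fit(X_train, y_train)  # X_train, y_train as in the sketches above
```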
2.6. Data Preparation
This subsection covers all the activities required to build the final datasets from the initial raw data. The tasks performed included data selection, which determined the data used for analysis; data cleaning, which involved correcting or removing incorrect, incomplete, or irrelevant data; and data construction, where new attributes or variables could be generated to improve the model’s ability to uncover meaningful patterns [
8]. This section has a high level of detail because we considered it relevant to describe the process followed to obtain the final datasets, which are one of the most representative contributions of this work. These datasets could be used by similar BRT system projects in other cities, either as input to developed models or as templates to generate a similar dataset.
2.6.1. Data Selection
This initial task made it possible to determine which data were used for the analysis. In this work, two datasets were selected, which have the characteristics necessary for the development of its objectives. These were the “Exits” dataset and the “Trunk Validations” dataset, which will be called Dataset 1 and Dataset 2, respectively, from this point onwards. A daily file for each of the two datasets is available on Transmilenio’s open data site, from 1 January 2023 to the current date [
26].
Dataset 1 has a 15 min periodicity, and Dataset 2 presents each validation performed (one record per validation). A validation is the process in which a user passes their smart card through a reader to enter the transportation system. The adjustments required to synchronize the periodicity of the two datasets are presented in later sections of this article.
Initially, the size of the available daily files of the Dataset 1 type varied roughly between 16.9 MB and 17.8 MB, with an average daily size of around 17.38 MB (measured over a weekly test period, between 14 and 20 August 2023). Meanwhile, for files of the Dataset 2 type, the size varied daily between 480 and 680 MB in the same test period, with an average of approximately 580 MB per day.
Considering the above data, we determined that grouping these datasets since 1 January 2023 would create too large a file, and the computational resources required to process it would exceed the available capacities. In addition, the datasets were organized daily and needed to be downloaded one by one to perform the subsequent tasks (which would take too much time); therefore, we limited the data selection.
Finally, the selected data included files between 6 August and 27 August 2023. This period included key dates such as holidays (Monday, 7 August, and Monday, 21 August). In this way, the selected data include every day of the week as well as atypical days such as holidays; therefore, these data provide a clear view of the volume of data handled weekly. We emphasize that the selected data include all service schedules for each day of the selected period, thus covering peak travel periods. By including each day’s complete schedule, we aimed to ensure that the proposed model performs well under the high fluctuations in demand throughout the day that occur in urban transport.
2.6.2. Data Cleaning
This task focused on correcting or deleting incorrect, incomplete, or irrelevant data. The activities performed included handling outliers, treating missing values, and correcting errors. We reviewed the aspects that governed each dataset, its columns, and the types of data used, to contextualize it. We decided to eliminate records with missing values to avoid unnecessary data saturation. Regarding error correction, we identified and treated rows with discrepancies and repeated values. The cleaning procedure for the two selected datasets is presented next.
Dataset 1 cleaning. We analyzed the columns in the dataset to eliminate any that were not useful, considering that each of the daily files required a process of standardization and unification.
The columns considered useful in this dataset were “Transaction_date”, “Time”, “Station”, “Station_Access”, “Device”, “Inputs” and “Outputs”. Next, we filtered Dataset 1 to keep only the records related to the 10 key stations (to further reduce the number of records). These stations were identified through a process of reviewing and counting the number of validations at each station in a given period. The period for the selection of the 10 key stations is the one described in
Section 2.6.1: between 6 August and 27 August 2023. This period included key dates such as holidays (7 August and 21 August). The selected data included any day of the week and atypical days such as holidays; therefore, this data provided a clear view of the weekly volume of data handled. Additionally, we clarify that the selection (of the 10 key stations) was based solely on total passenger volume, so as not to exclude any type of station from the system (transfer stations, intermediate stations, or stations at the beginning or end of trunk routes). However, it is worth noting that of the 10 key stations selected, 6 of them are stations known as “Portals” (the start or end of trunk routes), which are those that normally have the highest number of passengers at any given time. To validate that the 10 selected stations have a high passenger volume throughout the year, some verifications were performed during other periods in other months of the year, obtaining similar results.
We filtered the data to keep only the records whose “Inputs” and “Outputs” columns were not equal to 0. Finally, we applied a lower time limit of 3:45 a.m., which is the opening time of the Transmilenio BRT system to the public.
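As an illustration, these cleaning steps could be expressed in pandas as follows (a sketch: the file name, the “HH:MM” time format, and the station list are assumptions, with the real 10 key stations replaced by placeholders).

```python
# Sketch of the Dataset 1 cleaning steps; file name, "HH:MM" time format,
# and the KEY_STATIONS placeholder list are assumptions.
import pandas as pd

KEY_STATIONS = ["Station_A", "Station_B"]  # placeholder for the 10 key stations

df = pd.read_csv("dataset1_daily.csv")     # one daily file (hypothetical name)
df = df[["Transaction_date", "Time", "Station", "Station_Access",
         "Device", "Inputs", "Outputs"]]              # keep useful columns
df = df.dropna()                                      # remove missing values
df = df[df["Station"].isin(KEY_STATIONS)]             # 10 key stations only
df = df[(df["Inputs"] != 0) & (df["Outputs"] != 0)]   # drop zero-value records
df = df[df["Time"] >= "03:45"]                        # from opening time onwards
```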
Dataset 2 cleaning. We analyzed the columns in the dataset to eliminate any that were not useful (similarly to what was performed for Dataset 1).
The columns considered useful in this dataset were “Transaction_date”, “Station”, “Station_Access”, and “Device”. Next, we identified the 10 key stations of the system in Dataset 2, which were the same as those found in the Dataset 1 cleaning process, confirming a clear trend in the system’s key stations. In Dataset 2, only the records related to these 10 key stations were kept. Finally, we set a lower time limit of 3:45 a.m., to obtain a structure similar to that of Dataset 1.
2.6.3. Data Construction
In this task, we created new attributes or variables from the existing data (in Datasets 1 and 2), which helped to improve the analysis of the data by the algorithms used.
Dataset 1 construction. In Dataset 1, the “Transaction_Date” column was modified by separating the dates into “Month” and “Day” components, which were stored in independent columns. Subsequently, the year component was removed from the “Transaction_Date” column, since it was constant (2023) across all daily datasets. A similar modification was made to the “Time” column, which was separated into two columns, “Hour” and “Minute”. In addition, to know the specific day of the week to which the data belonged, we added an identifier named “Week_Day_Number”. This new column indicated the days of the week from 1 to 7, starting with Monday as 1 and increasing sequentially to Sunday as 7. Another column called “Holidays” was created, with a value of 1 for holidays and 0 otherwise.
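These new attributes could be derived roughly as follows (a sketch; the date and time formats and the holiday list are assumptions).

```python
# Sketch of the Dataset 1 attribute construction; date/time formats and
# the holiday list are assumptions.
import pandas as pd

dates = pd.to_datetime(df["Transaction_date"])        # assumed parseable dates
df["Month"], df["Day"] = dates.dt.month, dates.dt.day
df["Week_Day_Number"] = dates.dt.dayofweek + 1        # Monday=1 ... Sunday=7
HOLIDAYS = {"2023-08-07", "2023-08-21"}               # holidays in the period
df["Holidays"] = dates.dt.strftime("%Y-%m-%d").isin(HOLIDAYS).astype(int)

time_parts = df["Time"].str.split(":", expand=True)   # assumed "HH:MM" strings
df["Hour"] = time_parts[0].astype(int)
df["Minute"] = time_parts[1].astype(int)
df = df.drop(columns=["Transaction_date", "Time"])    # year was constant (2023)
```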
Dataset 2 construction. In Dataset 2, we identified that a new column was required: a target column for the prediction model. For this, we grouped records by time, creating a new column that counted the validations (entries) made, with a periodicity of 15 min. With this grouping, we obtained a target column named “Inputs”, corresponding to the number of users entering the system every 15 min at the corresponding station and access. This periodicity was the same as that of Dataset 1, allowing the datasets to be evaluated independently and their results compared.
The “Transaction_Date” column, which in this case recorded the date and time of the validations at the trunk stations, was divided into “Month”, “Number_Day”, “Hour_Number”, and “Minute” (in a similar way to what was performed for Dataset 1), facilitating separate analysis of the date and time of the transactions. As for the days of the week and holidays, two columns were created, one for the day identifier and one for the holidays, named “Week_Day_Number” and “Holidays”, respectively (similarly to what was performed for Dataset 1).
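The 15 min grouping that creates the “Inputs” target could be sketched as follows (assuming df2 holds one row per validation with a full timestamp).

```python
# Sketch of the 15-minute grouping for Dataset 2's "Inputs" target;
# assumes df2 holds one row per validation with a full timestamp.
import pandas as pd

ts = pd.to_datetime(df2["Transaction_date"])
df2["Interval"] = ts.dt.floor("15min")                   # 15 min periodicity
dataset2 = (df2.groupby(["Station", "Station_Access", "Interval"])
                .size()                                   # count validations
                .reset_index(name="Inputs"))              # target column
```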
2.6.4. Data Integration
In this task, data from different sources were combined to create a coherent and complete dataset. This is crucial in projects involving multiple databases or files, as is the case in our work.
Dataset 1 integration. In this stage of the preparation of Dataset 1, after filtering and standardizing each daily version of Dataset 1, the daily files were joined by vertical concatenation. All datasets (of the Dataset 1 type) had the same number of columns and could be properly integrated after the filtering and preparation performed. Therefore, we finally obtained a single Dataset 1 with the prepared data from 6 August at 3:45 a.m. to 27 August at 11:45 p.m., with 437,434 records and no null values.
Dataset 2 integration. All daily versions of Dataset 2 went through filtering, standardizing, and grouping processes to have the best-adjusted data possible (similarly to what was performed for Dataset 1). After this, the daily datasets were joined into one, integrating them vertically in order of time. The version of Dataset 2 finally generated used the daily files from 6 August at 4:00 a.m. to 27 August at 11:45 p.m. and contained 322,426 rows.
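The vertical concatenation could be performed as follows (a sketch; the directory layout and file naming are assumptions).

```python
# Sketch of the vertical concatenation of the prepared daily files;
# the glob pattern / directory layout is an assumption.
import glob
import pandas as pd

paths = sorted(glob.glob("prepared/dataset1_2023-08-*.csv"))  # time order
dataset1 = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
```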
2.6.5. Data Formatting
For this task, the goal was to change the data format to make it compatible with analysis tools, such as the normalization of numerical ranges or coding of categorical variables. We encoded categorical variables into numerical values to use them in prediction models, because many ML algorithms require all input and output variables to be numeric [
28].
Dataset 1 formatting. In Dataset 1, we reviewed columns with unique categorical values. We used One-Hot encoding to facilitate understanding of the dataset, taking into account the interaction between the model and the dataset for prediction [
29]. In the first instance, for each unique value of the “Station” column, a new column was created, which took a value of 1 in that column if it matched the value of the record; otherwise the column was assigned a value of 0. We removed the “Station_Access” and “Device” columns because they contained too many unique values, and their data type was “object”. Ten new columns were created for Dataset 1 (concerning the mentioned “Station” field), representing the 10 key stations previously identified. Finally, 17 columns were obtained in Dataset 1, including the 10 columns corresponding to the key stations.
Dataset 2 formatting. Following the same guidelines as for Dataset 1, we performed the same One-Hot encoding of the “Station” column. The “Station_Access” and “Device” columns were removed too. The key stations became columns with binary float values indicating the station at which the validations (within each 15 min interval) were performed. Finally, 16 columns were obtained in Dataset 2, including the 10 columns corresponding to the key stations.
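The One-Hot encoding and column removal applied to both datasets could be sketched as follows.

```python
# Sketch of the formatting step: One-Hot encode "Station" and drop the
# high-cardinality object-type columns.
import pandas as pd

df = pd.get_dummies(df, columns=["Station"], dtype=float)  # one column per key station
df = df.drop(columns=["Station_Access", "Device"])         # too many unique values
```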
2.7. Modeling
This CRISP-DM phase included all the activities necessary to build the prediction model. This was possible because, at this point, there was a clear and comprehensive understanding of the problem and the business. Additionally, the final dataset was constructed and defined from the initial raw data [
8]. Regarding the algorithms selected to perform the modeling stage, we considered the four reviewed options in
Section 2.5 of this article (RF, LSTM, XGBoost, and SVM). We evaluated aspects such as prediction accuracy, model complexity, interpretability, training time, and adaptability for these four options. Finally, we decided to use all four algorithms to generate the candidate models, since the analysis of the criteria showed that all of them could be suitable. In addition, this allowed a comparison of the results obtained with each of them.
To achieve this goal, we considered it essential to define the models, adapt the algorithms to be used, pre-process the datasets, and choose any additional ML tools required. The initial models used Datasets 1 and 2, after the entire preparation process mentioned above. Considering the characteristics of the datasets obtained, we defined a supervised ML process, since a target column was available in each dataset. Specifically, we defined a regression process, considering that the value of the target column was not a fixed value within a defined group of values but a quantitative variable within a considerable range (the number of passengers at a specific station, at a time of day, on a specific day). We also determined that each dataset should be tested with the four previously selected algorithms. Model testing was planned for each algorithm and each dataset, with two types of training and testing splits. Subsequently, we planned to use the same models with cross-validation, which is explained later. Finally, depending on the results obtained, we anticipated that a refinement of the default hyperparameters of each algorithm might be required.
2.7.1. Initial Modeling
We decided to make variations in each algorithm to perform adequate training and construction of the models, starting by dividing the datasets into different proportions for training and testing. The objective was to achieve different results in the metrics and find the best result. The splits were made in the proportions of 70–30% and 80–20% for the training and testing sets, which are the amounts of data commonly recommended [
30,
31,
32,
33]. The default hyperparameter configurations were applied, according to the respective documentation of the algorithms [
34].
2.7.2. Cross-Validation Modeling
The main objective of the cross-validation method is to evaluate and test the performance of an ML model. K-Fold is a cross-validation technique that proved useful in evaluating the models; it randomly divides the dataset into K groups (also known as “folds”), each containing an equal, or approximately equal, number of records. We trained each model with K-1 folds and validated it with the remaining fold, repeating the process until each fold had been used exactly once as a validation set, for a total of K training repetitions. K is usually chosen in the interval of 5 to 10; a higher K value can reduce bias but can also increase variance and the risk of overfitting [
35]. For this reason, we used two values of K, 5 and 8, to observe the models’ behavior.
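A K-Fold evaluation with K = 5 (the same pattern applies for K = 8) could be sketched as follows; RF is used here only as an example estimator, and X, y denote the prepared predictors and target.

```python
# K-Fold cross-validation sketch with K = 5; RF is only an example
# estimator, and X, y are the prepared predictors and target.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestRegressor(random_state=42), X, y,
                         cv=kf, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
```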
2.8. Model Evaluation
The evaluation phase of the CRISP-DM methodology is the part of the project life cycle in which the impact of the results on the study is analyzed. At this stage, the evaluation metrics must be selected to determine the accuracy and performance of the models. Important factors such as prediction accuracy, computational consumption, and execution speed, among others, must be considered to correctly compare all the models with the information obtained.
We trained all the algorithms except SVM, which, after several attempts on remote machines and on a high-capacity server (with 260 GB of RAM and 8 CPU cores), failed to complete the required computation in less than 10 h. For this reason, we decided to discard the SVM algorithm for the final models.
Different evaluation metrics in the field of data science are used to evaluate the performance of ML models. Regression problems aim to find a quantitative variable or a number (e.g., the price of a vehicle, the time spent by a runner, or in the case of this paper, the number of passengers at a station, at a certain time of day) [
36]. In regression, the following evaluation metrics are used: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), R-squared (R²), adjusted R-squared (adjusted R²), and Root Mean Square Error (RMSE) [12,17]. We decided to use the RMSE and MAE metrics in our work; MAPE was discarded because its usefulness depends on the scale being handled: this work had a large number of “Inputs” values equal to or close to 0, so the MAPE values could be inflated by division by these very small values. Furthermore, as a complement to the evaluation of the models, we included adjusted R², because this metric provided additional information.
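The selected metrics can be computed as follows; adjusted R² corrects R² for the sample size n and the number of predictors p, via adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1).

```python
# Computing RMSE, MAE, and adjusted R-squared for a regression model (sketch).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred, p):
    """p: number of predictors used by the model."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted R-squared
    return rmse, mae, adj_r2
```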
3. Results
The two datasets finally generated (Dataset 1 and Dataset 2), after a rigorous data preparation process (which is described in
Section 2.6 of this document), were registered on the Kaggle platform and are available for consultation [
37].
As we mentioned in
Section 2.5, we identified that some supervised ML algorithms could be used independently to create a model that predicted the required data using existing datasets, based on the references highlighted in the review of the available literature. That is, we used the most recommended algorithms in the literature identified in the systematic review (RF, XGBoost, LSTM, and SVM). In this way, we developed potential models with algorithms that had previously yielded adequate results, avoiding creating too many models with potentially inadequate results.
The values obtained for the selected metrics of the proposed models, using Dataset 1 and Dataset 2 and the three selected ML algorithms, are presented below. It is important to note (as mentioned in
Section 2.8) that the SVM algorithm was not taken into account, since the calculations took too long, even on machines with high capabilities.
The results are presented in tables; no graphics are used because the calculated data correspond to records at several BRT stations, throughout the daily service schedule and over a period of 3 weeks, which makes it very difficult to generate a graph that adequately visualizes the difference between predicted and real values.
When predicting passenger load using a machine learning model, correlations between different stations in a BRT system are likely to occur. Some of the reasons for this may be spatial correlation, temporal correlation, and correlation due to external factors. Spatial correlation can arise from geographic proximity (stations close to each other may have similar passenger load patterns) or from routes and travel patterns (passengers may travel between stations following a certain route or pattern). Temporal correlation can arise from daily or weekly patterns in passenger behavior or from specific events and activities. Finally, correlation due to external factors can arise from climate or weather conditions, holidays, and vacations.
Some of these potential correlations were considered in our work. For example, to minimize the risk of temporal correlation and external factors, a relatively long period of time was considered for data capture for the datasets considered in the models, considering at least three weeks, including holidays. It is worth noting that to avoid the impact of a potential correlation between stations, certain algorithms such as LSTM were proposed in the machine learning work, which are ideal for addressing sequence and time series problems, such as the prediction of passenger loading in a BRT system.
3.1. Initial Results of the Modeling for Dataset 1
Table 2 presents the best results obtained with Dataset 1. Some of the best results were obtained with the initial model, a certain algorithm (LSTM or RF), and a certain training and testing distribution (70–30% or 80–20%). Other best results were obtained with the cross-validation model and XGBoost algorithm.
3.2. Initial Results of the Modeling for Dataset 2
Table 3 presents the best results obtained with Dataset 2. Some of the best results were obtained with the initial model, a certain algorithm (LSTM), and a certain training and testing distribution (70–30% or 80–20%). Other best results were obtained with the cross-validation model and XGBoost algorithm.
After reviewing the data in
Table 2 and
Table 3, we concluded that the best model was model number 2 (
Table 2), which used the LSTM algorithm, Dataset 1, and proportions of 80% training and 20% testing, without cross-validation, because its metrics stood out above those of the other models. The metric values obtained for the models that used Dataset 2 (
Table 3) were not good.
However, although model number 2 had the best performance with the default hyperparameters among the models evaluated, we did not consider it suitable for making predictions (of the number of passengers per station, per schedule, on a specific date) on dates other than those used for the training and evaluation process. This is because we performed a prediction experiment for various dates (different from the training dates), diverse times, and different stations, using the three best models (models 1, 2, and 3 in
Table 2). The results were not satisfactory, since the differences between the expected and predicted values were significant (mainly for the models with the RF algorithm).
Considering this, we decided to perform the modeling and evaluation stages again, including the variation in hyperparameters with certain tools, to search for better values in the selected metrics. Additionally, we considered the option of varying the selected range of dates for the dataset generation. For this new stage of modeling and evaluation, only the two best models previously evaluated were considered.
3.3. Final Results of the Modeling for Dataset 1 After Applying Hyperparameter Variation
We made an initial adjustment to the hyperparameters only with Dataset 1. The Optuna tool was used to achieve the optimization of these hyperparameters (which were initially set with their default values) for each algorithm during training [
38]. Optuna is an automated search tool for optimizing the hyperparameters of ML models. This tool helps to identify optimal hyperparameters using different search methods [
39].
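An Optuna search for the LSTM hyperparameters could be sketched as follows; the search space mirrors the tuned values reported below, and build_and_train is a hypothetical helper that trains a model with the given parameters and returns a validation RMSE.

```python
# Optuna sketch for the LSTM hyperparameters; build_and_train is a
# hypothetical helper returning a validation RMSE for a parameter set.
import optuna

def objective(trial):
    params = {
        "time_steps": trial.suggest_int("time_steps", 3, 12),
        "units": trial.suggest_categorical("units", [50, 100, 150]),
        "units_2": trial.suggest_categorical("units_2", [50, 100, 150]),
        "optimizer": trial.suggest_categorical("optimizer", ["adam", "rmsprop"]),
        "epochs": trial.suggest_int("epochs", 10, 30),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64]),
    }
    return build_and_train(params)        # hypothetical: validation RMSE

study = optuna.create_study(direction="minimize")  # minimize RMSE
study.optimize(objective, n_trials=50)
print(study.best_params)
```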
Table 4 presents the best results of performing the evaluations for each model considering Dataset 1 in diverse training and testing ratios, with the values of the hyperparameters found with Optuna.
The values of the obtained metrics, presented in
Table 4, indicate that the model with the highest training efficiency using Dataset 1 was model number 1, which used the LSTM algorithm with proportions of 80% training and 20% testing, with new hyperparameter values (‘time_steps’: 9, ‘units’: 50, ‘units_2’: 100, ‘optimizer’: ‘adam’, ‘epochs’: 20, ‘batch_size’: 32). The value of the adjusted R² metric was much closer to 1 (the ideal value) than what was obtained with the initial models (
Table 2 and
Table 3).
3.4. Final Results of the Modeling for Dataset 1 After Applying Hyperparameter Variation and Input Date Variations
We initially used the date range from 6 August to 27 August (21 days in total) to create a single dataset. Varying the amount of data in the dataset, by expanding or reducing the date range, could influence obtaining better results. Therefore, four variations of the amount of data (by days included) were applied to the previously selected model (model number 1 in
Table 4), which was the LSTM model with proportions of 80% training and 20% evaluation data, with the optimized hyperparameters.
Table 5 presents the results obtained.
In the results presented in
Table 5, we identified that varying the number of days in the range over which the dataset was generated did not produce significant changes in the metrics obtained. However, among the four variations evaluated, the best metric values were those of model number 3, with a range of 31 days, because the decrease in adjusted R² was not high and the RMSE improved by a greater proportion.
4. Discussion
In general, the models developed with Dataset 2 did not obtain optimal values; they did not meet the success criteria established for this work. This suggests that the preparation phase performed for this dataset was not sufficiently effective.
The results obtained by the models used with cross-validation were also inadequate, especially those that used Dataset 2, although they were executed in less time than the models with Dataset 1. We observed that the metric values generally worsened significantly, especially for models using the LSTM algorithm. On the other hand, the model using five-fold cross-validation with the XGBoost algorithm performed best among the models using this algorithm, highlighting its key features: short execution time and low RAM consumption.
The initial models obtained better metric values, in general, than the models with cross-validation. The initial models using Dataset 1 obtained better metric values than those using Dataset 2. The models with the RF and LSTM algorithms stood out from the others: although they had a longer execution time and greater RAM use, their metric values were considerably better. The best results with RF and LSTM (in the initial models) were obtained using Dataset 1 with proportions of 80% training and 20% testing. In addition, the metrics obtained by the model with the LSTM algorithm, Dataset 1, and proportions of 70% training and 30% testing stood out too. These three models obtained the best metric values, despite consuming the most RAM and having execution times in the range of 3 to 6 min, which was considered moderate.
Another important point to highlight is the prediction and comparison of some values on dates different from those of the training data, to check the performance of the best models under evaluation. In this case, the three best initial models were used (models 1, 2, and 3 in
Table 2). The results showed that, for the models using the LSTM algorithm, the differences between the expected and predicted values were considerable in percentage terms, but in absolute terms the difference was not high. The model with the RF algorithm also presented considerable differences (between the expected and predicted values), in both percentage and absolute terms. Additionally, we observed that the RF model was not robust enough when faced with new data, compared to the data used in training.
The variation between the predicted and real values raised concerns regarding the aspects of the model that could be improved. The adjusted R² metric obtained values between 0.602 and 0.671, relatively far from the ideal value (1). After several analyses, we determined that the hyperparameters of the algorithms could be modified with tools such as grid search or Optuna, to evaluate the results obtained. We also considered it pertinent to change the range of dates over which the data were taken for training, to evaluate a possible improvement in the metrics.
After performing these processes, we determined that the use of the Optuna tool was ideal, reaching an adjusted R² value of 0.866, much closer to the ideal value than what was obtained with the three initial models. The other parameter adjustment tool and the modification of the training date range did not, on their own, generate better adjusted R² results. However, the option of using the Optuna tool and subsequently extending the dataset period from 21 to 31 days showed an improvement in the RMSE metric (decreasing from 11.427 to 10.56) and a non-significant decrease in adjusted R² (going from 0.866 to 0.8333). However, the time required to execute the adjusted model (with Optuna and the modified dataset period) increased considerably (from 5 min to more than 30 min).
The final selected model could be a good option for the personnel in charge of the Transmilenio BRT system to make predictions about the passenger load at the stations, at different times, on following days; however, there are other relevant factors to take into account. The validations present in the datasets do not exactly coincide with the number of passengers at a certain station in a certain period, since there is a considerable percentage of illegal or “sneaked” entries. These illegal entries are an additional factor not present in the existing data (they can represent an increase of 25% with respect to the detected validations), creating a gap between the values predicted by the developed models and the actual number of passengers.
Predicting the passenger load (with the level of certainty achieved in this work) for the following days of operation at key stations would allow a BRT system (similar to Transmilenio) to plan the required logistics (e.g., transportation frequencies, numbers of vehicles, and routes) to improve the transport system’s service (in terms of waiting times, travel times, and others). We clarify that possible improvements in route planning would be actions subsequent to the results of this work. BRT operators can review the information obtained from the predictions, compare it with current data, and subsequently adjust the route planning. It was not possible to consider the current route planning of the Transmilenio BRT system in the datasets used for training and evaluating the model, because the open data of the BRT system were captured using user validation systems (entry to the system). It is not feasible to include route and frequency data in a dataset generated from user entry systems.
The model was developed using data from the entire BRT service schedule, for every day of a considerable period of time (3 weeks), to adapt to predictions for any type of schedule (including peak travel periods). The model evaluation results show that for all predicted records, metrics close to the ideal were obtained, which means that the model operates adequately in any type of schedule, thanks to the appropriate training performed.
An important limitation of the work performed is that the developed model focuses specifically on the BRT system’s 10 key stations. The model was validated only for these stations; no validation was performed for the remaining stations. A validation of the created model, applied to the remaining stations, may not yield sufficiently accurate prediction results. However, the model should be verified and this is therefore proposed as future work in
Section 5 (Conclusions).
We clarify that the option of creating a model that considers only the 10 key stations was validated by BRT Transmilenio operations personnel. They considered that being able to make passenger load predictions at the 10 key stations represents a significant benefit to the system’s operation, as the volume of passengers handled by these stations is very significant.
Finally, it is very important to consider that in order to create a model that takes into account a larger number of stations (or ideally all stations), which could have much more accurate prediction results at any station, the amount of data in the dataset should grow considerably, limiting the efficiency in the pre-processing and training of the model that uses the said dataset.
5. Conclusions
This article focused on analyzing various options for predicting data sequences based on historical information. The data were related to the passenger load on a BRT system, which is relevant for adequate fleet programming and adequate service to end users. Through this comparative study, we identified that LSTM-based models stand out for offering an optimal balance between prediction accuracy, measured through key metrics such as RMSE, adjusted R², and MAE, and considerable efficiency in terms of runtime and RAM consumption. These findings demonstrate the relevance of models with the LSTM algorithm as a robust and viable solution for time series prediction, standing out for their ability to balance operational effectiveness and efficiency.
Using the ITS architecture taken as a reference (ARC-IT), we identified that an ideal system responsible for predicting passenger loads on a BRT system captures relevant data on vehicles, such as the number of passengers entering and exiting each car. However, in the BRT system analyzed, we identified that this data was not collected within vehicles; it was collected only from the stations.
We concluded that although Dataset 2 provided information on the number of entries to each station instantaneously (each record was an entry to the system), a certain amount of data was lost when performing the preparation and adjustment required to apply the model (15 min intervals at the critical stations of the system). This loss of data appears to be the reason for the low accuracy of the predictions. On the other hand, Dataset 1 did not have this problem, since its data were already grouped in that way. Using models with Dataset 1, better values were obtained for the metrics (RMSE, adjusted R², and MAE). For this reason, Dataset 1 was chosen for the final steps of refining the models.
The accuracy of passenger load prediction for Transmilenio’s BRT system was significantly improved through the application of the Optuna hyperparameter optimizer, an advanced tool that automates and optimizes the selection of the most suitable hyperparameters. In addition, we evaluated the strategy of selecting specific time intervals for model training. On its own, this variation of the training period did not significantly improve the metrics; however, a period of 31 days can be recommended (instead of the initial 21 days, to predict passenger loads for subsequent days or weeks), as it yielded an adequate set of metrics (a small decrease in the adjusted R² metric and an improvement in the RMSE metric).
Regarding future work, we consider it important to include additional fields in the generated dataset such as weather, the occurrence of massive events in the city (such as concerts and important sporting events), and traffic accidents. These fields can contribute significantly to improving the model’s predictions, because they can help, among other aspects, to minimize the possibility of correlation between stations due to temporal correlation and correlation due to external factors (explained in
Section 3). We also note the importance of future work considering the relevance of illegal entries into BRT transportation systems in the model, to make better approximations of passenger load at critical stations. Finally, as mentioned at the end of
Section 4, it is important to verify the results obtained by validating the model at all stations, not just the 10 stations used to compose the dataset. If the metric values are not satisfactory during these validations, increasing the number of stations included in the dataset could be considered.