From Raw GPS to GTFS: A Real-World Open Dataset for Bus Travel Time Prediction
Abstract
1. Summary
- The study presents a replicable and scalable framework for converting raw GPS traces into GTFS format, enabling adoption in data-scarce environments.
- It introduces the first GTFS-compatible, GPS-derived public bus dataset from Astana, Kazakhstan, addressing a major geographic gap in open transit data.
- The dataset supports segment-level modeling and analysis in the absence of GTFS-RT feeds, automatic passenger counters, or other advanced infrastructure.
- It provides a benchmark resource for developing offline prediction models and assessing public transit reliability in rapidly urbanizing cities.
2. Data Description
2.1. Raw GPS Data
2.2. The Processed Dataset
3. Methods
3.1. Obtaining GPS Data
3.1.1. Hardware Configuration
3.1.2. Location Service
- PRIORITY_HIGH_ACCURACY: Uses GPS satellites as the primary positioning source, supplemented by network-based positioning when satellite signals are degraded. This mode provides meter-level accuracy suitable for precise vehicle tracking, at the cost of higher power consumption. It suits urban transit applications where route adherence monitoring requires high spatial precision; power consumption is not a constraint here because the tablets are connected to the bus power supply.
- PRIORITY_BALANCED_POWER_ACCURACY: Employs a hybrid approach combining GPS, WiFi, and cellular positioning sources with intelligent switching based on signal availability and power optimization algorithms. Provides accuracy within 100 m while significantly reducing power consumption compared to high-accuracy mode.
- PRIORITY_LOW_POWER: Primarily relies on network-based positioning (WiFi and cellular towers) with minimal GPS utilization. Suitable for applications requiring general location awareness with extended battery life but insufficient for precise vehicle tracking applications.
- PRIORITY_PASSIVE: Utilizes only passively available location data from other applications without actively requesting position updates, resulting in minimal power consumption but unpredictable data availability.
- Latitude (φ): Geographic latitude in decimal degrees (−90° to +90°)
- Longitude (λ): Geographic longitude in decimal degrees (−180° to +180°)
- Velocity (v): Instantaneous speed in meters per second
- Timestamp (t): Unix timestamp of coordinate acquisition
- Accuracy metrics: Horizontal accuracy estimates provided by the positioning system
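The five fields listed above can be pictured as one record per location sample. The following is a minimal sketch, not the authors' implementation: the class name `GpsFix`, the field names, the accuracy unit (meters), and the sample coordinates are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpsFix:
    """One location sample holding the five fields listed above (names are hypothetical)."""

    lat: float         # latitude in decimal degrees
    lon: float         # longitude in decimal degrees
    speed_mps: float   # instantaneous speed in meters per second
    timestamp: int     # Unix timestamp (seconds) of coordinate acquisition
    accuracy_m: float  # horizontal accuracy estimate; meters is an assumed unit

    def is_valid(self) -> bool:
        """Check the documented coordinate ranges and basic sign constraints."""
        return (
            -90.0 <= self.lat <= 90.0
            and -180.0 <= self.lon <= 180.0
            and self.speed_mps >= 0.0
            and self.accuracy_m >= 0.0
            and self.timestamp > 0
        )


# Illustrative values only; they are not taken from the dataset.
fix = GpsFix(lat=51.1282, lon=71.4307, speed_mps=8.3,
             timestamp=1722470400, accuracy_m=4.0)
print(fix.is_valid())
```

A frozen dataclass keeps each sample immutable, so later cleaning steps have to create new records rather than silently altering raw ones.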
3.1.3. Data Quality Assurance
- d ≤ d_min (0.0 km), indicating duplicate measurements
- d > d_max (0.278 km), indicating physically impossible displacement given the 5-s sampling interval (an implied speed above roughly 200 km/h)
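The two thresholds can be applied to consecutive samples as in the sketch below. It assumes the displacement d is a great-circle (haversine) distance, which the text here does not state, and it flags d <= d_min as a duplicate because a distance can never be strictly below 0.0 km. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
D_MIN_KM = 0.0               # duplicate threshold from the text
D_MAX_KM = 0.278             # maximum plausible displacement per 5-s interval


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def classify_step(prev, curr):
    """Label the move from prev to curr, each given as a (lat, lon) tuple."""
    d = haversine_km(prev[0], prev[1], curr[0], curr[1])
    if d <= D_MIN_KM:
        return "duplicate"
    if d > D_MAX_KM:
        return "impossible"
    return "ok"


print(classify_step((51.1282, 71.4307), (51.1282, 71.4307)))  # same point twice
print(classify_step((51.1282, 71.4307), (51.1292, 71.4307)))  # about 0.11 km north
print(classify_step((51.1282, 71.4307), (51.1382, 71.4307)))  # about 1.1 km north
```

The upper bound corresponds to 0.278 km / 5 s × 3600 s/h ≈ 200 km/h, so any step above it implies a speed no city bus could reach.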
3.1.4. Data Transmission and Backend Integration
- Spatial coordinates: Validated latitude and longitude with accuracy metrics.
- Temporal information: ISO 8601 formatted timestamps for temporal synchronization.
- Vehicle identification: Unique identifiers linking coordinates to specific transit units.
- Operational metadata: Route assignments, schedule adherence metrics, and vehicle status indicators.
- Kinematic parameters: Velocity, bearing, and distance measurements for trajectory analysis.
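To make the payload categories above concrete, the sketch below assembles one record and converts the device's Unix timestamp to an ISO 8601 string. The key names and the nesting are hypothetical, since the actual backend schema is not given here; the device and route values reuse the examples from the segment-level table.

```python
import json
from datetime import datetime, timezone


def build_payload(device_id, route_id, lat, lon, accuracy_m,
                  speed_mps, bearing_deg, unix_ts):
    """Assemble one transmission record; every key name is an illustrative assumption."""
    # ISO 8601 timestamp in UTC, derived from the Unix timestamp.
    iso_ts = datetime.fromtimestamp(unix_ts, tz=timezone.utc).isoformat()
    return {
        "device_id": device_id,                  # vehicle identification
        "route_id": route_id,                    # operational metadata
        "timestamp": iso_ts,                     # temporal information
        "position": {"lat": lat, "lon": lon, "accuracy_m": accuracy_m},
        "kinematics": {"speed_mps": speed_mps, "bearing_deg": bearing_deg},
    }


payload = build_payload(262, 673, 51.1282, 71.4307, 4.0, 8.3, 90.0, 1722470400)
print(json.dumps(payload, indent=2))
```

Emitting the timestamp in UTC with an explicit offset avoids ambiguity when device and server clocks are configured for different time zones.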
3.1.5. Data Visualization
3.2. The Conversion of GPS Probe Data into GTFS Format
3.2.1. Trip-Level Information Extraction
3.2.2. Segment-Level Information Extraction
4. User Notes
- Findability: The dataset is assigned a unique and persistent identifier (DOI), ensuring it can be reliably located and cited.
- Accessibility: It is hosted on Zenodo, an open-access platform that provides unrestricted access to download and explore the data.
- Interoperability: The dataset is provided in widely accepted formats (.txt and .csv), facilitating seamless integration with various data analysis tools and programming environments.
- Reusability: Detailed metadata is included, and the dataset is shared under a Creative Commons Attribution 4.0 International (CC BY 4.0) License, permitting broad reuse and redistribution with appropriate credit.
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
GTFS | General Transit Feed Specification |
GTFS-RT | GTFS Realtime |
GPS | Global Positioning System |
ITS | Intelligent Transportation Systems |
APCs | Automatic Passenger Counters |
LSTM | Long Short-Term Memory |
No. | File Name | Size | Description |
---|---|---|---|
1 | gtfs_data.zip | 8.6 MB | Contains GTFS files in a standardized format. |
2 | segment_level_data.csv.zip | 20.55 MB | Contains detailed operational data for individual bus trip segments. |
3 | gps_data.csv.zip | 2.32 GB | Contains raw GPS data collected from public buses. |
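Because all three files are zip archives of delimited text, they can be read without unpacking them to disk. The helper below is a minimal sketch using only the Python standard library; the function name is hypothetical, and the member name `stops.txt` is an assumption based on the GTFS specification rather than a confirmed listing of `gtfs_data.zip`. The demonstration runs on a small in-memory archive with made-up rows.

```python
import csv
import io
import zipfile


def read_table_from_zip(zip_source, member_name):
    """Return the rows of one CSV/TXT member of a zip archive as a list of dicts."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member_name) as raw:
            # utf-8-sig strips a byte-order mark if the file starts with one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


# Build a tiny stand-in archive in memory; the rows are invented for the demo.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("stops.txt", "stop_id,stop_name\n101,Example Stop A\n102,Example Stop B\n")
buffer.seek(0)

rows = read_table_from_zip(buffer, "stops.txt")
print(rows[0]["stop_id"], rows[0]["stop_name"])
```

For the 2.32 GB raw GPS archive, loading every row into a list is unlikely to be practical; iterating over the reader row by row, or reading in chunks, would be the safer choice.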
Attribute | Description | Example |
---|---|---|
id | Unique record identifier | 877965 |
route_id | Route identifier | 673 |
trip_id | Unique trip identifier | 1 |
date | Date on which the trip was recorded | 1 August 2024 |
deviceid | Unique identifier for the bus (assigned to the GPS tracking device) | 262 |
direction | Direction of travel (e.g., 1 = outbound, 2 = inbound) | 1 |
segment | Route segment number | 1.0 |
start_point | Identifier of the bus stop where the segment begins | 101 |
end_point | Identifier of the bus stop where the segment ends | 102 |
start_time | Time when the bus departs from the starting stop of the segment | 06:39:49 |
run_time_in_seconds | Time taken to travel between two stops within a segment | 69 |
dwell_time_in_seconds | Duration the bus remains stationary at a stop before departing | 74 |
arrival_time | Time when the bus arrives at the end stop of the segment | 06:40:58 |
departure_time | Time when the bus leaves the stop after dwelling | 06:42:12 |
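The example values in the table suggest two relationships between the time columns: arrival_time = start_time + run_time_in_seconds, and departure_time = arrival_time + dwell_time_in_seconds. These are inferred from the single example row and the column descriptions, not from a stated rule, so the sketch below should be read as a plausibility check under that assumption. The helper names are hypothetical.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S"


def add_seconds(clock, seconds):
    """Add a number of seconds to an HH:MM:SS string and return HH:MM:SS."""
    shifted = datetime.strptime(clock, TIME_FORMAT) + timedelta(seconds=seconds)
    return shifted.strftime(TIME_FORMAT)


def is_consistent(row):
    """Check the assumed relations between start, run, arrival, dwell and departure."""
    arrival = add_seconds(row["start_time"], int(row["run_time_in_seconds"]))
    departure = add_seconds(arrival, int(row["dwell_time_in_seconds"]))
    return arrival == row["arrival_time"] and departure == row["departure_time"]


# Values copied from the Example column of the table above.
example = {
    "start_time": "06:39:49",
    "run_time_in_seconds": "69",
    "dwell_time_in_seconds": "74",
    "arrival_time": "06:40:58",
    "departure_time": "06:42:12",
}
print(is_consistent(example))  # True: 06:39:49 + 69 s = 06:40:58, then + 74 s = 06:42:12
```

The check assumes clock times below 24:00:00; GTFS-style times past midnight such as 25:10:00 would need a custom parser.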
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mansurova, A.; Mussina, A.; Aubakirov, S.; Nugumanova, A.; Yedilkhan, D. From Raw GPS to GTFS: A Real-World Open Dataset for Bus Travel Time Prediction. Data 2025, 10, 119. https://doi.org/10.3390/data10080119