Proceeding Paper

Applying Transformer-Based Dynamic-Sequence Techniques to Transit Data Analysis †

Department of Civil & Environmental Engineering, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
* Author to whom correspondence should be addressed.
Presented at the 2025 Suwon ITS Asia Pacific Forum, Suwon, Republic of Korea, 28–30 May 2025.
Eng. Proc. 2025, 102(1), 12; https://doi.org/10.3390/engproc2025102012
Published: 7 August 2025

Abstract

Transit systems play a vital role in urban mobility, yet predicting individual travel behavior within these systems remains a complex challenge. Traditional machine learning approaches struggle with transit trip data because each trip may consist of a variable number of transit legs, leading to missing data and inconsistencies when using fixed-length tabular representations. To address this issue, we propose a transformer-based dynamic-sequence approach that models transit trips as variable-length sequences, allowing for flexible representation while leveraging the power of attention mechanisms. Our methodology constructs trip sequences by encoding each transit leg as a token, incorporating travel time, mode of transport, and a 2D positional encoding based on grid-based spatial coordinates. By dynamically skipping missing legs instead of imputing artificial values, our approach maintains data integrity and prevents bias. The transformer model then processes these sequences using self-attention, effectively capturing relationships across different trip segments and spatial patterns. To evaluate the effectiveness of our approach, we train the model on a dataset of urban transit trips and predict first-mile and last-mile travel times. We assess performance using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Experimental results demonstrate that our dynamic-sequence method yields up to a 30.96% improvement in accuracy compared to non-dynamic methods while preserving the underlying structure of transit trips. This study contributes to intelligent transportation systems by presenting a robust, adaptable framework for modeling real-world transit data. Our findings highlight the advantages of self-attention-based architectures for handling irregular trip structures, offering a novel perspective on a data-driven understanding of individual travel behavior.

1. Introduction

Public transit systems play a critical role in sustaining the mobility of densely populated cities. However, accurately predicting travel behavior surrounding the use of public transit remains a challenge due to the inherent variability in trip structures. As illustrated in Figure 1, each individual trip may consist of a variable number of transit legs, which leads to missing data and inconsistencies when using fixed-length tabular representations. Traditional machine learning approaches struggle with such variability in sequential travel data, often requiring extensive preprocessing and imputation to accommodate incomplete records. For example, methods such as ARIMA [1], Kalman filters [2], and even more recent techniques like random forest [3] generally assume fixed-length inputs and are thus not well-equipped to handle the irregularities inherent in transit data.
To address these challenges, we propose a transformer-based dynamic-sequence approach that models transit trips as variable-length sequences. By leveraging the power of attention mechanisms, transformer models enable the flexible representation of each trip, preserving both spatial and temporal context [4]. In our framework, each transit leg is encoded as a token that integrates travel time, a mapped mode indicator, and a 2D positional encoding derived from grid-based spatial coordinates. This novel tokenization not only mitigates the issues arising from missing or incomplete data but also enables the model to capture long-range dependencies across different segments of a trip.
In summary, the contributions of this paper are as follows:
  • Dynamic-Sequence Modeling: We develop a transformer-based model that effectively handles variable-length transit trip data.
  • Travel Data Tokenization: We introduce a tokenization method that integrates travel time, mode information, and grid-based positional encoding, capturing complex spatiotemporal patterns inherent in travel data.
  • Independent Regression Pipelines: We independently predict first-mile and last-mile travel times, evaluated with Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), demonstrating superior performance compared to traditional approaches.
The next section describes the dataset characteristics and the preprocessing techniques used to handle variable-length sequences and missing data. Following that, we explain our methodology, including the token generation process, transformer model architecture, and training procedures. Finally, the Results and Implications section presents our findings and discusses their significance for modern travel data analysis.

2. Data

First- and last-mile travel data can be considered microscopic, as they record individual movements. Thus, the collection of such data must adhere to stringent privacy regulations. For this research, 2021 Individual Travel Survey Data provided by the Korea Transport Database [5] was utilized. This dataset contained encrypted survey data for over 350,000 individual single-purpose travel records. Data such as travel purpose, travel modes, and origin-destination (OD) coordinates for each transit leg were recorded. Among this data, only travel records that had their OD coordinates in Seoul, South Korea, were considered.
Detailed spatial data such as OD coordinates were aggregated to satisfy privacy requirements and enable standardized positional encoding for model implementation. Using administrative boundary data from Statistics Korea [6], we divided the study area into a 13 × 11 grid of 3 km × 3 km cells. Figure 2 shows this spatial aggregation scheme: each cell is assigned a positional label from (1,1) to (13,11), for a total of 143 cells.
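The coordinate-to-cell aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the grid origin, the projected coordinate system, and the `to_cell` helper are assumptions; only the 13 × 11 layout of 3 km × 3 km cells labeled (1,1) to (13,11) comes from the text.

```python
# Hypothetical sketch of mapping a projected OD coordinate (in metres)
# to a grid cell label, assuming a lower-left study-area origin at (X0, Y0).
CELL_SIZE = 3_000          # 3 km x 3 km cells (from the paper)
GRID_COLS, GRID_ROWS = 13, 11
X0, Y0 = 0.0, 0.0          # assumed origin of the study area

def to_cell(x, y):
    """Return the (col, row) label, from (1, 1) up to (13, 11)."""
    col = int((x - X0) // CELL_SIZE) + 1
    row = int((y - Y0) // CELL_SIZE) + 1
    if not (1 <= col <= GRID_COLS and 1 <= row <= GRID_ROWS):
        raise ValueError("coordinate outside the study area")
    return (col, row)
```

Any coordinate inside the same 3 km cell maps to the same label, which both satisfies the privacy aggregation requirement and yields a discrete position for encoding.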
The data was then split into three separate tables:
  • First-mile (FM) data: data regarding the first transit leg before public transit use.
  • Public Transit (PT) data: data recording one or more consecutive public transit uses.
  • Last-mile (LM) data: data relevant to the final transit leg after public transit use.
Specifically, PT data was processed to reveal detailed spatiotemporal characteristics regarding each transit leg, and the target for prediction was set as FM and LM travel time data. The methodology used will be illustrated in the following section.
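The three-way FM/PT/LM split described above can be sketched as follows. This is a hedged illustration: the mode labels, the tuple layout, and the `split_trip` helper are assumptions introduced for clarity, not the paper's actual data pipeline.

```python
# Hypothetical sketch: split one trip's ordered legs into first-mile,
# public-transit, and last-mile parts around the consecutive PT legs.
PT_MODES = {"bus", "subway"}  # assumed public-transit mode labels

def split_trip(legs):
    """legs: list of (mode, travel_time_min) tuples in travel order."""
    first_pt = next(i for i, (m, _) in enumerate(legs) if m in PT_MODES)
    last_pt = max(i for i, (m, _) in enumerate(legs) if m in PT_MODES)
    fm = legs[:first_pt]            # access legs before public transit
    pt = legs[first_pt:last_pt + 1] # consecutive public-transit legs
    lm = legs[last_pt + 1:]         # egress legs after public transit
    return fm, pt, lm
```

For a trip walk → bus → subway → walk, this yields one FM leg, two PT legs, and one LM leg, with the FM and LM travel times serving as the prediction targets.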

3. Methodology

Our approach to predicting first-mile and last-mile travel times from public transit data is based on a transformer architecture specifically designed to handle the inherent variability in transit trip records. The methodology is organized into three key components: dynamic-sequence generation through tokenization and positional encoding, a transformer model architecture for regression, and a comprehensive training and evaluation procedure.
In the tokenization process, each trip is converted into a sequence of tokens, with each token representing critical information from one leg of the journey. Each token is constructed by combining three elements: the travel time, the travel mode, and a positional encoding generated using sinusoidal functions. This design, illustrated in Figure 3, ensures that the model captures both the temporal and spatial aspects of each transit leg.
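A token of this kind could be assembled as in the sketch below. The exact vector layout, the encoding dimension, and the split of the sinusoidal code between grid column and row are assumptions; only the three ingredients (travel time, mode indicator, 2D sinusoidal positional encoding over grid coordinates) come from the text.

```python
import math

def pos_encoding_2d(col, row, d_model=8):
    """Sinusoidal 2D positional encoding: the first half of the vector
    encodes the grid column, the second half the grid row (assumed layout)."""
    half = d_model // 2
    pe = []
    for pos in (col, row):
        for i in range(half // 2):
            freq = 1.0 / (10_000 ** (2 * i / half))
            pe.append(math.sin(pos * freq))
            pe.append(math.cos(pos * freq))
    return pe

def make_token(travel_time_min, mode_id, col, row):
    # Token = [travel time, mapped mode indicator] + 2D positional encoding
    return [travel_time_min, float(mode_id)] + pos_encoding_2d(col, row)
```

Each transit leg then contributes one such token, and a trip becomes a variable-length sequence of tokens.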
Because transit data often suffers from missing spatiotemporal information [7], our method also incorporates special tokens. Start tokens are generated from the starting grid coordinates of the first public transit leg, while end tokens are generated from the ending grid coordinates of the last leg. When grid data is missing, default values are used to preserve the dynamic-sequence format. This comprehensive tokenization strategy allows the transformer architecture to leverage all available information for data-driven prediction.
Transformer-based regressors leverage the power of self-attention to capture long-range dependencies and complex interactions within data [8]. In our case, the transformer model is built upon a stack of transformer encoder layers, each of which integrates a multi-head self-attention mechanism with a position-wise feed-forward network. Residual connections and layer normalization are used throughout to ensure stable gradients and robust learning.
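The self-attention step at the heart of each encoder layer can be shown in miniature. This didactic sketch uses identity Q/K/V projections and a single head, so it is not the trained model's parameterization, only the scaled dot-product mechanism itself.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity
    Q/K/V projections (didactic sketch, not the trained model)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # one attention weight per token in the trip
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out
```

Because every token attends to every other token, legs far apart in a trip can directly influence each other's representation, which is what lets the model capture long-range dependencies.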
Moreover, as our input sequences are of variable length, we introduce a masked mean pooling operation following previous literature [9]. This mechanism computes the average representation over only the valid, non-padded tokens, resulting in a fixed-dimensional vector that encapsulates the entire trip.
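Masked mean pooling itself is simple enough to show directly; the sketch below (with an assumed list-of-vectors layout) averages only over valid positions so padding tokens never dilute the trip representation.

```python
def masked_mean_pool(seq, mask):
    """Average token vectors over valid (non-padded) positions only.
    seq: list of token vectors; mask: 1 for a valid token, 0 for padding."""
    n = sum(mask)
    dim = len(seq[0])
    pooled = [0.0] * dim
    for vec, m in zip(seq, mask):
        if m:
            for j in range(dim):
                pooled[j] += vec[j] / n
    return pooled
```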
This pooled vector is then passed through a regression head, which consists of one or more dense layers with non-linear activations and dropout for regularization. The final output layer produces a single continuous value corresponding to the predicted travel time, whether for the first-mile or last-mile segment. The modular nature of this design allows us to train independent models for both first- and last-mile predictions, ensuring each model is finely tuned to its specific target.

4. Results and Implications

ARIMA was employed as a benchmark to evaluate our model because it represents a well-established, classical statistical approach for time series forecasting. Although ARIMA models are traditionally limited to univariate predictions and require extensive preprocessing to handle non-stationary data, they have long served as a baseline in many forecasting studies. By comparing our transformer-based predictions with those of an ARIMA model, we can objectively assess the improvements offered by our advanced, deep learning approach in capturing complex spatiotemporal patterns inherent in transit data.
As can be seen in Table 1, transformer-based models outperformed their ARIMA counterparts for both first-mile and last-mile travel time predictions. Specifically, the transformer model for first-mile travel time (T_FMTT) achieved an MAE of 3.2888 min and an RMSE of 5.1985, while the best-performing ARIMA model found through a grid search (A_FMTT) recorded higher errors, with an MAE of 4.1599 min and an RMSE of 7.5323. Similarly, for last-mile travel time prediction, the transformer model (T_LMTT) yielded an MAE of 4.9373 min and an RMSE of 8.1168, compared to the best-performing ARIMA model (A_LMTT), which had an MAE of 7.1521 min and an RMSE of 10.4880. These results correspond to MAE improvements of approximately 20.94% for the first-mile prediction and 30.96% for the last-mile prediction, suggesting that the transformer-based approach better captures the underlying spatiotemporal dynamics of the transit data and therefore yields more accurate predictions.
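The relative improvements quoted above follow directly from the Table 1 MAE values; the quick check below re-derives them (small differences in the last decimal place may arise from intermediate rounding in the paper).

```python
# Re-deriving the reported relative MAE improvements from Table 1 (minutes).
mae = {"T_FMTT": 3.2888, "A_FMTT": 4.1599,
       "T_LMTT": 4.9373, "A_LMTT": 7.1521}

def improvement(transformer_mae, arima_mae):
    """Relative reduction in MAE versus the ARIMA baseline."""
    return (arima_mae - transformer_mae) / arima_mae

fm_gain = improvement(mae["T_FMTT"], mae["A_FMTT"])  # ~0.209, i.e. ~20.9%
lm_gain = improvement(mae["T_LMTT"], mae["A_LMTT"])  # ~0.310, i.e. ~31.0%
```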
The implications of our findings are significant for public transit systems. Enhanced prediction accuracy enables more precise scheduling, reduces passenger waiting times, and improves resource allocation, all of which contribute to greater operational efficiency. Transit agencies can leverage these insights to develop more resilient and adaptive service plans, ultimately increasing passenger satisfaction and fostering more effective urban mobility management. Furthermore, well-trained models capable of accurately forecasting individual travel characteristics pave the way for integrating these predictions with larger datasets. For example, data fusion techniques [10] could be utilized to generate more granular and detailed travel information at a larger scale than was previously available.

Author Contributions

Conceptualization, B.C.; data curation, B.C.; writing—original draft preparation, B.C.; writing—review and editing, D.-K.K.; supervision, D.-K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the Korea Ministry of Land, Infrastructure and Transport (MOLIT) through the Innovative Talent Education Program for Smart City.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data used for the analysis are provided by the Korea Transport Database (KTDB) and available upon request online: www.ktdb.go.kr/www/contents.do?key=202 (accessed on 3 August 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Suwardo, W.; Napiah, M.; Kamaruddin, I. ARIMA models for bus travel time prediction. IEM J. 2010, 71, 49–58. [Google Scholar]
  2. Liu, H.; Zuylen, H.V.; Lint, H.V.; Salomons, M. Predicting urban arterial travel time with state-space neural networks and Kalman filters. Transp. Res. Rec. 2006, 1968, 99–108. [Google Scholar]
  3. Cheng, L.; Chen, X.; De Vos, J.; Lai, X.; Witlox, F. Applying a random forest method approach to model travel mode choice behavior. Travel Behav. Soc. 2019, 14, 1–10. [Google Scholar]
  4. Hong, Y.; Martin, H.; Raubal, M. How do you go where? improving next location prediction by learning travel mode information using transformers. In Proceedings of the 30th International Conference on Advances in Geographic Information Systems (SIGSPATIAL ′22), Seattle, WA, USA, 1–4 November 2022; pp. 1–10. [Google Scholar]
  5. Korea Transport Database (KTDB). Available online: www.ktdb.go.kr/www/contents.do?key=202 (accessed on 10 July 2024).
  6. Statistics Korea (KOSTAT). Korea Administrative District Boundary Data. Available online: https://sgis.kostat.go.kr/view/pss/openDataIntrcn (accessed on 10 July 2024).
  7. Park, J.-H.; Kim, S.-G.; Cho, C.-S.; Heo, M.-W. The study on error, missing data and imputation of the smart card data for the transit OD construction. J. Korean Soc. Transp. 2008, 26, 109–119. [Google Scholar]
  8. Grigsby, J.; Wang, Z.; Nguyen, N.; Qi, Y. Long-range transformers for dynamic spatiotemporal forecasting. arXiv 2021, arXiv:2109.12218. [Google Scholar]
  9. Hou, L.; Geng, Y.; Han, L.; Yang, H.; Zheng, K.; Wang, X. Masked Token Enabled Pre-Training: A Task-Agnostic Approach for Understanding Complex Traffic Flow. IEEE Trans. Mobile Comput. 2024, 23, 11121–11132. [Google Scholar]
  10. Kusakabe, T.; Yasuo, A. Behavioural data mining of transit smart card data: A data fusion approach. Transp. Res. Part C Emerg. 2014, 46, 179–191. [Google Scholar]
Figure 1. Comparison of tabular and dynamic-sequence travel data.
Figure 2. Spatial aggregation of data into 3 km × 3 km cells.
Figure 3. Tokenization of public transit data.
Table 1. First- and last-mile prediction results.
            T_FMTT    T_LMTT    A_FMTT    A_LMTT
MAE (min)   3.2888    4.9373    4.1599    7.1521
RMSE        5.1985    8.1168    7.5323   10.4880

Choo, B.; Kim, D.-K. Applying Transformer-Based Dynamic-Sequence Techniques to Transit Data Analysis. Eng. Proc. 2025, 102, 12. https://doi.org/10.3390/engproc2025102012
