Abstract
Public transit demand forecasting is a foundational component of sustainable urban mobility, enabling efficient operation, equitable service provision, and planning of public transit systems. Urban imagery, such as aerial images, contains rich information about urban sociodemographic characteristics and the built environment, offering particular value for data-scarce regions where conventional datasets are limited or outdated. However, there is limited research on using these images for public transit demand forecasting. This study introduces a deep learning approach for predicting transit ridership using aerial images. The method employs an encoder–decoder architecture to functionally separate image-derived latent representations into sociodemographic and physical environment vectors, which are subsequently used as inputs to a neural network for ridership prediction. Using data from Seoul, South Korea, the effectiveness of the proposed method is evaluated against three baseline configurations. The results show that the sociodemographic latent vector captures spatially organized residential characteristics, while the physical environment vector encodes distinct urban landscape patterns such as dense housing, traditional street grids, open spaces, and natural environments. The proposed model, which uses only imagery-derived latent features, substantially outperforms the pure image baseline and narrows the performance gap with census-informed models, reducing sMAPE by 25–60% depending on the mode. Combining imagery with census variables yields the highest accuracy, confirming their complementary nature. These findings highlight the potential of imagery-based approaches as a scalable, cost-efficient, and sustainable tool for data-driven transit planning.
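For readers unfamiliar with the reported metric, sMAPE (symmetric mean absolute percentage error) can be computed as sketched below. This is a minimal illustration of one common definition, in which the absolute error is divided by the mean of the absolute actual and predicted values. The abstract does not state which sMAPE variant the study uses, so the function name `smape`, the 0 to 200 percent scaling, and the handling of zero denominators are assumptions made here for illustration only.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (range 0 to 200).

    Assumed variant: mean of |p - a| / ((|a| + |p|) / 2), times 100.
    Pairs where both values are zero contribute zero error (an assumption).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("inputs must not be empty")

    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom == 0.0:
            # Both observed and predicted ridership are zero: no error.
            continue
        total += abs(p - a) / denom
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical ridership counts, not data from the study.
    observed = [1200, 850, 430]
    forecast = [1100, 900, 500]
    print(f"sMAPE: {smape(observed, forecast):.2f}%")
```

Because the denominator uses both the observed and the predicted value, this form is bounded and treats over- and under-prediction more evenly than plain MAPE, which is one reason it is often preferred when ridership counts vary widely across locations.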