3.2. Model Hyperparameters
The generator accepts a tensor of shape 3 × 3 × 2 as input, comprising the actual flow of individuals at the previous time step and random noise of identical dimensions. First, a 1 × 1 convolutional layer with 8 channels and padding set to ‘same’ fuses features across channels without altering the spatial dimensions. This is followed by batch normalization and a tanh activation function, which constrains the output to the range [−1, 1]. The features are then flattened into a vector and passed through two fully connected layers with 32 and 16 hidden units, respectively, each followed by batch normalization and tanh activation, enabling interaction and compression of global information. A final fully connected layer outputs a 9-dimensional vector, which is reshaped into a 3 × 3 × 1 prediction matrix.
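To make the shape flow of this layer stack concrete, the following NumPy sketch traces the generator's forward pass with random placeholder weights (not trained parameters); batch normalization is omitted for brevity, and the layer sizes follow the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """A 1x1 convolution is a per-pixel matmul over the channel axis.
    x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out); spatial size unchanged."""
    return x @ w

def generator(cond_and_noise):
    """Forward pass with random placeholder weights (illustration only)."""
    h = np.tanh(conv1x1(cond_and_noise, rng.normal(size=(2, 8))))  # 3x3x8, in [-1, 1]
    h = h.reshape(-1)                                              # flatten to 72
    h = np.tanh(h @ rng.normal(size=(72, 32)))                     # 32 hidden units
    h = np.tanh(h @ rng.normal(size=(32, 16)))                     # 16 hidden units
    out = h @ rng.normal(size=(16, 9))                             # 9-dim output vector
    return out.reshape(3, 3, 1)                                    # 3x3x1 prediction

x = rng.normal(size=(3, 3, 2))   # previous-step flow channel + noise channel
print(generator(x).shape)        # (3, 3, 1)
```

Because the convolution kernel is 1 × 1, it reduces to a matrix multiplication applied independently at each of the nine grid cells, which is exactly the "cross-channel fusion without changing spatial size" described above.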
For the discriminator, we determined the number of convolution channels, the number of fully connected units, the activation function, and the normalization method to best distinguish real from generated samples. The discriminator takes a tensor of shape 3 × 3 × 2 (a conditional channel and a sample channel) as input. A 1 × 1 convolutional layer (8 output channels, padding = ‘same’) fuses information only along the channel dimension without changing the spatial size, allowing the model to quickly learn the local correspondence between conditions and samples; batch normalization is then applied to stabilize training and accelerate convergence. The convolutional feature map is flattened into a one-dimensional vector and passed through two fully connected layers with 32 and 16 hidden units, respectively. Each layer uses the leaky rectified linear unit (Leaky ReLU, α = 0.2) activation function, which retains negative information and prevents neurons from dying, while batch normalization helps alleviate internal covariate shift. Finally, a classification layer with 2 output units and softmax activation outputs the probabilities that a sample is “real” or “generated”.
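The discriminator's forward pass can be sketched the same way. The NumPy fragment below uses random placeholder weights and omits batch normalization for brevity; the Leaky ReLU slope and layer widths follow the text, and applying Leaky ReLU directly after the 1 × 1 convolution is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def leaky_relu(x, alpha=0.2):
    """Leaky ReLU: keeps a small slope alpha for negative inputs."""
    return np.where(x > 0, x, alpha * x)

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def discriminator(cond_and_sample):
    """Forward pass with random placeholder weights (illustration only)."""
    h = leaky_relu(cond_and_sample @ rng.normal(size=(2, 8)))  # 1x1 conv -> 3x3x8
    h = h.reshape(-1)                                          # flatten to 72
    h = leaky_relu(h @ rng.normal(size=(72, 32)))              # 32 hidden units
    h = leaky_relu(h @ rng.normal(size=(32, 16)))              # 16 hidden units
    return softmax(h @ rng.normal(size=(16, 2)))               # [P(real), P(generated)]

p = discriminator(rng.normal(size=(3, 3, 2)))
print(p, p.sum())  # two non-negative probabilities summing to 1
```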
To ensure a fair evaluation of the effectiveness of the conditional input, the control-group model is identical to this model in overall architecture and in the hyperparameter configuration of every layer. The sole distinction is that its input comprises only a single-channel 3 × 3 tensor and does not include the historical flow condition channel.
3.3. Simulation Results
In this study, we used the root-mean-square error (RMSE) to evaluate the model. Specifically, for each training batch, the generator uses the historical passenger flow at the previous time step (t − 1) as the conditional input and generates a 3 × 3 grid passenger flow prediction matrix for time t. After inverse normalization, both the prediction matrix and the true observation matrix are flattened, the RMSE of the batch is computed, and all batch results are averaged to obtain the model’s mean prediction error over the entire data set. This approach not only quantifies the prediction deviation of individual batches but also suppresses occasional fluctuations through between-batch averaging, making the evaluation more robust and comparable. The control group uses the same RMSE procedure, except that it compares a 3 × 3 passenger flow matrix generated from random noise alone with the true passenger flow matrix, without historical conditional input.
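The per-batch RMSE and its average over batches can be computed as follows; the toy matrices are synthetic numbers chosen only to make the arithmetic easy to check, not data from the study.

```python
import numpy as np

def batch_rmse(pred, true):
    """RMSE of one batch after flattening each 3x3 matrix."""
    diff = pred.reshape(len(pred), -1) - true.reshape(len(true), -1)
    return float(np.sqrt(np.mean(diff ** 2)))

def mean_rmse(preds, trues):
    """Average the per-batch RMSEs over the whole data set."""
    return float(np.mean([batch_rmse(p, t) for p, t in zip(preds, trues)]))

# Toy example: two batches of four 3x3 matrices each.
preds = [np.full((4, 3, 3), 2.0), np.full((4, 3, 3), 5.0)]
trues = [np.zeros((4, 3, 3)), np.zeros((4, 3, 3))]
print(mean_rmse(preds, trues))  # (2 + 5) / 2 = 3.5
```

Averaging the per-batch RMSEs, rather than pooling all errors into one RMSE, is what gives the between-batch smoothing effect described above.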
Figure 4a shows the RMSE curves of the GAN and the control-group model in the Shilin area over training epochs. The blue curve represents the model developed in this study and the orange curve the control-group model. The vertical axis shows RMSE, and the standard deviation (SD) of each model’s RMSE is marked in the upper right corner of the chart. The RMSE of the developed model remains in the range of roughly 200–300, while that of the control group lies mostly between 350 and 500, meaning the developed model has a significantly smaller average error and higher accuracy in each short-term passenger flow prediction.
Along the epoch axis, the developed model’s RMSE drops rapidly and stabilizes within the first 5–10 epochs, whereas the control group’s does not decrease appreciably until after 30 epochs. In other words, the model reaches its best performance after fewer training iterations, with lower training cost and a shorter development cycle. The SD of the model’s RMSE is 34.6 (vs. 50.9 for the control group), indicating that its performance fluctuates less across epochs. A lower SD means the model adapts more evenly to the training data and is less prone to large error fluctuations caused by batch differences or noise, which improves the repeatability and reliability of the prediction results.
The lower average error, faster convergence speed, and smaller performance fluctuations all indicate that our GAN is not only more accurate in short-term crowd flow prediction tasks, but also more efficient and robust.
Figure 4b shows the RMSE curves of the developed GAN model and the control model, averaged over 10 independently trained versions, across training epochs in the Shilin area. The average RMSE of the developed model falls between roughly 210 and 270, while that of the control model is mostly concentrated between 420 and 500, indicating that the GAN proposed in this study maintains significantly lower prediction errors under various settings. In terms of convergence behavior, the developed model’s RMSE dropped and stabilized within the first 5 epochs, whereas the control model requires longer training before a notable performance improvement appears, showing that the developed architecture converges faster. In addition, the developed model’s RMSE fluctuated less across the ten versions (an SD of 37.0), much smaller than the control group’s 61.0. Thus, even when averaged over ten versions, the method shows a consistently superior trend, verifying its effectiveness.
Figure 5 and Figure 6 illustrate the training processes for the Xinyi Business District and the Banqiao Houzhan Business District using our method and the control method, respectively. The results are consistent with those of Figure 4 for the Shilin Business District: the RMSE of the developed model, whether for a single model or the average of 10 models, converges faster than that of the control group and reaches a lower value. This further confirms the effectiveness of the developed model.
In the final experimental region, the Zhongxiao Business District, results diverged from those observed in other areas, warranting specific discussion. Between the 10th and 25th training epochs, the model’s RMSE temporarily increased to nearly 1000, reducing the performance gap with the control group. This fluctuation reflects the highly variable pedestrian flow in Zhongxiao, characterized by distinct peak periods during lunch hours and substantial evening traffic driven by dining and entertainment activities. Moreover, heterogeneous sources of foot traffic—such as MRT stations, shopping centers, and office buildings—contribute to significant spatial variability in flow levels across different blocks at the same time.
During training, the generator captures these complex patterns, occasionally modeling secondary peaks and atypical scenarios. This exploratory behavior leads to transient overfitting and RMSE instability. However, after the 25th epoch, the model reorients its learning trajectory, and the RMSE declines to below 700, demonstrating its capacity for self-correction. Despite mid-training fluctuations, the model consistently achieves a lower average error than the control group, underscoring its effectiveness in handling dynamic urban environments.
Figure 7b presents the average RMSE across ten independently initialized variants of the GAN and its control-group counterpart in the Zhongxiao Business District, plotted against the number of training epochs. The results corroborate the findings in Figure 4b: the proposed model converges more rapidly to a stable performance level and maintains a consistently low error across repeated trials, indicating superior generalization capability.