Traffic Prediction with Self-Supervised Learning: A Heterogeneity-Aware Model for Urban Traffic Flow Prediction Based on Self-Supervised Learning

Abstract: Accurate traffic prediction is pivotal when constructing intelligent cities to enhance urban mobility and to efficiently manage traffic flows. Traditional deep learning-based traffic prediction models primarily focus on capturing spatial and temporal dependencies, thus overlooking the existence of spatial and temporal heterogeneities. Heterogeneity is a crucial inherent characteristic of traffic data for the practical applications of traffic prediction. Spatial heterogeneities refer to the differences in traffic patterns across different regions, e.g., variations in traffic flow between office and commercial areas. Temporal heterogeneities refer to the changes in traffic patterns across different time steps, e.g., from morning to evening. Although existing models attempt to capture heterogeneities through predefined handcrafted features, multiple sets of parameters, and the fusion of spatial–temporal graphs, there are still some limitations. We propose a self-supervised learning-based traffic prediction framework called Traffic Prediction with Self-Supervised Learning (TPSSL) to address this issue. This framework leverages a spatial–temporal encoder for the prediction task and introduces adaptive data masking to enhance the robustness of the model against noise disturbances. Moreover, we introduce two auxiliary self-supervised learning paradigms to capture spatial heterogeneities and temporal heterogeneities, which also enrich the embeddings of the primary prediction task. We conduct experiments on four widely used traffic flow datasets, and the results demonstrate that TPSSL achieves state-of-the-art performance in traffic prediction tasks.


Introduction
The importance of traffic prediction in urban planning and management is self-evident. Accurate traffic predictions enable effective traffic management, reduce congestion, and enhance the sustainability of urban transport systems. In particular, grid-based traffic flow prediction plays a crucial role in understanding and managing the dynamics of urban mobility. Dividing cities into manageable segments and predicting the traffic flow within each segment allows for a more detailed analysis of traffic patterns, thus facilitating targeted interventions and infrastructure planning.
Over the years, traffic prediction methodologies have evolved through three major stages: traditional statistical models, machine learning techniques, and deep learning methods. Each stage represents a leap forward regarding prediction accuracy and the ability to handle complex spatial-temporal data. Adopting deep learning in traffic prediction marks a significant milestone, thus offering unprecedented levels of accuracy by leveraging large datasets and capturing intricate patterns in traffic flow. This evolution underscores the growing complexity of urban traffic systems and the increasing need for advanced predictive models to navigate the challenges of modern urban environments.
Traffic data inherently exhibit spatial and temporal heterogeneities, thus reflecting the variability of traffic flow across different regions and time steps. Figure 1 visualizes traffic flow data in Beijing. Figure 1a shows a heatmap of inflow at 9 a.m. on 1 March 2015 (Sunday). It vividly illustrates the differences in traffic volume between various areas, thus highlighting the concept of spatial heterogeneities. Spatial heterogeneities can arise from many factors, including road layouts, the positioning of transportation hubs (e.g., subway and train stations), the distribution of commercial and residential areas, and events specific to certain regions (e.g., sports events and concerts). Figure 1b shows the changes in inflow for two selected areas, A and B, on 1 March 2015 (Sunday) and 2 March 2015 (Monday). It reveals that traffic patterns in different areas change over time steps, e.g., from weekends to weekdays or from morning to midnight, thereby leading to temporal heterogeneities. These changes are crucial for understanding the dynamics of urban mobility and necessitate sophisticated prediction models capable of capturing such complexities. The spatial and temporal heterogeneities in traffic data not only challenge traditional prediction methods but also provide an opportunity to improve prediction accuracy by incorporating these heterogeneities into the modeling process. Therefore, acknowledging and modeling the spatial and temporal heterogeneities of traffic data are crucial for developing accurate and reliable traffic prediction models.
We have reviewed many studies and found that current traffic prediction models need to be improved with respect to capturing spatial and temporal heterogeneities. Some models attempt to incorporate temporal features (e.g., periodicity and holidays) into the model [1,2] to capture temporal heterogeneities. Still, these are predefined features that may not fully capture the complexities of urban traffic patterns. Predefined spatial heterogeneity features are typically obtained by graph embedding based on an adjacency matrix [3], thus overlooking the complexities and diversities of regions. This reliance on handcrafted features limits the models' ability to adapt and generalize across various spatial regions and time scales. Models that overlook spatial heterogeneities tend to favor popular areas with heavy traffic flow [4], thereby leading to an incomplete understanding of urban traffic flow. Some studies attempt to capture spatial heterogeneities using different parameters in different regions. Still, this involves many parameters and may lead to suboptimal solutions in nonuniform urban environments [5,6]. Meta learning techniques have recently been introduced into traffic prediction to capture spatial-temporal heterogeneities, but the model's effectiveness depends on predefined spatial and temporal features [7,8]. Methods that adopt spatial-temporal graphs address temporal nonuniformity [9,10] but assume that temporal heterogeneities across the same period are static, which does not reflect reality.
Furthermore, attempts to actively capture spatial and temporal heterogeneities within models frequently encounter challenges in balancing the granularity of representation with computational efficiency. When dealing with complex traffic flows that include large amounts of data, models struggle to strike a balance between oversimplified assumptions and enormous computational demands [11]. This gap highlights the necessity for innovative approaches that inherently understand and model spatial and temporal heterogeneities in traffic flow data.
To address the limitations above, we propose a novel self-supervised learning framework: Traffic Prediction with Self-Supervised Learning (TPSSL). First, this framework leverages a spatial-temporal encoder to encode the spatial and temporal dependencies of traffic data. Then, we introduce an adaptive data masking strategy to dynamically adjust the regions that need to be masked based on traffic data characteristics. Recognizing the complexities of capturing spatial and temporal heterogeneities in traffic data, we introduce two auxiliary self-supervised learning paradigms. The self-supervised learning paradigm based on soft clustering is responsible for exploring unique spatial patterns across different regions to learn spatial heterogeneities. It allows the model to identify and differentiate the unique traffic patterns across various urban areas without explicit labeling, thereby inferring meaningful clusters of spatial regions from the natural distribution of traffic data. Moreover, we adopt a self-supervised learning paradigm based on positive and negative samples to incorporate temporal heterogeneities into the model's feature space. This paradigm is designed to maintain dedicated representations of traffic dynamics, thus adapting to the variability in traffic flow across different time steps in a day.
The main contributions of this paper are summarized as follows:
• We propose a novel self-supervised learning framework to model spatial and temporal heterogeneities in urban traffic flow data. We offer a detailed understanding and new insights for other spatial-temporal prediction tasks, e.g., weather forecasting.
• We introduce an adaptive data masking strategy that dynamically adjusts the regions that need to be masked based on traffic data characteristics, thereby enhancing the model's robustness against noise disturbances and ensuring that the learned representations are accurate and generalizable across different traffic conditions.
• Our framework incorporates two auxiliary self-supervised learning tasks, which aim to enrich the model's feature space, thus allowing for a deeper exploration of the underlying patterns of spatial and temporal heterogeneities to enhance the primary traffic prediction task.
• We conduct experiments on several real-world public datasets, thus demonstrating the superiority of TPSSL by achieving state-of-the-art results. We also conduct ablation studies to illustrate the importance of the adaptive data masking strategy and the two self-supervised learning paradigms. Furthermore, we explain the effectiveness of TPSSL through case studies.

Related Work
Traffic prediction has undergone several stages of development, from traditional statistical models to machine learning methods and then to deep learning techniques. The advancements in deep learning techniques have brought breakthroughs to traffic prediction, thus attracting many researchers' attention. Self-supervised learning, a highly effective unsupervised learning paradigm widely used in various fields, has been introduced into traffic prediction. This section reviews the following research: (1) deep learning in traffic prediction and (2) self-supervised learning in representation learning.

Deep Learning in Traffic Prediction
Accurate traffic prediction is crucial for urban planning and traffic management, and deep learning has emerged as a powerful tool in this domain. Deep learning techniques, e.g., convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and attention mechanisms, have been widely applied to traffic prediction tasks [12]. CNNs have been effectively applied to capture spatial dependencies in traffic data, thus offering significant improvements over traditional methods. Zhang et al. [1] introduced ST-ResNet, a deep spatial-temporal residual network that leverages CNNs to forecast citywide crowd flows, thus showcasing the capability of CNNs to model complex spatial relationships within urban traffic systems. Traditional CNNs are unable to address sequence modeling problems, so Bai et al. [13] proposed the temporal convolutional network, which captures temporal dependencies in traffic data by introducing one-dimensional CNNs. RNNs and their variants, e.g., Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), have been widely adopted to model temporal dependencies in traffic data. Yao et al. [14] used LSTM to model the correlations between future traffic demand values and neighboring time steps. Li et al. [15] used GRU to model temporal dependencies and replaced matrix multiplication in GRU with diffusion convolution. GNNs have gained attention for their ability to model the graph-structured data commonly found in traffic networks. ChebNet is a spectral method, and Yu et al. [4] used Chebyshev first-order approximation graph convolution to obtain neighboring information for nodes. Due to their ability to model long-range dependencies without the sequential processing limitations of RNNs, attention mechanisms have been explored for traffic prediction. Inspired by attention-based models, Cai et al. [2] proposed a traffic transformer that predicts traffic flow for multiple time steps in parallel in a nonautoregressive manner. Attention mechanisms can capture both temporal and spatial dependencies, so Zheng et al. [3] proposed multiple attention mechanisms to jointly act on traffic prediction tasks. Due to the advantages of CNNs, RNNs, GNNs, and attention mechanisms, most studies tend to combine them to improve the accuracy of traffic prediction [2,4,14–17]. In existing research, most models focus on capturing spatial and temporal dependencies in traffic data, thus often focusing on popular areas in the city and overlooking less popular areas.
Recent advancements in traffic flow prediction models have demonstrated significant improvements by integrating cutting-edge deep learning techniques. Naheliya et al. [18] introduced the MFOA-Bi-LSTM by utilizing a modified firefly optimization algorithm to enhance the predictive capabilities of Bi-LSTMs through optimal hyperparameter tuning. Similarly, Redhu et al. [19] employed a particle swarm optimization-enhanced Bi-LSTM model, thus showcasing the potential of swarm intelligence in refining neural network performance for traffic prediction. Zhang et al. [20] proposed a Multiattention Hybrid Convolution Spatial-Temporal Recurrent Network (MHSRN), which integrates multiattention mechanisms with hybrid convolutional layers to capture complex spatial-temporal patterns effectively. Moreover, Chen et al. [21] developed a Traffic Flow Matrix-Based Graph Neural Network (TFM-GCAM) that employs a novel graph convolution strategy enhanced with attention mechanisms to improve the accuracy of traffic flow prediction. He et al. [22] presented a 3D dilated dense neural network that leverages multiscale dilated convolutions to address the spatiotemporal variations in traffic data more dynamically. Lastly, Bao et al. [23] introduced the Spatial-Temporal Complex Graph Convolution Network (ST-CGCN), which uses a complex correlation matrix to model the intricate relationships between traffic nodes, thereby enhancing both the spatial and temporal feature extraction capabilities.
Recent research efforts have begun to explore how to better capture the spatial and temporal heterogeneities within traffic systems using deep learning methods. Bai et al. [5] introduced an adaptive module, i.e., a data-adaptive graph generation module, to automatically infer the interdependencies among different traffic series, thus avoiding predefined graph structures. Pan et al. [6] adopted a matrix factorization approach in neural networks to decompose region-specific parameters into learnable matrices, thereby modeling latent region functionality and inter-region correlations. Guo et al. [11] represented spatial heterogeneity features by assigning an additional embedding vector to each region and learning these vectors through model training. The above methods learn spatial heterogeneities by applying unique parameters to different areas. However, this strategy results in many parameters and may yield suboptimal results in nonuniformly distributed urban environments. Meta learning techniques have also been introduced into traffic prediction to capture spatial-temporal heterogeneities, but their effectiveness still depends on predefined external spatial and temporal attributes [7,8]. Li et al. [9] generated a temporal graph and fused it with a spatial graph to form a spatial-temporal fusion graph. Song et al. [10] captured spatial-temporal heterogeneities in traffic data by constructing a local spatial-temporal graph. Although spatial-temporal graphs aim to capture heterogeneities, they often provide a relatively static representation. If heterogeneities in the traffic network change over time, these graphs may fail to capture dynamic heterogeneities. The above methods have made some progress in capturing spatial and temporal heterogeneities in traffic data, but there are still some limitations.

Self-Supervised Learning in Representation Learning
Self-supervised learning (SSL) is a technique used in representation learning [24] that allows models to automatically discover, from raw data, the representations needed for feature detection or classification. Unlike supervised learning, which requires manually annotated labels, SSL uses inherent structures in the data to generate supervisory signals. This method enables models to learn rich data representations by predicting unobserved or hidden parts of the input from the observed parts. Self-supervised learning has been used in various fields, including Natural Language Processing (NLP) and Computer Vision (CV). In NLP, SSL has been used to learn word embeddings or language models, such as BERT and GPT [25,26], from large, unannotated text corpora. In CV, SSL techniques such as SimCLR and MoCo [27,28] have been used to pretrain models on large image datasets, thus enabling them to recognize visual patterns and objects without relying on labeled datasets.
Contrastive learning and generative models are the two prominent methods used in SSL [29]. Contrastive learning methods learn representations by contrasting positive and negative sample pairs, thus pulling similar samples closer in the representation space and pushing dissimilar samples further apart. On the other hand, generative models focus on learning to reconstruct or generate data, thereby capturing the data distribution and learning features.
However, the application of self-supervised learning in traffic prediction still leaves room for improvement. Researchers have explored using self-supervised learning in traffic prediction, and their work has shown promising results. Ji et al. [30] adopted a self-supervised learning paradigm based on temporal continuity to examine the context information of traffic data, thereby better understanding and predicting the dynamic changes in traffic flow. Another study by Ji et al. [31] proposed a contrastive learning-based traffic prediction framework and learned the representation of traffic data through auxiliary tasks to improve traffic prediction accuracy. Our approach differs from these studies because we spatially model traffic flow data as regular grids rather than as a graph. Consequently, our self-supervised learning tasks focus more on learning the spatial and temporal features of regular grid-based data.

Methodology
In this section, we first clarify the key concepts and problem definition of grid-based short-term traffic prediction tasks, then introduce the overall architecture of TPSSL that we propose, and finally describe the critical components of the framework in detail.

Problem Definition
In addressing the grid-based short-term traffic prediction task, it is essential to clearly define key concepts and the specific formulation of the problem.
Definition 1. Spatial Region: A spatial region refers to a spatial area within a city designated for analysis. In grid-based traffic prediction models, the city is divided into numerous equally sized grids, each representing a spatial region. These regions are the basic units for collecting and analyzing traffic flow data within their boundaries.
Definition 2. Inflow/Outflow: Inflow denotes the quantity of traffic entering a spatial region within a specified time interval, which encompasses all forms of traffic movement, including vehicles, bicycles, or pedestrians. Conversely, outflow signifies the quantity of traffic exiting a spatial region within the same time interval.
We have historical traffic flow data $X = [x_{t-T+1}, x_{t-T+2}, \ldots, x_t]$, where $x_t \in \mathbb{R}^{M \times N \times 2}$ represents the traffic flow matrix at time step $t$. $M$ and $N$ denote the numbers of rows and columns of the grid into which the city is divided. The value of 2 represents the number of channels, where channel 0 denotes inflow and channel 1 denotes outflow. The objective of the short-term traffic flow prediction problem is to obtain the traffic flow matrix $y_{t+1} \in \mathbb{R}^{M \times N \times 2}$ at time step $t + 1$. The problem can be formally described as
$$y_{t+1} = f(X),$$
where $f(\cdot)$ represents the traffic prediction model that maps historical traffic flow data $X$ to future traffic flow data $y_{t+1}$ at the next time step.

Architecture
We propose a traffic prediction model called TPSSL. Its purpose is to improve the accuracy of traffic flow data prediction by capturing spatial and temporal heterogeneities through self-supervised learning. As seen in Figure 2, the overall architecture of TPSSL consists of four key modules: a spatial-temporal encoder, adaptive data masking, spatial heterogeneity modeling, and temporal heterogeneity modeling. The spatial-temporal encoder generates a similarity matrix and prediction embeddings while capturing the spatial-temporal dependencies in traffic flow data. Adaptive data masking enhances the model's robustness by dynamically selecting spatial regions to be masked. Spatial heterogeneity modeling and temporal heterogeneity modeling delve deeper into the complexity of traffic data, thus capturing spatial and temporal heterogeneities in traffic flow data and enriching the feature space of the model.

Spatial-Temporal Encoder
The spatial-temporal encoder in our model is designed to effectively capture both spatial and temporal dependencies of traffic flow data, thus providing rich spatial-temporal embeddings for subsequent modules. The encoder is composed of several essential layers, each of which uniquely contributes to the overall ability of the model to process and interpret traffic flow data.
Initially, the traffic data undergo processing through two 3D convolutional layers. The 3D convolutional layers handle data across spatial and temporal dimensions, thereby allowing interactions between neighboring regions and time steps to extract features that reflect traffic flow dynamics. The following formula can summarize this sequential processing:
$$X' = \mathrm{Conv3D}(\mathrm{Conv3D}(X)),$$
where $X \in \mathbb{R}^{T \times M \times N \times 2}$ represents the input traffic flow data, $X' \in \mathbb{R}^{T \times M \times N \times D}$ denotes the embedding after processing by the convolutional layers, and $D$ represents the embedding size.
Next, an essential aspect of the encoder is the computation of the similarity matrix $A \in \mathbb{R}^{T \times M \times N}$ derived from the embedding $X'$. This matrix is intended for use in adaptive data masking, thus facilitating the augmentation of the model's training data by emphasizing similarities between traffic patterns. The calculation of the similarity matrix is as follows:
$$A = \mathrm{Softmax}(\mathrm{AvgPool3D}(X')),$$
where AvgPool3D refers to the average pooling operation across the feature channels. Softmax is applied to normalize the values and emphasize the relative importance of different time steps in the traffic data.
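The similarity computation can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `similarity_matrix` is hypothetical, the pooling is reduced to a plain mean over the $D$ feature channels, and the Softmax is assumed to run over the time axis, which the text suggests but does not state explicitly.

```python
import numpy as np

def similarity_matrix(x_emb: np.ndarray) -> np.ndarray:
    """Map an embedding of shape (T, M, N, D) to a matrix A of shape (T, M, N)."""
    # Average pooling across the D feature channels.
    pooled = x_emb.mean(axis=-1)
    # Assumption: Softmax over the time axis, so each region's scores sum to 1 over T.
    shifted = pooled - pooled.max(axis=0, keepdims=True)  # for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
x_emb = rng.normal(size=(4, 3, 3, 8))  # toy embedding: T=4, M=N=3, D=8
A = similarity_matrix(x_emb)
```

Under this assumption, every region's scores form a probability distribution over the $T$ input time steps.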
Then, the core of the spatial-temporal encoder is the Convolutional LSTM (ConvLSTM) layer [32], which has been chosen for its proficiency in capturing spatial and temporal dependencies within the data. Unlike the standard LSTM, which processes temporal data, ConvLSTM extends its capability to spatial dimensions, thus making it particularly suitable for traffic prediction tasks where spatial relationships are crucial. The ConvLSTM layer effectively integrates spatial information with temporal dynamics, thus enhancing the model's predictive performance. Following processing by the ConvLSTM layer, we obtain a richer spatial-temporal embedding $H \in \mathbb{R}^{M \times N \times D}$, which is an important input for subsequent modules.

Adaptive Data Masking
The adaptive data masking module is pivotal with respect to enhancing the robustness and generalization capability of our traffic prediction model. Unlike traditional random masking techniques, we design a targeted data masking strategy employing the similarity matrix A obtained from the spatial-temporal encoder. This strategy ensures that the augmentation focuses on the most informative parts of the traffic flow data, thereby forcing the model to learn under realistic and challenging traffic scenarios.
The similarity matrix $A$ represents the normalized importance of each spatial region at each time step. We mask each region with a probability that decreases with its similarity score, meaning regions with lower similarity scores are more likely to be masked. This is achieved by calculating a masking probability distribution from $A$, where the probability of masking a given spatial region is higher if its corresponding similarity score is lower. Formally, the masking probability for each spatial region is determined as follows:
$$P_{t,i,j} = 1 - A_{t,i,j},$$
where $P_{t,i,j}$ represents the masking probability for spatial region $r_{i,j}$ at time step $t$, and $A_{t,i,j}$ denotes the corresponding element in the similarity matrix $A$.
The masking operation involves selecting regions to be masked based on $P$, and a predefined masking ratio determines the total number of masked regions. The inflow and outflow of the selected spatial regions are then set to zero, thus simulating the absence of traffic flow information in these regions. This approach challenges the model to make predictions without specific data and encourages it to leverage its understanding of spatial and temporal dependencies to fill in the missing information. The augmented data obtained through adaptive data masking are denoted as $\tilde{X}$. The embedding obtained after $\tilde{X}$ passes through the spatial-temporal encoder is denoted by $\tilde{H}$.
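The masking step can be illustrated with the following NumPy sketch. It rests on several assumptions that are not fixed by the text: the masking probability is taken as one minus the similarity score, regions are sampled without replacement over the whole input window rather than per time step, and the function name `adaptive_mask` and the default `mask_ratio` are hypothetical.

```python
import numpy as np

def adaptive_mask(x, A, mask_ratio=0.1, rng=None):
    """Zero out low-similarity regions.

    x: traffic data of shape (T, M, N, 2); A: similarity scores of shape (T, M, N).
    Returns the augmented copy of x and the boolean mask of shape (T, M, N).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: lower similarity -> higher masking probability.
    p = (1.0 - A).ravel()
    p = p / p.sum()  # normalize into a sampling distribution
    n_mask = int(round(mask_ratio * p.size))  # total count set by the masking ratio
    idx = rng.choice(p.size, size=n_mask, replace=False, p=p)
    mask = np.zeros(p.size, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(A.shape)
    x_aug = x.copy()
    x_aug[mask] = 0.0  # sets both inflow and outflow of the selected regions to zero
    return x_aug, mask

rng = np.random.default_rng(0)
x = rng.random((4, 5, 5, 2)) + 1.0  # strictly positive toy flows
A = rng.random((4, 5, 5))           # stand-in similarity scores in [0, 1)
x_aug, mask = adaptive_mask(x, A, mask_ratio=0.2, rng=rng)
```

Sampling without replacement guarantees that exactly the requested fraction of region-time cells is masked, while the input array itself is left untouched.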

Spatial Heterogeneity Modeling
As illustrated in Figure 2, spatial heterogeneity modeling is a crucial component of our traffic prediction framework. We designed a self-supervised learning task based on soft clustering to capture the underlying spatial heterogeneities in traffic data through self-supervised signals, as shown in Figure 3. Specifically, we mapped the embeddings of different spatial regions to prototypes corresponding to different urban functions (e.g., residential areas, office areas, transportation hubs). We obtained the embeddings of the original data and the augmented data through the spatial-temporal encoder, which are denoted as $H$ and $\tilde{H}$, respectively. We will refer to $H$ and $\tilde{H}$ as the original and augmented embeddings. The original embedding and the augmented embedding of the region $r_{i,j}$ are denoted as $h_{i,j}$ and $\tilde{h}_{i,j}$, respectively. The prototypes representing the $K$ clusters are denoted as $\{c_1, \ldots, c_K\}$. The following formula achieves the clustering results of the augmented embedding:
$$\tilde{z}_{i,j,k} = c_k^{\top} \tilde{h}_{i,j},$$
where $\tilde{z}_{i,j,k}$ represents the similarity score between the augmented embedding $\tilde{h}_{i,j}$ of region $r_{i,j}$ and the prototype $c_k$. Thus, the clustering assignment of region $r_{i,j}$ can be represented as $\tilde{z}_{i,j} = (\tilde{z}_{i,j,1}, \ldots, \tilde{z}_{i,j,K})$. Similarly, $\hat{z}_{i,j,k}$ is the similarity score between the original embedding $h_{i,j}$ and the prototype $c_k$: $\hat{z}_{i,j,k} = c_k^{\top} h_{i,j}$. We designed the learning task to maximize the similarity of the original embedding $h_{i,j}$ and the augmented embedding $\tilde{h}_{i,j}$ in the clustering space. The following formula can express the optimization process:
$$\ell(h_{i,j}, \tilde{z}_{i,j}) = -\sum_{k=1}^{K} \tilde{z}_{i,j,k} \log \frac{\exp(\hat{z}_{i,j,k} / \tau)}{\sum_{k'=1}^{K} \exp(\hat{z}_{i,j,k'} / \tau)},$$
where $\tau$ is the temperature parameter, which controls the sharpness of the distribution output by the Softmax function. The sum of the loss functions for all regions is used as the final loss of the model, i.e.,
$$\mathcal{L}_s = \sum_{i=1}^{M} \sum_{j=1}^{N} \ell(h_{i,j}, \tilde{z}_{i,j}).$$
By minimizing the cross-entropy of the original embedding and the augmented embedding in the clustering space, these two types of embeddings are made as close as possible regarding clustering assignments.
In the above approach, we generated the clustering assignment matrices $\tilde{Z} \in \mathbb{R}^{M \times N \times K}$ and $\hat{Z} \in \mathbb{R}^{M \times N \times K}$ to serve as self-supervised signals for spatial heterogeneity modeling. We must address two issues to ensure that the regional features conform to the proper distribution of urban space. First, we need to ensure that the cluster assignments of each region sum to 1. Second, we must avoid situations where all areas receive the same assignment. To address these two issues, we introduced the Sinkhorn algorithm [33], which is a regularization-based optimization method. It was used to adjust the clustering assignment matrices to satisfy certain normalization conditions, i.e., the sum of the assignments for each spatial region over all clusters is 1, and the sum for each cluster over all spatial regions is also 1. By alternately normalizing over the spatial region and cluster dimensions, the Sinkhorn algorithm can achieve a balanced clustering assignment strategy. Using Equation (7), we applied the Sinkhorn algorithm to $\tilde{Z}$ and $\hat{Z}$ and replaced the original assignment matrices with the results of the algorithm.
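A compact NumPy sketch may help make the soft-clustering objective and the alternating normalization concrete. It is an illustration under stated assumptions, not the authors' code: regions are flattened to $R = M \times N$ rows, the Sinkhorn routine follows the commonly used formulation in which each cluster receives an equal share of the total mass and each region's assignments sum to 1, only the target assignments from the augmented view are balanced, and all names and defaults (`sinkhorn`, `spatial_loss`, `eps`, `n_iters`, `tau`) are hypothetical.

```python
import numpy as np

def sinkhorn(scores, eps=0.05, n_iters=3):
    """Balance region-prototype scores (R, K) into soft assignments (R, K)."""
    q = np.exp((scores - scores.max()) / eps).T  # (K, R); shift for stability
    q /= q.sum()
    n_clusters, n_regions = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # normalize over regions for each cluster
        q /= n_clusters
        q /= q.sum(axis=0, keepdims=True)  # normalize over clusters for each region
        q /= n_regions
    return (q * n_regions).T  # each region's assignments now sum to 1

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def spatial_loss(h, h_aug, prototypes, tau=0.5):
    """Cross-entropy between balanced targets (augmented view) and
    temperature-scaled predictions (original view), summed over regions."""
    z_hat = h @ prototypes.T        # scores of the original embeddings, (R, K)
    z_tilde = h_aug @ prototypes.T  # scores of the augmented embeddings, (R, K)
    targets = sinkhorn(z_tilde)
    return float(-(targets * log_softmax(z_hat / tau)).sum())

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
R, D, K = 12, 8, 3
h = unit(rng.normal(size=(R, D)))
h_aug = unit(h + 0.1 * rng.normal(size=(R, D)))
prototypes = unit(rng.normal(size=(K, D)))
q = sinkhorn(h_aug @ prototypes.T)
loss = spatial_loss(h, h_aug, prototypes)
```

The final normalization step is over clusters, so the per-region constraint holds exactly, while the per-cluster balance is only approximate after a small number of iterations.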

Temporal Heterogeneity Modeling
To inject temporal heterogeneities into TPSSL, we designed a self-supervised learning task based on contrastive learning, as shown in Figure 4. This task aims to identify and capture changes in traffic patterns at different time steps through contrastive learning, thereby enhancing the model's dynamic understanding of time. First, we fused the original embedding $h_{t,i,j}$ and the augmented embedding $\tilde{h}_{t,i,j}$ of region $r_{i,j}$ at time step $t$ to obtain the region-level embedding $u_{t,i,j}$:
$$u_{t,i,j} = w_1 \odot h_{t,i,j} + w_2 \odot \tilde{h}_{t,i,j},$$
where $w_1$ and $w_2$ are learnable weights, and $\odot$ denotes elementwise multiplication. Then, we generated the city-level embedding $s_t$ based on $u_{t,i,j}$. Specifically, we averaged $u_{t,i,j}$ across its spatial dimensions and applied a sigmoid activation function $\sigma$ to obtain $s_t$:
$$s_t = \sigma\left(\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} u_{t,i,j}\right).$$
Subsequently, we used the city-level embedding $s_t$ as the summary information, the region-level embedding $u_{t,i,j}$ as the positive sample, and the region-level embedding $u_{t',i,j}$ at other time steps as the negative sample. We introduced a bilinear discriminator to evaluate the congruence of the summary information $s_t$ with the positive and negative samples. The congruence score of the summary information $s_t$ with the positive sample $h_{t,i,j}$ obtained through the discriminator can be calculated using the following formula:
$$g(h_{t,i,j}, s_t) = h_{t,i,j}^{\top} W s_t + b,$$
where $W \in \mathbb{R}^{D \times D}$ is a learnable weight matrix, and $b$ is a bias term. To optimize temporal heterogeneity modeling, we contrasted the congruence scores of the summary information $s_t$ with the positive sample $h_{t,i,j}$ and the negative sample $h_{t',i,j}$:
$$\mathcal{L}(s_t, h_{t,i,j}, h_{t',i,j}) = -\left[\log \sigma(g(h_{t,i,j}, s_t)) + \log\left(1 - \sigma(g(h_{t',i,j}, s_t))\right)\right].$$
The sum of the loss functions for all regions is used as the final loss:
$$\mathcal{L}_t = \sum_{i=1}^{M} \sum_{j=1}^{N} \mathcal{L}(s_t, h_{t,i,j}, h_{t',i,j}).$$
This positive and negative sample contrastive learning mechanism ensures that the prediction results are consistent with the traffic pattern at a specific time step while distinguishing other traffic patterns at different time steps and learning temporal heterogeneities.
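The contrastive objective can be sketched as follows. This is a minimal NumPy illustration under assumptions: regions are flattened to $R = M \times N$ rows, the positive and negative samples are region-level embeddings at time steps $t$ and $t'$, the loss takes the binary cross-entropy form in which positive scores are pushed up and negative scores pushed down, and the function names and the small constant `eps` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_loss(u_pos, u_neg, W, b=0.0, eps=1e-12):
    """Contrast region embeddings at time t (positive) and t' (negative)
    against the city-level summary of time t.

    u_pos, u_neg: arrays of shape (R, D); W: bilinear weight of shape (D, D).
    """
    s = sigmoid(u_pos.mean(axis=0))  # city-level summary, shape (D,)
    g_pos = u_pos @ W @ s + b        # bilinear congruence scores, shape (R,)
    g_neg = u_neg @ W @ s + b
    per_region = -(np.log(sigmoid(g_pos) + eps) + np.log(1.0 - sigmoid(g_neg) + eps))
    return float(per_region.sum())   # summed over all regions

rng = np.random.default_rng(0)
R, D = 9, 4
u_pos = rng.normal(size=(R, D))
u_neg = rng.normal(size=(R, D))
W = 0.1 * rng.normal(size=(D, D))
loss = temporal_loss(u_pos, u_neg, W)
# With W = 0 and b = 0 every score is 0, so each region contributes 2 * log(2).
loss_zero = temporal_loss(u_pos, u_neg, np.zeros((D, D)))
```

The zero-weight case is a convenient sanity check, since the discriminator then cannot tell positives from negatives and the loss sits at its chance-level value.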

Model Training
In TPSSL, we used a Multilayer Perceptron (MLP) to predict traffic flow, which can be expressed by the following formula:
$$\hat{y}_{t+1,i,j} = \mathrm{MLP}(h_{t,i,j}),$$
where $\hat{y}_{t+1,i,j}$ represents the predicted traffic flow value for region $r_{i,j}$ at time step $t + 1$. The prediction loss $\mathcal{L}_p$ is calculated using the mean absolute error:
$$\mathcal{L}_p = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( \lambda \left| y_{t+1,i,j}^{(0)} - \hat{y}_{t+1,i,j}^{(0)} \right| + (1 - \lambda) \left| y_{t+1,i,j}^{(1)} - \hat{y}_{t+1,i,j}^{(1)} \right| \right),$$
where $\lambda$ is a hyperparameter used to balance the traffic flow prediction values of different channels, $y_{t+1,i,j}$ represents the true traffic flow value for region $r_{i,j}$ at time step $t + 1$, and the superscripts $(0)$ and $(1)$ index the inflow and outflow channels. Finally, the overall loss function $\mathcal{L}$ of TPSSL is the weighted sum of the three loss functions:
$$\mathcal{L} = \alpha \mathcal{L}_p + \beta \mathcal{L}_s + \gamma \mathcal{L}_t,$$
where $\alpha$, $\beta$, and $\gamma$ are weights. We adopted a dynamic weight adjustment mechanism to accommodate the varying scales and complexities of different tasks, i.e., the dynamic weight averaging (DWA) technique. Initially, the weights $\alpha$, $\beta$, and $\gamma$ were all set to 1, thus providing equal importance to each loss. The DWA technique recalibrates the weights based on the relative learning progress of each task, thus ensuring a balanced optimization among different modules.
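The weight recalibration can be illustrated with a short sketch of dynamic weight averaging as it is commonly formulated: each task's weight is a softmax over the ratio of its two most recent losses, scaled so that the weights sum to the number of tasks. The function name `dwa_weights`, the temperature value, and the toy loss values are assumptions for illustration; the exact settings used in TPSSL are not given in the text.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Return one weight per task from the last two recorded losses of each task.

    A task whose loss has decreased more slowly (ratio closer to or above 1)
    receives a larger weight. The weights sum to the number of tasks.
    """
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [len(ratios) * e / total for e in exps]

# Equal progress on all three losses reproduces the initial weights [1, 1, 1].
equal = dwa_weights([0.5, 0.5, 0.5], [1.0, 1.0, 1.0])

# Toy values: prediction, spatial, and temporal losses at the last two epochs.
alpha, beta, gamma = dwa_weights([0.9, 0.5, 0.8], [1.0, 1.0, 1.0])
```

In the toy example the prediction loss has improved the least, so it receives the largest weight in the next epoch.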
The training process of TPSSL can be summarized as follows: First, the original traffic flow data $X$ is input into the spatial-temporal encoder, thus obtaining the original embedding $H$ and the similarity matrix $A$. Then, the adaptive data masking module utilizes the similarity matrix $A$ to generate augmented data $\tilde{X}$. $\tilde{X}$ is input into the spatial-temporal encoder, thereby obtaining the augmented embedding $\tilde{H}$. Next, $H$ and $\tilde{H}$ are fed into the spatial heterogeneity modeling module, the temporal heterogeneity modeling module, and the MLP to obtain the final loss function $\mathcal{L}$. Finally, we optimize the model's parameters using the backpropagation algorithm to minimize the loss function $\mathcal{L}$.

Experiment
In this section, we first introduce four datasets and evaluation metrics used in the experiments and then describe the baseline models and the details of the implementation of TPSSL. Finally, we evaluate the performance of TPSSL through comparative experiments, ablation studies, and case studies.

Data Description
We utilized four publicly available traffic flow datasets: BJTaxi [1], NYCBike1 [1], NYCBike2 [34], and NYCTaxi [34]. The NYCBike1 and NYCBike2 datasets are based on the bike rental systems of New York City, while the BJTaxi and NYCTaxi datasets are based on the taxi systems of Beijing and New York City, respectively. A detailed overview of each dataset, including the number of grids, time intervals, start and end dates, and the number of bikes or taxis, is provided in Table 1. These datasets differ in geographical location, time span, and traffic volume, which enables our model to be comprehensively evaluated across various traffic conditions.
These datasets were constructed using a sliding window strategy to generate input-output pairs. The input data comprise traffic flow data for the four hours preceding the predicted time step, traffic flow data from the same time step on the previous three days, and the two hours before and after that time step. After the generation of input-output pairs, which preserve the continuous chronological order, each dataset was divided into training, validation, and testing sets with a ratio of 7:1:2. Specifically, the first 70% of the sequentially ordered data was allocated for training, and the subsequent 10% and 20% were used for validation and testing, respectively. This preserves the original temporal order, maintains the inherent time series structure, and prevents data leakage.
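The sample construction and the chronological split can be sketched as follows. This is an illustration under stated assumptions rather than the exact preprocessing code: we read the description as one window of four recent hours plus, for each of the previous three days, a window spanning two hours before and after the same time of day. The function names and the `steps_per_hour` parameter (which depends on each dataset's time interval) are hypothetical.

```python
def input_indices(t, steps_per_hour, recent_hours=4, days=3, around_hours=2):
    """Frame indices used as input when predicting frame t (assumed layout)."""
    steps_per_day = 24 * steps_per_hour
    half = around_hours * steps_per_hour
    periodic = []
    for d in range(days, 0, -1):  # oldest day first
        center = t - d * steps_per_day  # same time of day, d days earlier
        periodic.extend(range(center - half, center + half + 1))
    recent = list(range(t - recent_hours * steps_per_hour, t))
    return periodic + recent

def build_samples(n_frames, steps_per_hour):
    """Slide over the series and keep every target with a complete history."""
    first_t = -min(input_indices(0, steps_per_hour))
    return [(input_indices(t, steps_per_hour), t) for t in range(first_t, n_frames)]

def chronological_split(samples, ratios=(7, 1, 2)):
    """Split ordered samples into train/validation/test without shuffling."""
    n, total = len(samples), sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# Example with hourly frames over 60 days (hypothetical sizes).
samples = build_samples(n_frames=60 * 24, steps_per_hour=1)
train, val, test = chronological_split(samples)
```

Because the split is taken on the ordered samples, every validation and test target lies later in time than every training target.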

Evaluation Metrics
To evaluate the accuracy of TPSSL, we used two widely accepted metrics: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). Both metrics are essential to assess the performance of traffic flow predictions, with lower values indicating better predictive performance. The MAE measures the average magnitude of prediction errors and is calculated as follows:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i,j} \left| y_{i,j} - \hat{y}_{i,j} \right| \tag{16}$$
where $y_{i,j}$ and $\hat{y}_{i,j}$ represent the true and predicted values, respectively, and $N$ is the number of regions over which the sum is taken. The MAPE provides a percentage measure of predictive accuracy, which is particularly useful for understanding the magnitude of prediction errors relative to the true values. It is defined as
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i,j} \left| \frac{y_{i,j} - \hat{y}_{i,j}}{y_{i,j}} \right| \tag{17}$$
where $y_{i,j}$ and $\hat{y}_{i,j}$ have the same meaning as in Equation (16).
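Both metrics can be computed with a few lines of code. The sketch below operates on flattened lists of true and predicted values and reports the MAPE in percent. Skipping entries whose true value is zero is an assumption introduced here to avoid division by zero; the text does not state how such entries are handled.

```python
def mae(y_true, y_pred):
    """Mean absolute error over paired true/predicted values."""
    errors = [abs(y - p) for y, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

def mape(y_true, y_pred, min_true=0.0):
    """Mean absolute percentage error, in percent.

    Entries with |y| <= min_true are skipped (assumed convention) because the
    relative error is undefined when the true value is zero.
    """
    ratios = [abs((y - p) / y) for y, p in zip(y_true, y_pred) if abs(y) > min_true]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical flows for three regions.
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 40.0]
print(mae(y_true, y_pred), mape(y_true, y_pred))
```

Raising `min_true` above zero excludes very low flows as well, which keeps a few near-empty regions from dominating the percentage error.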

Baselines
To evaluate the performance of TPSSL, we compared it against a series of baseline models encompassing traditional time series models, machine learning algorithms, and deep learning models. These models are categorized as follows:
Traditional Models:
• Autoregressive Integrated Moving Average (ARIMA) [35]: It is a classic model in time series forecasting that combines autoregressive, differencing, and moving average components to model various time series data.
• Support Vector Regression (SVR) [36]: It provides a powerful mechanism for capturing linear relationships in data by using support vector machines for regression tasks.
Dependency-Aware Traffic Prediction Models:
• Spatiotemporal Residual Network (ST-ResNet) [1]: It captures the spatial and temporal dependencies of traffic data through residual connections and convolutional operations.
• Spatiotemporal Graph Convolutional Network (STGCN) [4]: It integrates graph convolutional networks with temporal convolutional networks, thus simultaneously modeling spatial and temporal dependencies in traffic data.
• Graph Multiattention Network (GMAN) [3]: It introduces multiple attention mechanisms, thus allowing the model to dynamically adjust its focus on different regions and time steps of the traffic network.
Heterogeneity-Aware Traffic Prediction Models:
• Adaptive Graph Convolutional Recurrent Network (AGCRN) [5]: It combines node-adaptive parameter learning and data-adaptive graph generation modules to automatically capture fine-grained spatial and temporal correlations without predefined graphs.
• Spatial-Temporal Synchronous Graph Convolutional Network (STSGCN) [10]: It captures complex local spatial-temporal correlations through a synchronous modeling mechanism and the heterogeneities of local spatial-temporal graphs through multiple modules at different time periods.
• Spatial-Temporal Fusion Graph Neural Network (STFGNN) [9]: It generates a temporal graph and fuses it with the spatial graph to process data from different periods in parallel, thus effectively learning hidden spatial-temporal dependencies.
These baseline models provide a wide range of approaches to traffic flow prediction, from traditional methods to state-of-the-art models that integrate complex spatial and temporal dependencies and heterogeneities. The heterogeneity-aware traffic prediction models capture the complexity and diversity of traffic data by assigning different parameters to different regions and time steps, which makes them particularly useful for traffic prediction tasks.

Implementation Details
The TPSSL model was built using the PyTorch framework, and we carried out all experiments on a single GeForce RTX 4090 GPU. The model has an embedding size of 64, and all convolution operations adopt a kernel size of three, which balances model complexity and computational efficiency. We used an adaptive data masking strategy with a masking rate of 0.1 to introduce variations into the training data without significant information loss. For efficient convergence, the training process leverages the adaptive learning rate capabilities of the Adam optimizer. The hyperparameters were set as follows: the learning rate was 0.001, the weight decay was 0, the batch size was 32, and the number of training epochs was 100. We used an early stopping strategy, which terminates the training process if the loss value on the validation set does not improve for 15 consecutive epochs.
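The early stopping rule can be sketched independently of the model. The configuration values below are those reported above; the class name `EarlyStopping`, the dictionary keys, and the sequence of validation losses are hypothetical and serve only to illustrate the patience mechanism.

```python
# Hyperparameters as reported in the text (the key names are our own).
CONFIG = {
    "embedding_size": 64,
    "kernel_size": 3,
    "mask_rate": 0.1,
    "learning_rate": 0.001,
    "weight_decay": 0.0,
    "batch_size": 32,
    "max_epochs": 100,
    "patience": 15,
}

class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Usage with a hypothetical sequence of validation losses.
stopper = EarlyStopping(patience=CONFIG["patience"])
for epoch, val_loss in enumerate([0.50, 0.45, 0.46, 0.44]):
    if stopper.step(val_loss):
        break
```

In a full training loop, the parameters that achieved `stopper.best` would typically be saved and restored for testing; that step is omitted here.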

Results
In this study, we evaluated the performance of TPSSL on four widely used public traffic flow datasets: BJTaxi, NYCBike1, NYCBike2, and NYCTaxi. We compared TPSSL against a diverse set of baseline models, including traditional models such as ARIMA and SVR; dependency-aware traffic prediction models such as ST-ResNet, STGCN, and GMAN; and heterogeneity-aware traffic prediction models such as AGCRN, STSGCN, and STFGNN. Additionally, we included ConvLSTM, the backbone network of the spatial-temporal encoder, as a baseline to demonstrate the effectiveness of our self-supervised learning paradigms. To ensure fairness, we trained ConvLSTM and TPSSL with five different seeds, in the same way as the baseline models, whose results come from Ji et al. [31].
Our results show that TPSSL outperformed all other models on all datasets in terms of both the MAE and the MAPE. In Tables 2 and 3, bolded numbers represent the best results, and underlined numbers represent the second-best results. This success was mainly due to our choice of an appropriate backbone model for the spatial-temporal encoder. ConvLSTM alone also performs well when making spatial-temporal predictions of traffic data, as seen from the underlined results in Tables 2 and 3. However, the two self-supervised learning tasks introduced in TPSSL further improved the predictive performance of ConvLSTM.
Moreover, we observed some interesting phenomena from Tables 2 and 3. Deep learning-based traffic prediction models were found to be far superior to traditional time series and machine learning methods regarding prediction accuracy. Additionally, there was no strict distinction between dependency-aware and heterogeneity-aware models regarding their predictive performance. They exhibited different strengths on different datasets. On the BJTaxi dataset, the predictive performance of heterogeneity-aware models was worse than that of the dependency-aware models. We believe heterogeneity-aware models introduce additional parameter space, thus affecting the model's judgment of dependencies while attempting to capture heterogeneities. In contrast, the proposed TPSSL framework uses independent modules to capture dependencies and heterogeneities without affecting each other. This indicates that the self-supervised learning paradigms in TPSSL are very effective in traffic flow prediction tasks. It also suggests that incorporating self-supervised learning into traffic prediction models could be a promising direction for future research.
In a broader comparison across baseline models, TPSSL outperformed traditional models like ARIMA and SVR, which, while robust in simpler scenarios, struggled with the complex spatial and temporal dynamics that are typical of urban traffic data. Such observations underscore the limitations of models that fail to integrate advanced spatial-temporal mechanisms.
Among the deep learning approaches, TPSSL showed clear advantages over models such as ST-ResNet, STGCN, GMAN, AGCRN, STSGCN, and STFGNN. Unlike these models, which may excel in spatial or temporal settings but not uniformly across both, TPSSL's architecture allows it to adeptly manage and synthesize these dimensions. The effectiveness of TPSSL was particularly notable in environments with intricate spatial-temporal interactions, where it maintained high accuracy and robustness, thus suggesting a superior ability to generalize across varied traffic conditions.
Each model brings certain strengths to traffic prediction: ST-ResNet and STGCN are praised for their spatial and temporal resolution; GMAN is known for its attention mechanisms that finely tune its focus across the network; AGCRN adapts well to dynamic graph structures; STSGCN synchronizes spatial-temporal elements effectively; and STFGNN explores novel graph fusion techniques for enhanced prediction. Unlike STGCN, GMAN, AGCRN, and STFGNN, which utilize complex graph-based approaches, ST-ResNet and TPSSL employ grid-based data structures. TPSSL differentiates itself by integrating adaptive data masking and heterogeneity-aware modules that optimize spatial and temporal dependencies within this grid framework. The integration of these features reduces the computational demands compared to graph-based models and improves prediction accuracy, thereby enabling TPSSL to consistently excel in head-to-head comparisons on inflow and outflow predictions across all listed datasets.
The distinct modular approach of TPSSL, which independently but cohesively handles both spatial and temporal data variances, sets it apart from other models. This dual capability positions it as a benchmark model in traffic flow prediction and a highly adaptive framework suitable for the evolving demands of urban traffic management and planning.

Ablation Study
To analyze the impact of each submodule on the performance of TPSSL, we conducted ablation studies. We proposed three variants for the ablation study, i.e., TPSSL-SHM, TPSSL-THM, and TPSSL-RM. TPSSL-SHM disables the temporal heterogeneity modeling module in TPSSL, while TPSSL-THM disables the spatial heterogeneity modeling module in TPSSL. TPSSL-RM uses a random data masking strategy to replace the adaptive one.
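For reference, the random masking used by the TPSSL-RM variant can be sketched as below. This is a minimal illustration that assumes masked cells are set to zero and that the masking rate equals the 0.1 used for the adaptive strategy; the function name `random_mask` and the zero-fill convention are assumptions, and the adaptive strategy itself (which relies on the similarity matrix) is not reproduced here.

```python
import random

def random_mask(grid, rate=0.1, seed=None):
    """Mask a fixed fraction of grid cells chosen uniformly at random.

    grid: 2D list of flow values (rows x columns)
    rate: fraction of cells to mask
    Returns the masked copy and the set of masked (row, column) positions.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(grid) for j in range(len(row))]
    n_mask = round(rate * len(cells))
    masked = set(rng.sample(cells, n_mask))
    # Masked cells are replaced by 0.0 (assumed convention); others are kept.
    out = [[0.0 if (i, j) in masked else v for j, v in enumerate(row)]
           for i, row in enumerate(grid)]
    return out, masked

# Example on a hypothetical 10 x 10 grid of unit flows.
grid = [[1.0] * 10 for _ in range(10)]
augmented, positions = random_mask(grid, rate=0.1, seed=0)
```

Here every cell has the same probability of being masked, whereas the adaptive strategy chooses the cells to mask based on the learned similarities.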
Figure 5 shows the results of the ablation study. The results indicate that each submodule plays a significant role in the model's performance. Specifically, the TPSSL-SHM variant, which lacks temporal heterogeneity modeling, tended to perform worse than the full TPSSL model, with increases in both the MAE and MAPE across all datasets. This was particularly evident in the outflow predictions for NYCTaxi, thus underscoring the importance of temporal heterogeneity modeling in traffic prediction tasks. When spatial heterogeneity modeling was removed from TPSSL (the TPSSL-THM variant), there was also a decline in performance. This effect was observed across all datasets for inflow and outflow predictions, which underscores the significance of spatial heterogeneity modeling in understanding the complex patterns of urban traffic. The TPSSL-RM variant, which employs a random masking strategy, showed an inferior performance compared to the adaptive strategy used in TPSSL. This was consistent across all datasets, thus reinforcing the value of the adaptive data masking strategy in improving prediction accuracy.
Despite the performance improvements of TPSSL over its variants, the error rates may still appear relatively high. This can be attributed to the inherent complexity and variability of urban traffic data across different datasets and channels (inflow/outflow). The datasets used in our studies represent a range of urban environments and traffic conditions that can influence the predictability and, hence, the resulting error metrics. Furthermore, the auxiliary self-supervised tasks of spatial and temporal heterogeneity modeling and the masking-based data augmentation strategy are designed to enhance the generalization and robustness of the primary prediction task. Still, they do not independently determine the model's overall predictive accuracy. In practice, the primary predictive performance of TPSSL during testing is derived from its core component, the spatial-temporal encoder, without the involvement of the auxiliary tasks.

Case Study
To further validate the performance of TPSSL, we conducted case studies on the BJTaxi dataset. The BJTaxi dataset, comprising detailed geotagged taxi trajectories within Beijing, provides a pertinent example due to its extensive coverage of densely populated urban areas and less congested suburban zones. This diversity makes it an exemplary case for testing the spatial-temporal modeling prowess of TPSSL.
Figure 6a shows the grid segmentation of the BJTaxi dataset, with the underlying map taken from Google Maps. Using the t-SNE algorithm, we projected the hidden embeddings of the two models (TPSSL and ConvLSTM) into 2D space. As shown in Figure 6b,c, we used the k-means clustering algorithm [37] to cluster the 2D embeddings. Furthermore, we visualized the clustering results in the grid space, as shown in Figure 6d,e.
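The clustering and mapping steps of this procedure can be sketched as follows. The code is a plain implementation of Lloyd's k-means algorithm on 2D points, followed by a reshaping of the cluster labels back onto the grid. It assumes that the 2D points have already been produced by t-SNE (not reproduced here) and that regions are stored in row-major order; the function names, the seed, and the example points are hypothetical.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of (x, y) points; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise with k randomly chosen points
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers

def to_grid(labels, n_rows, n_cols):
    """Arrange per-region labels on the grid, assuming row-major region order."""
    return [[labels[i * n_cols + j] for j in range(n_cols)] for i in range(n_rows)]

# Hypothetical 2D embeddings for a 2 x 3 grid of regions.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
grid_labels = to_grid(labels, n_rows=2, n_cols=3)
```

Coloring each cell of `grid_labels` by its cluster label yields a map of the kind shown in Figure 6d,e.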
We can see from Figure 6b,c that the hidden embeddings of TPSSL are more compact in space. At the same time, we can see from Figure 6d,e that TPSSL could accurately identify different types of areas, e.g., the traffic hub area marked in red and the suburbs marked in brown and green. Not all green and brown grids denote suburban areas; some represent central residential districts with lower taxi flow, like the Hutongs in Beijing. The lower taxi flow in the Hutongs can be attributed to their narrow alleyway configurations, which restrict vehicle access and discourage heavy traffic. This precision in classification demonstrates TPSSL's superior capability in discerning complex urban traffic structures compared to ConvLSTM. All these insights confirm that TPSSL excels at capturing the spatial heterogeneities inherent in urban traffic more effectively than ConvLSTM. The expanded case study validates TPSSL's enhanced performance and underscores its potential applicability in real-world urban planning and traffic management scenarios.

Conclusions
In this paper, we proposed a new self-supervised learning framework to improve the performance of traffic prediction models. TPSSL uses a spatial-temporal encoder and two self-supervised learning tasks to capture the dependencies and heterogeneities of traffic data, respectively. The generation of augmented data based on the adaptive data masking strategy can enhance the robustness and generalization of the model while providing more information for subsequent self-supervised tasks. The self-supervised paradigm based on soft clustering and positive-negative sample pairs can capture the spatial and temporal heterogeneities of traffic data separately without negatively affecting the model's predictive performance. We conducted experiments on four public datasets, and the results show that TPSSL achieved the best predictive performance on all datasets. We also conducted ablation and case studies, thus verifying the accuracy and effectiveness of TPSSL and providing further explanations for the model's outstanding performance.
In the future, we will explore incorporating self-supervised learning techniques into other traffic prediction models to further improve their predictive accuracy. Additionally, we aim to investigate the application of TPSSL to real-time traffic data from area traffic control sensors, such as induction loops. This will enable us to leverage current data for learning and prediction, thereby enhancing model validation with actual traffic conditions observed over extended periods. At the same time, we will also study how to apply TPSSL to spatial-temporal data prediction tasks in other fields.

Figure 1. Visualization of spatial and temporal heterogeneities in traffic flow data in Beijing. (a) Heatmap of inflow at 9 a.m. on 1 March 2015 (Sunday). (b) Changes in inflow for two selected areas, A and B, on 1 March 2015 (Sunday) and 2 March 2015 (Monday).

Figure 3. Spatial heterogeneity modeling in TPSSL. Different shapes of embeddings represent different prototypes. Blue embeddings are generated from the original data, and orange embeddings are generated from the augmented data. This module is implemented based on soft clustering, thus using the similarity of original and augmented embeddings in the clustering space to guide the learning of spatial heterogeneities.

Figure 4. Temporal heterogeneity modeling in TPSSL. This module is implemented based on contrastive learning, thus capturing changes in traffic patterns at different time steps through the congruence of the summary information of spatial regions with positive and negative samples.

Figure 5. Ablation study of TPSSL. We compared TPSSL with its three variants: TPSSL-SHM, TPSSL-THM, and TPSSL-RM. The results demonstrate that each submodule plays a significant role in the model's performance.

Figure 6. Visualization of the case studies of TPSSL and ConvLSTM. (a) is the grid segmentation of the BJTaxi dataset. (b,c) are the t-SNE projections of the hidden embeddings of TPSSL and ConvLSTM in the 2D space, respectively. (d,e) are the reconstructed visualizations of (b,c) in the grid space, respectively.
Note: # Regions represents the number of spatial regions in the dataset.# Bikes/Taxis represents the number of bikes or taxis.The symbol + indicates the actual number is greater than the displayed value.

Table 2. Predictive performance of each model on inflow for the four datasets.

Table 3. Predictive performance of each model on outflow for the four datasets. Bolded numbers represent the best results, and underlined numbers represent the second-best results.