Vehicle Re-Identification with Spatio-Temporal Model Leveraged by Pose View Embedding

Abstract: Vehicle re-identification (Re-ID) research has intensified as numerous advancements have been made along with the rapid development of person Re-ID. In this paper, we tackle the vehicle Re-ID problem in open scenarios. This research differs from early-stage studies that focused on a single view, and it faces more challenges due to view variations, illumination changes, occlusions, etc. Inspired by person Re-ID research, we propose leveraging pose view to enhance the discriminative power of visual features and utilizing keypoints to improve the accuracy of pose recognition. However, visual appearance information remains limited by changing surroundings and the extremely similar appearances of vehicles. To the best of our knowledge, few methods have exploited spatio-temporal information to supplement visual appearance information, and those that do neglect the influence of the driving direction. Considering the peculiar characteristics of vehicle movements, we observe that vehicles' poses in camera views, which indicate their directions, are closely related to spatio-temporal cues. Consequently, we design a two-branch framework for vehicle Re-ID, consisting of a Keypoint-based Pose Embedding Visual (KPEV) model and a Keypoint-based Pose-Guided Spatio-Temporal (KPGST) model. These models are integrated into the framework, and the results of KPEV and KPGST are fused based on a Bayesian network. Extensive experiments performed on the VeRi-776 and VehicleID datasets, both related to functional urban surveillance scenarios, demonstrate the competitive performance of our proposed approach.


Introduction
In recent years, there has been explosive growth in surveillance camera installations, which have become an indispensable part of urban public spaces with particular benefits for security services [1][2][3][4][5]. One significant function of urban surveillance systems is to assist security officers in locating suspicious objects: given a query image of an object, the objective is to search for the same target in videos recorded by various cameras. With the rapid development and recent advances in person re-identification (Re-ID) techniques, vehicle Re-ID, the task of matching identical vehicles captured by different cameras distributed over non-overlapping scenes, has attracted increasing attention. However, compared to person Re-ID [6][7][8][9], vehicle Re-ID is still a frontier topic, driven by the rapidly growing use of urban surveillance cameras over large-scale areas. It can be viewed as a specific image retrieval task in intelligent surveillance, which differs from traditional vehicle detection, recognition and categorization problems [10][11][12].
As object Re-ID technologies have developed, there have been considerable advances in vehicle Re-ID based on appearance features, such as texture, color and semantic properties, as well as orientation-invariant features inspired by person Re-ID research. Furthermore, a few studies have adopted license plate recognition. However, a vehicle license plate helps little in an open traffic environment where high-quality license plate images are difficult to obtain. Due to the impact of occlusions, illumination, poses indicating various orientations, and other factors, the task of vehicle Re-ID remains challenging in a large-scale road surveillance network. Moreover, the vehicle Re-ID task exhibits specific characteristics that differ from person Re-ID. Because vehicles are rigid objects, the general appearances of two different vehicles of the same color and type can be quite similar from the same viewpoint, while the same vehicle in different environments can have a dramatically varying visual appearance in practical surveillance. Figure 1 illustrates various scenarios and demonstrates that satisfactory performance cannot be obtained by relying only on appearance features. On the other hand, the distinctive characteristics of the vehicle Re-ID task not only cause problems but are also beneficial. The identities (IDs) of different vehicles driven by diverse individuals cannot be exactly the same. Additionally, driving behaviors are constrained by traffic rules, and vehicles move in certain directions that cannot change abruptly over continuous time. Consequently, vehicles' spatio-temporal information can be more effective than appearance alone for vehicle Re-ID. Accordingly, we propose to construct a two-step framework: (1) exploring appropriate appearance features for representing the vehicles, and (2) utilizing spatio-temporal clues to assist the retrieval process.
The rationale is that in the physical world, the same vehicle cannot be seen by two different non-overlapping cameras at the same time, and different vehicles have different movement behaviors. Nevertheless, utilizing the proposed framework in a practical urban surveillance system faces three significant challenges. Firstly, early-stage studies explored the vehicle Re-ID task from a specified view, whereas in open scenarios, appearance variations across different poses of a vehicle are far more pronounced than those of a person. Moreover, different vehicles with the same color, type and pose often look more alike than different poses of a given vehicle. Secondly, although spatio-temporal information helps optimize the performance of vehicle Re-ID, existing spatio-temporal models neglect the problem that the time intervals of different vehicles driving through the viewable areas of a given pair of cameras from different directions may be the same or very close. This problem, illustrated in Figure 2, causes the spatio-temporal model not to perform as expected. Finally, spatio-temporal information and appearance features are heterogeneous and cannot be measured directly. Existing vehicle Re-ID approaches have also noticed the vehicle Re-ID problem in a practical urban surveillance environment and made significant efforts to address the above challenges. Based on person Re-ID research, pose view has been explored to distinguish appearances [13][14][15][16]. Local regions represented as different poses, such as the front, sides and rear in vehicle images, can be regarded as supplementary to the general visual appearance in recognizing vehicles. In fact, we observe that in a given pose some keypoints are always invisible in vehicle images captured by surveillance cameras, while keypoints in other local regions can be seen.
Inspired by this observation, we exploit simplified keypoints of a certain side view to define the poses. In addition, we integrate the general visual appearance and each pose view's appearance features with keypoint embedding to represent the visual features of vehicles. As Figure 3 illustrates, the fixed shooting directions of cameras in the camera network differ, and so do a vehicle's poses captured by those cameras. By jointly considering vehicles' poses and camera shooting directions, the relative movement directions of vehicles can be estimated. For example, if a vehicle's pose captured by the camera is the front, we can infer that the vehicle is approaching the camera, and vice versa. The driving direction of the vehicle in the region can then be determined based on the locations of cameras and the relative direction. As vehicles' driving directions are generally stable in a real-world traffic environment, vehicles with the same driving directions caught by each camera are most likely to be the same. Therefore, we construct a spatio-temporal model guided by pose view, which introduces an estimation of driving directions to optimize the spatio-temporal model. Finally, since visual features and spatio-temporal information refer to different data modalities that are difficult to fuse directly, we map them into the probability space to optimize vehicle identification performance. In summary, we integrate pose view with keypoint embedding into both the visual appearance model and the spatio-temporal model.
In conclusion, the contributions of this study can be summarized as follows:
• We propose a two-branch framework to optimize the performance of the vehicle Re-ID task, where one branch is a visual model combining general visual appearance and pose view with keypoint embedding, and the other is a spatio-temporal model guided by pose view, so that poses are involved in both branches' models.
• To the best of our knowledge, we take the lead in introducing a spatio-temporal model guided by pose view, which optimizes existing spatio-temporal methods by filtering out cases of similar vehicles appearing in the same section of the road but driving in different directions. Moreover, we explain the mechanism by which pose view guides the spatio-temporal model.
• We design a fusion model combining the generated visual model and the pose-guided spatio-temporal model based on a Bayesian network. Extensive experiments on public vehicle Re-ID datasets demonstrate the effectiveness and superiority of the proposed approach. Our proposed model can also be easily extended to a practical urban surveillance environment.
Compared with our previous work [17], we reform the framework with keypoint-based pose view guiding both the visual model (KPEV) and the spatio-temporal model (KPGST), which significantly improves the accuracy of vehicle retrieval in open scenarios. Additionally, we explain the mechanism of how the pose view guides the spatio-temporal model and design a fusion model based on a Bayesian network, which makes the process more reasonable. To evaluate the effectiveness of our framework, we conduct extensive experiments on two large-scale vehicle Re-ID datasets, VeRi-776 [18] and VehicleID [19]. Comprehensive experiments demonstrate that the framework not only improves the accuracy but also remains efficient.

Person Re-ID
Person Re-ID has been widely studied in the computer vision field and has various important applications. Features based on convolutional neural networks (CNNs) and deep metric learning have led to significant progress on the person Re-ID problem [20][21][22][23][24][25][26][27], e.g., under occlusions, clothing variations, and domain gaps. Pose-based approaches [28][29][30][31] built on deep neural networks have been applied to person Re-ID tasks and extensively explored. Furthermore, contextual information, such as spatio-temporal information, object locations, and the topology of cameras, has been widely exploited in multi-camera person Re-ID tasks [32][33][34].

Vehicle Re-ID
Vehicle Re-ID in large-scale urban surveillance is a frontier area that has attracted growing interest in recent years. Feris et al. [35] proposed a vehicle detection and retrieval framework in which vehicles were first classified by type, size and color. Recent works on the vehicle Re-ID task mainly concentrate on making breakthroughs based on deep neural networks. Some vehicle Re-ID research relies on a specific view and addresses the single-view case. Liu et al. [19] focused on precise vehicle retrieval while considering metric mapping to cluster positive samples. Yan et al. [36] exploited multi-grain ranking constraints and further optimized the ranking with a likelihood loss function. Other works began to tackle the vehicle Re-ID task on the road network. With global and partial multi-regional distance-based feature learning, Chen et al. [37] designed a three-branch network to learn coarse-to-fine vehicle information. Zheng et al. [38] developed a two-stage architecture aimed at progressively learning robust vehicle representations.
Liu et al. [18,39] released a high-quality vehicle Re-ID dataset, VeRi-776, with 776 vehicle IDs captured by 20 cameras in a large-scale scenario, which contributed significantly to vehicle Re-ID research. The researchers explored an appearance-based model integrating low-level and high-level semantic features based on CNNs, and additionally utilized license plate information. Lou et al. [40] collected a new dataset captured by a large surveillance system containing 174 cameras covering a large urban district. It is the first vehicle Re-ID dataset collected under unconstrained conditions in a real city-scale surveillance camera network, covering a huge diversity of viewpoints, resolutions, illuminations, camera sources, weather conditions, occlusions, backgrounds, and vehicle models in the wild. However, it is difficult to correctly match vehicles among a large number of candidates in a real-world traffic environment using visual features alone, owing to variations in image background, viewpoint and illumination. Inspired by existing methods for the person Re-ID problem, pose-based, multi-view and spatio-temporal-relation-based approaches are taken into consideration when tackling the vehicle Re-ID task.

Multi-View and Contextual Models
Multi-view and contextual models have been widely exploited in multi-camera systems and have recently been adopted in vehicle Re-ID research. Zapletal et al. [41] and Sochor et al. [42] proposed using a 3D structure to align different vehicles' faces in order to extract accurate features. Zhou et al. [43] proposed exploiting the Spatially Concatenated ConvNet and a CNN-LSTM (CNN-long short-term memory) bidirectional loop to learn transformations across different viewpoints of vehicles, applying the approach to the Toy Car Re-ID dataset. Additionally, the locations of keypoints are helpful, as learned features can be well aligned by such keypoints. Wang et al. [44] used local region features of different orientations based on 20 keypoint locations and combined such features to learn more accurate representations; they also utilized spatio-temporal information to optimize performance. Shen et al. [45], Liu et al. [46] and Li et al. [47] also exploited spatio-temporal relations. Although they achieved good performances, they overlooked the influence of different driving directions on spatio-temporal models. Regarding vehicle-orientation-camera as a triplet and reformulating shape similarity as orientation and camera Re-ID, Zhu et al. [14] utilized camera and orientation similarity as a penalty to obtain the final similarity after training vehicle, orientation and camera Re-ID models, respectively.

Problem Formulation
The task of vehicle Re-ID is to retrieve all vehicles in a camera network that have the same ID as the query vehicle. For the clarity of the problem definition, some notations used in the vehicle Re-ID problem are described as follows. In an urban surveillance system, we define a camera network C composed of M + 1 cameras $C_0, C_1, \ldots, C_M$ with non-overlapping fields of view. The $i$-th vehicle captured at camera $C_n$ is represented by $O_n^i$, and the moment it is captured is represented by $t_n^i$. For vehicle Re-ID, we expect to find the vehicles that have the same ID as a query vehicle in different camera views. The match probability between the probe $O_n^i$ and the candidate $O_m^j$ can be expressed as $P\big(\mathrm{id}(O_n^i) = \mathrm{id}(O_m^j)\big)$.

Model Overview
The architecture of the model is illustrated in Figure 4, which contains the following main steps:
Step 1: Estimating the vehicle's pose category with the Keypoint-based Pose Classifier (KPC) (Section 3.3).
Step 2: Extracting pose-invariant visual features with the Keypoint-based Pose Embedding Visual (KPEV) model. The visual match probability of a probe and a candidate is assessed by feature distance (Section 3.4).
Step 3: Estimating the vehicle's driving direction based on pose and guiding the spatio-temporal model. In this step, the relationship between the vehicle's driving direction and pose category is inferred from the camera topology and shooting directions. The spatio-temporal model guided by the vehicle's driving direction is called the Pose-Guided Spatio-Temporal (PGST) model, and the spatio-temporal match probability is inferred by the PGST model (Section 3.5).
Step 4: Joint metric of the visual features and spatio-temporal features. In this step, we assume that the vehicle's visual occurrence probability and spatio-temporal occurrence probability are independent of each other. The vehicle's final match probability is calculated by combining the visual probability with the pose-view-leveraged spatio-temporal probability based on Bayes' formula. We rank probabilities in descending order and select the top ranks (Section 3.6).
In the following sections, we will describe in detail the design of each key component of the model and analyze those components.

Keypoint-Based Pose Classifier
As shown in step 1 of Figure 4, a Keypoint-based Pose Classifier (KPC) is proposed to extract the vehicle's pose features and estimate the pose category. Inspired by OIFE [44], it contains four modules: a pose keypoint regressor, a global feature extractor, a pose feature extractor and a pose classifier module. The architecture of KPC is illustrated in Figure 5. The pose keypoint regressor estimates the vehicle's 20 keypoint locations. The annotation of these keypoints is shown in Table 1. These keypoints are chosen as principal vehicle components, e.g., the wheels, the lamps, the logos, the rear-view mirrors, and the license plates. The architecture of the pose keypoint regressor is a stacked hourglass network [48], which is usually used to generate response maps of human joints for human pose estimation. The pose keypoint regressor takes a vehicle image as input and yields 20 response maps of the vehicle's keypoints. As shown in Figure 6, poses are classified into four categories: front, rear, left side, and right side. The resulting 20 response maps are assigned to four clusters according to the visible keypoints of each pose category: $C_1 = [5, 6, 7, 8, 9, 10, 13, 14]$, $C_2 = [15, 16, 17, 18, 19, 20]$, $C_3 = [1, 2, 6, 8, 11, 14, 15, 17]$, and $C_4 = [3, 4, 5, 7, 12, 13, 16, 18]$. The final output region masks are computed as the summation of all the response maps belonging to each cluster: $R_i = \sum_{k \in C_i} M_k$, where $i = 1, 2, 3, 4$ and $M_k$ denotes the $k$-th response map. The global feature extractor is adopted to obtain the global features. The network architecture consists of two ResNet blocks [49]. The input images are resized to 256 × 256 and convolved by the two ResNet blocks. The size of the output feature map $f_0^1$ is 64 × 64, and the number of output channels is 512. Each region mask $R_i$ has the same size as the feature map $f_0^1$.
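As an illustration, the cluster-to-mask computation can be sketched in NumPy as follows. This is a hypothetical sketch, not the released code: the function and variable names are ours, and the 1-based keypoint IDs from the clusters above are converted to 0-based array indices.

```python
import numpy as np

# Hypothetical sketch (names are ours): build the four pose-region masks by
# summing the keypoint response maps belonging to each cluster.
CLUSTERS = {
    1: [5, 6, 7, 8, 9, 10, 13, 14],
    2: [15, 16, 17, 18, 19, 20],
    3: [1, 2, 6, 8, 11, 14, 15, 17],
    4: [3, 4, 5, 7, 12, 13, 16, 18],
}

def region_masks(response_maps: np.ndarray) -> np.ndarray:
    """response_maps: (20, H, W) keypoint response maps -> (4, H, W) masks."""
    assert response_maps.shape[0] == 20
    return np.stack([
        response_maps[[k - 1 for k in CLUSTERS[i]]].sum(axis=0)
        for i in (1, 2, 3, 4)
    ])
```

Each resulting mask highlights the image regions where the keypoints of one pose category respond, matching the size of the 64 × 64 feature map.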
For each local branch, $f_0^1$ is element-wise multiplied by the region mask $R_i$ to obtain each preliminary pose-local feature map $f_i^1$. To optimize the feature distances between inter-class and intra-class samples, a multiple loss function is designed for metric learning. It is formulated as $L = (1 - \omega) L_{softmax} + \omega L_{triplet}$, where the hyperparameter $\omega$ balances the two types of loss. Feature $f_n^i$ is the pose-invariant feature of vehicle $O_n^i$. The visual-based match probability of $O_n^i$ and $O_m^j$ can be regarded as a similarity between features. Specifically, the similarity between $f_n^i$ and $f_m^j$ is measured by the cosine distance, defined as $P_v(O_n^i, O_m^j) = \frac{f_n^i \cdot f_m^j}{\|f_n^i\| \, \|f_m^j\|}$.
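The two visual-metric ingredients described above, the ω-balanced loss and the cosine-distance match probability, can be sketched as follows. This is a minimal illustration with names of our own choosing; in training, the two loss terms are of course computed from network outputs rather than passed in as scalars.

```python
import numpy as np

def multiple_loss(l_softmax: float, l_triplet: float, omega: float) -> float:
    # Hypothetical scalar form of the balanced loss: omega trades off the
    # triplet term against the softmax term (omega = 0 -> pure softmax loss).
    return (1.0 - omega) * l_softmax + omega * l_triplet

def visual_match_prob(f_probe: np.ndarray, f_cand: np.ndarray) -> float:
    # Cosine similarity between two pose-invariant feature vectors.
    return float(np.dot(f_probe, f_cand)
                 / (np.linalg.norm(f_probe) * np.linalg.norm(f_cand)))
```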

Pose-Guided Spatio-Temporal Model
In the urban surveillance system, camera shooting angles and camera topology are readily obtained. By combining a vehicle's captured poses and the camera shooting angles, we can estimate the vehicle's relative driving direction. In a vehicle's complete trajectory, the vehicle must have the same relative driving direction at every location along the trajectory. In addition, the driving direction of the vehicle establishes the order in the spatio-temporal sequence. Hence, we develop an algorithm for the pose-guided spatio-temporal model. It contains three steps: estimating the vehicle's relative driving direction by combining the vehicle's captured poses and camera shooting directions, constructing the relative geographic relationship between camera pairs, and computing the spatio-temporal match probability.

Estimating the Vehicle's Relative Driving Direction
A vehicle's poses and pose confidence can be estimated by the pose classifier. Pose confidence is denoted by α and is the maximum output probability score of the classifier. As shown in Figure 8a, we set the relative direction to be along the tangent of the road.
We define the north end of the road as the upstream direction. Figure 8b illustrates the mapping between pose and driving direction: different driving directions result in different captured poses. Angle θ denotes the angle between the camera's shooting direction and the vehicle's driving direction. Based on observation, we examine the relationship between angle θ and poses, as Table 2 illustrates. The mapping between driving directions and a vehicle's poses is constructed based on the above assumption. If a vehicle's pose is known, the vehicle's relative driving direction $D_n^i$ can be estimated by combining the camera's shooting direction and the vehicle's pose. Specifically, a vehicle driving from the north to the south of the road is represented as $D_n^i = 1$, and otherwise as $D_n^i = -1$.
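The pose-to-direction inference can be illustrated with a simplified sketch. The full θ-to-pose mapping of Table 2 is not reproduced here; we assume only the front/rear rule stated earlier (a front pose means the vehicle approaches the camera), and the boolean camera-orientation flag is our own simplification of the shooting direction.

```python
def driving_direction(pose: str, camera_faces_north: bool) -> int:
    """Return D = 1 for north-to-south travel and D = -1 otherwise.

    Simplified assumption: a 'front' pose means the vehicle approaches the
    camera, and a 'rear' pose means it moves away. Side poses would need
    the full Table 2 mapping and are rejected in this sketch.
    """
    if pose not in ("front", "rear"):
        raise ValueError("side poses require the full pose/angle mapping")
    approaching = pose == "front"
    if camera_faces_north:
        # A camera looking north films approaching vehicles moving southward.
        return 1 if approaching else -1
    return -1 if approaching else 1
```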

Constructing the Relative Geographic Relationship between Camera Pairs
To clearly describe the geographical relationship between camera pairs, we establish a 2D coordinate system for the camera network topology. As shown in Figure 9, the relative position between a camera pair $(C_n, C_m)$ can be represented by a vector. For example, the position of $C_n$ is represented by $G_n = (x_n, y_n)$; the vector $\vec{G}_{n,m}$ denotes the geographical position of $C_m$ relative to that of $C_n$, formulated as $\vec{G}_{n,m} = G_m - G_n$, and its positive and negative signs represent the direction.
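The vector construction is straightforward; a small sketch (function and argument names are ours):

```python
def relative_position(g_n, g_m):
    """G_{n,m} = G_m - G_n: the position of camera C_m relative to C_n,
    with (x, y) coordinates in the camera-network coordinate system."""
    return (g_m[0] - g_n[0], g_m[1] - g_n[1])
```

The signs of the two components encode the direction of $C_m$ as seen from $C_n$ along each axis.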

Computing the Pose-Guided Spatio-Temporal Match Probability
To explore the effect of spatio-temporal information on vehicle Re-ID, we select vehicles with 576 IDs from 20 camera pairs in an urban surveillance network and analyze all positive samples, including the transition times between each camera pair. The histogram in Figure 10 shows the time intervals of transitions between camera pairs. We fit several probability curves to the observed trend and find that the lognormal-like probability model [44] performs best. Based on this assumption, the spatio-temporal match probability $P_{st}$ is modeled by a lognormal distribution over the transition interval. The direction and appearance time of the vehicle can be assumed to be independent, so the pose-guided spatio-temporal match probability $P_{pgst}$ can be formulated as $P_{pgst} = P_d \cdot P_{st}(\Delta t)$, where $P_d$ denotes the probability that the probe and the candidate share a consistent driving direction.
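A minimal sketch of the lognormal spatio-temporal probability and its direction gating follows. This is a simplification under stated assumptions: the hard same-direction gate and all names are ours, and the actual model also involves the pose confidence α rather than a binary decision.

```python
import math

def lognorm_pdf(dt: float, mu: float, sigma: float) -> float:
    # Lognormal density of the transition interval dt (> 0); mu and sigma
    # are the mean and standard deviation of ln(dt) over positive training
    # samples of the camera pair.
    if dt <= 0:
        return 0.0
    return math.exp(-(math.log(dt) - mu) ** 2 / (2 * sigma ** 2)) / (
        dt * sigma * math.sqrt(2 * math.pi))

def pgst_prob(dt: float, mu: float, sigma: float, same_direction: bool) -> float:
    # Direction-gated spatio-temporal probability: a candidate whose
    # estimated driving direction disagrees with the probe's is suppressed
    # entirely in this simplified sketch.
    return lognorm_pdf(dt, mu, sigma) if same_direction else 0.0
```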

Joint Metric of Visual Probability and Spatio-Temporal Probability
The data type of visual information differs from that of spatio-temporal information. In the vehicle Re-ID problem, a vehicle's visual feature distribution and spatio-temporal feature distribution are independent of each other. Hence, we can integrate these two types of data based on a Bayesian probability model. By Bayes' formula, the match probability of the probe $O_n^i$ and the candidate $O_m^j$ can be formulated as $P = \frac{P_v \cdot P_{pgst}}{P(t_n^i, t_m^j, D_n^i, D_m^j)}$. The prior probability $P(t_n^i, t_m^j, D_n^i, D_m^j)$ can be assumed to be equal for all data; the final match probability is therefore proportional to $P_v \cdot P_{pgst}$.
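Under the equal-prior assumption, fusion reduces to multiplying the two probabilities and ranking candidates by the product; a minimal sketch (names are ours):

```python
def fused_score(p_visual: float, p_st: float) -> float:
    # With independent cues and an equal prior for every candidate, the
    # posterior match probability is proportional to this product.
    return p_visual * p_st

def rank_candidates(p_visual_list, p_st_list):
    # Return candidate indices sorted by descending fused score.
    scores = [fused_score(pv, ps) for pv, ps in zip(p_visual_list, p_st_list)]
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

Note how a visually strong candidate (0.9) can be out-ranked by one with a much better spatio-temporal score, which is exactly the behavior the fusion is meant to provide.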

Dataset and Evaluation Metric
To evaluate the effectiveness of our vehicle Re-ID framework, we mainly conduct experiments on the VeRi-776 [18] dataset, which contains spatio-temporal information on vehicles' movements and the camera topology. Experiments on the VeRi-776 dataset show that the proposed approach effectively improves vehicle Re-ID performance without considering license plates. We follow the evaluation protocol proposed in [45]. The R-1, R-5, and R-20 accuracies as well as the mean Average Precision (mAP) are adopted to evaluate the accuracy of the methods.
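For reference, the rank-k (CMC) and average-precision metrics can be computed as in the following sketch; this is a generic implementation of the standard definitions, not the evaluation code of [45].

```python
import numpy as np

def rank_k_hit(distances: np.ndarray, gallery_ids: np.ndarray,
               query_id: int, k: int) -> bool:
    """True if a correct match appears among the k nearest gallery entries."""
    order = np.argsort(distances)
    return bool(np.any(gallery_ids[order[:k]] == query_id))

def average_precision(distances: np.ndarray, gallery_ids: np.ndarray,
                      query_id: int) -> float:
    """AP for one query; mAP is the mean of AP over all queries."""
    order = np.argsort(distances)
    matches = (gallery_ids[order] == query_id).astype(float)
    if matches.sum() == 0:
        return 0.0
    precision_at = np.cumsum(matches) / (np.arange(matches.size) + 1)
    return float((precision_at * matches).sum() / matches.sum())
```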

Effect of the Keypoint Regressor for Estimating Vehicles' Poses
The primary contribution that needs to be investigated is the effectiveness of extracting the vehicle's pose features. Inspired by OIFE [44], the keypoint regressor [48] is an orientation-based regional proposal module. The keypoints of a vehicle in a visible orientation can be located by the keypoint regressor. The OIFE work [44] manually specifies 20 keypoint coordinates and orientation-class annotations (front, rear, right side, left side) for each vehicle in the VeRi-776 dataset. We adopt these annotations to train and test the keypoint regressor. All input images are resized to 256 × 256 pixels. The ground-truth heat map of size 64 × 64 consists of a 2D Gaussian (with a standard deviation of 1 px) centered on the keypoint location. The Mean Squared Error (MSE) loss is computed to compare the predicted heat map with the ground-truth heat map. The RMSprop [51] optimizer with a learning rate of 1.5 × 10^{-4} is used to train the network. The final prediction of the network is the maximally activated location of the heat map for a given keypoint. Parameter r_0 denotes the allowed error threshold of the distance between the ground truth and a predicted value. As Table 3 shows, our trained keypoint regressor attains 93.87% accuracy (r_0 = 5) on the testing images and exceeds the result of OIFE [44] by 1.37%.
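The ground-truth heat map construction described above (a 2D Gaussian with σ = 1 px on a 64 × 64 grid) can be sketched as:

```python
import numpy as np

def gaussian_heatmap(h: int, w: int, cx: float, cy: float,
                     sigma: float = 1.0) -> np.ndarray:
    """Ground-truth heat map: a 2D Gaussian centered on keypoint (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```

At test time, the predicted keypoint location is read off as the argmax of the corresponding predicted map, mirroring how the target is constructed here.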

Performance of the Vehicle Pose Classifier
The proposed KPC method classifies vehicle poses into four categories (front, rear, left side and right side). The AlexNet model [50] is regarded as our basic pose classifier. We use two ResNet [49] blocks to obtain the vehicle's global features. Hence, we also compare our KPC approach with ResNet18.
Our KPC training strategy is as follows. The learning rate is 2.5 × 10^{-3}. The loss function is the softmax loss. The size of a minibatch is set to 64, and the models are trained for 80 epochs. We first fix the parameters of the trained pose keypoint regressor in KPC and load the AlexNet blocks and the ResNet blocks into KPC. The AlexNet is pre-trained on ImageNet. The ResNet block is pre-trained on VeRi-776 and VehicleID by classifying each vehicle's ID. Note that there are only two poses (front and rear) for each vehicle in VehicleID, so we replace the four branches with two branches and output two categories. Table 4 displays the sizes of the training and testing sets of the VeRi-776 and VehicleID datasets. Table 5 shows the performance of different methods on the VeRi-776 and VehicleID datasets. We use precision and recall to evaluate each category separately; these two indicators are calculated following Equation (7). The performance of the pose classifier reflects the confidence of the pose category output and directly affects the performance of the pose-guided spatio-temporal model. The results show that as the number of vehicle pose categories increases, the benefit of the keypoint-based pose classifier gradually becomes more obvious, and the average accuracy for each category is as high as 90.0% or more.
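The per-category precision and recall follow the standard definitions; a short sketch (names are ours):

```python
def precision_recall(preds, labels, category):
    """Per-category precision and recall over predicted/true pose labels."""
    tp = sum(p == category and l == category for p, l in zip(preds, labels))
    fp = sum(p == category and l != category for p, l in zip(preds, labels))
    fn = sum(p != category and l == category for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```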

Effect of Extracting Vehicle's Visual Features
We select the final feature fusion layer in KPEV as the pose-invariant feature representation for the Re-ID task. The ResNet18 model [49] is regarded as our base network structure. To optimize the KPEV model, we vary the hyperparameter ω, which adjusts the ratio between the softmax loss and the triplet loss. If ω equals 0, the final loss degenerates into the softmax loss; in the other extreme of ω equal to 1, the final loss becomes the triplet loss. Through multiple sets of comparative experiments on parameter ω, we find that KPEV performs best on the VeRi-776 dataset when ω = 0.25 and on the VehicleID dataset when ω = 0.5. The learning rates for the VeRi-776 and VehicleID datasets start from 0.001 and 0.01, respectively, and decay every 150 epochs. The size of a minibatch is set to 64, and the models are trained for 500 epochs.
Our KPEV training strategy includes two steps. (1) We pre-train the base ResNet18 network with the softmax loss. (2) After the parameters of the trained KPC have been fixed, we train the backbone of the feature fusion layer in KPEV and fine-tune the parameters of the ID feature extractor. Specifically, for the VehicleID dataset, if all 13,161 IDs in the training set are used, it is difficult to achieve convergence of the KPEV model with the softmax loss. Hence, we randomly extract vehicle images of 1000 IDs from the entire training set and use them as a training subset for the KPEV model. Table 6 presents performance comparisons between KPEV and the state-of-the-art. On the VeRi-776 dataset, KPEV attains approximately 9.94% mAP and 3.42% R-1 gains over our baseline, ResNet18. For the different test sizes of the VehicleID dataset, it achieves improvements of 4.98%, 4.60% and 3.89% in R-1 accuracy over ResNet18, and 3.19%, 2.53%, and 1.61% over LCSR [52].
Table 6. Comparison (%) of vehicle Re-ID performance of visual models on the VehicleID dataset. The best results are shown in bold.

XVGAN [54] and VAMI [13] both identify a vehicle's ID under different poses by generating characteristics of the vehicle from different perspectives. Compared with them, our KPEV extracts pose-invariant features; the mAP results of KPEV on the VeRi-776 dataset achieve 30.10% and 4.62% improvements over them, respectively. Moreover, the R-1 results of KPEV on the VehicleID dataset achieve approximately 19.00% and 9.00% improvements, respectively. Our visual framework is inspired by OIFE [44], but differently, we use four detailed branches to mine vehicle pose features and adopt the triplet loss to fuse local features and global features. The result of KPEV on the VeRi-776 dataset attains 6.75% mAP gains over OIFE [44].

• StdNorm ST: The probability of a vehicle's transition interval Δt observed by a camera pair is assumed to follow the standard normal distribution and can be computed using Equation (8). Parameters σ_{n,m} and μ_{n,m} are the variance and the mean value (based on the training set) of the transition interval observed by camera pair (n, m).
• LogNorm ST: Inspired by OIFE [44], we estimate the spatio-temporal probability of the transition interval Δt using the lognormal distribution. The respective probability is calculated using Equation (9). The two parameters μ_{n,m} and σ_{n,m} are the mean and standard deviation of the variable's natural logarithm based on the training set.
• STHist: Inspired by the work [61] on spatio-temporal person Re-ID, we apply it to vehicle Re-ID, using a histogram with the Parzen-window approach to estimate the spatio-temporal distribution. The spatio-temporal probability is computed by Equation (10). All the positive training samples of transition intervals between the camera pair (n, m) are placed into several bins. The width of each bin is denoted by d. Variable $N_{n,m}^k$ represents the total number of vehicles in the k-th bin. If Δt is a transition interval to be estimated, $k = \lfloor \Delta t / d \rfloor$ is calculated first, then $N_{n,m}^k$ is looked up, and finally we obtain $P_{st}$.

We apply our PGST algorithm to the above spatio-temporal models. All spatio-temporal methods are applied to the VeRi-776 dataset. Table 7 presents a comparison of the performance of spatio-temporal models for Re-ID. It shows that the LogNorm spatio-temporal method performs best, which verifies that the lognormal distribution is more suitable for the spatio-temporal pattern of vehicles. Figure 11a shows that the PGST method generally attains more than 10% mAP gains over the spatio-temporal model without pose guidance. Figure 11b shows a comparison of CMC curves. We observe that PGST has clear advantages according to the R-1 to R-50 match rates. The result demonstrates that the pose-guided spatio-temporal model can effectively reduce Re-ID errors.
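The STHist estimate amounts to a normalized bin count; a sketch under our own naming assumptions (per-pair training intervals and bin width d passed in explicitly):

```python
import numpy as np

def st_hist_prob(dt: float, train_intervals, d: float) -> float:
    """Histogram (Parzen-window style) estimate of the transition-interval
    density for one camera pair: the count N_k in dt's bin, normalized by
    the total sample count and the bin width d."""
    train = np.asarray(train_intervals, dtype=float)
    k = int(dt // d)                       # bin index of the query interval
    n_k = int(np.sum((train // d).astype(int) == k))
    return n_k / (train.size * d)
```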

Comparisons of Fusion Models with Visual and Spatio-Temporal Features
To extensively investigate the performance of the fusion model with visual and spatio-temporal features on the Re-ID task, we perform detailed experiments with various fusion methods. Our fusion of visual and spatio-temporal features is built on the Bayesian merging method. Table 8 presents a comparison of all of our fusion models on the VeRi-776 dataset. Figure 12 shows the CMC curves of the fusion models on the VeRi-776 dataset. Compared to fusion models without pose-guided spatio-temporal features, fusion models with KPGST features attain over 4.00% improvements in mAP. Moreover, we observe that the KPGST model with LogNorm PGST has the best performance and can be deemed the optimal model for the entire algorithm. Table 9 shows a comparison of our KPGST approach and state-of-the-art fusion methods. OIFE + STR [44] proposes a lognormal spatio-temporal model for Re-ID. The VAMI + STR [13] and Siamese CNN + STR [45] models utilize the product of the time difference and physical distance as the final spatio-temporal matching score. The approaches of [13,44] both fuse models by adding weighted spatio-temporal matching scores and visual matching scores. The Siamese CNN + PathLSTM [45] method uses a multi-layer perceptron (MLP) with two layers to process visual-spatio-temporal path proposals and subsequently matches candidate paths with an LSTM network. The KPGST method obtains 4.36%, 6.79% and 21.98% R-1 gains over VAMI + STR [13], Siamese CNN + PathLSTM [45] and OIFE + STR [44], respectively. These significant improvements can be ascribed to two aspects. Firstly, an effective visual extractor of vehicle features is used that outperforms the methods of [44,45]. Moreover, we apply the pose-guided spatio-temporal algorithm to make the best use of spatio-temporal constraints. The improvements suggest that KPGST achieves better performance by considering spatio-temporal relationships in the vehicle Re-ID task.
Table 9. Comparisons (%) of our KPGST with state-of-the-art Re-ID methods on the VeRi-776 dataset. The best results are shown in bold.

Conclusions
Investigating practical scenarios, we consider the specific characteristics of vehicles and propose a simple but effective model, the KPGST model, to solve the vehicle Re-ID problem in urban surveillance systems. The main idea in this paper is to exploit a two-branch framework that includes the fusion of global and pose-view appearance features and spatio-temporal constraints, both guided by vehicles' poses. While merging pose features into global appearance features, we exploit keypoints to enhance the accuracy of pose recognition. Furthermore, we explore the mechanism by which poses guide the spatio-temporal model based on a Bayesian network. Unlike existing state-of-the-art methods that only utilize appearance features, or methods that analyze the vehicle Re-ID problem with the help of spatio-temporal information, we optimize the appearance feature framework with both global and regional features and develop a more accurate spatio-temporal constraint model. A performance analysis shows that our method for the practical vehicle Re-ID problem is reasonable and insightful.