Rethink Motion Information for Occluded Person Re-Identification

Person re-identification aims to identify the same pedestrians captured by various cameras from different viewpoints in multiple scenarios. Occlusion is the toughest problem for practical applications. In video-based ReID tasks, motion information can be easily obtained from sampled frames and provides discriminative human part representations. However, most motion-based methodologies are designed for video frames and are not suitable for processing a single static image. In this paper, we propose a Motion-Aware Fusion (MAF) network, which aims to acquire motion information from static images in order to improve the performance of ReID tasks. Specifically, a visual adapter is introduced to enable visual feature extraction from either image or video data. We design a motion consistency task to guide the motion-aware transformer to learn representative human-part motion information, which greatly improves the quality of the features learned for occluded pedestrians. Extensive experiments on popular holistic, occluded, and video datasets demonstrate the effectiveness of our proposed method. The method outperforms state-of-the-art approaches by 1.5% in mean average precision (mAP) and 1.2% in rank-1 accuracy on the challenging Occluded-REID dataset. It also surpasses other methods on the MARS dataset, with an improvement of 0.2% in mAP and 0.1% in rank-1 accuracy.


Introduction
Person re-identification (ReID) aims to identify the same pedestrians captured by a variety of cameras from different viewpoints and in various scenarios [1][2][3][4][5]. ReID has a wide range of real applications and can have a significant impact on a variety of industries. For example, by identifying and tracking individuals in public places, ReID can enhance public safety and potentially reduce crime rates. In retail, ReID can be used for customer traffic counts and behavioral analysis. By analyzing the recognition characteristics of pedestrians, traffic management authorities can better monitor the use of sidewalks and provide a safer traffic environment for pedestrians. In addition, ReID can be applied to traffic flow monitoring and congestion prediction.
The ReID system can be utilized in both image-based and video-based environments. Image-based techniques [6][7][8] aim to link still images, such as a single frame from a camera, of individuals captured by a network of non-overlapping cameras. On the other hand, video-based ReID [9][10][11] involves matching the input video tracklets of an individual against a collection of tracklet representations. Compared to image-based methods, video-based ReID benefits from the motion and spatio-temporal information provided by video data, allowing the system to identify a person's body silhouette and distinctive human parts more effectively. Many video-based methods [12][13][14] also incorporate motion information to reduce the impact of background objects and address the issue of occlusion. It is worth noting that incorporating motion information from still images can further improve the handling of occlusions in challenging scenarios. While recent deep learning methods have produced satisfying retrieval performances in the main pedestrian regions, the problem of occlusions caused by diverse obstacles remains a challenge in real-world applications.
Compared to the general person re-identification (ReID) problem, the current challenge of occluded person ReID is two-fold. Firstly, interference from unknown objects can cause significant fluctuations in human features, leading to difficulties in feature extraction. To address this issue, previous methods [8,15] have employed targeted occlusion data enhancement or introduced a more robust pre-trained model. However, these approaches tend to focus more on the person's appearance features and memorize specific occlusion types, resulting in a lack of robustness in the extracted human-part features. In real-world scenarios, occlusion types are often unpredictable and randomly located, making specific occlusion data augmentation limited in terms of its generalizability across the entire domain. Secondly, exploring more representative person features is crucial for the occluded person ReID framework. Multi-pedestrian occlusions [16,17] are particularly challenging compared to other types of occlusions. In these scenarios, the model's ability to distinguish the features of different pedestrians becomes even more important [18]. However, relying solely on the external features of a pedestrian's appearance is not sufficient. Implicit features are also needed for effective ReID.
To address these challenges, we propose three frameworks that deeply examine the implicit motion information and explicit visual characteristics of pedestrians. The first framework focuses on the dynamic processing of image or video feature inputs, providing a unified architecture for both image-based and video-based ReID tasks. We introduce a visual adapter with a set of learnable visual queries to integrate visual features from the visual encoder and reduce computational complexity in the cross-attention between motion and visual information. The adapter treats a single image as a single-frame video, enabling our framework to handle both image-based and video-based ReID tasks. The second framework seeks to obtain implicit motion information through a motion-aware transformer. By passing the integrated visual features through the transformer, we establish another set of learnable queries to obtain per-segment human motion representations. We also design a motion consistency task to extract motion information from still images and continuously refine the motion representations, without relying on any pre-trained models. The final framework fuses visual features and motion information using a standard vision transformer architecture. The fusion encoder learns the relationship between notable human parts and the per-segment human motion representations. To evaluate the effectiveness of our approach, we conduct experiments on both image-based (including occluded and holistic) and video-based ReID benchmarks. As shown in Figure 1, our proposed method achieves competitive results on both image- and video-based ReID tasks.
The main contributions of this paper can be summarized as follows:
• A novel architecture is proposed to simultaneously deal with video-based and image-based ReID tasks.
• We propose a motion-aware transformer and a motion consistency task to extract human motion information, which not only provides discriminative human representations but also alleviates the confusion caused by similarly dressed pedestrians.
• Extensive experiments on several public video-based and image-based ReID datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.

Related Work
In this section, we first review the development of person ReID methods. As shown in Table 1, we provide a brief summary of some of the classic ReID methods. Then, we review motion-supervised segmentation methods, which inspire us to better utilize the motion information.

Table 1. Brief summary of classic ReID methods.
Method | Venue | Data | Metrics | Summary
[12] | CVPR 2022 | Video | Rank-1, mAP | Use temporal relations to obtain human features
MEVID [22] | WACV 2023 | Video | mAP, CMC | Propose a multi-view video ReID dataset

Image Person Re-Identification
Image person re-identification, including holistic and occluded parts, aims to deal with retrieving a person of interest in other camera views. With the generation of large-scale datasets and the development of deep learning methods, recent works that utilize transformers to obtain refined human features have achieved the best performance on holistic person ReID tasks. Li et al. [7] proposed a Part-Aware Transformer to deal with occlusion situations, which utilizes the transformer encoder-decoder architecture with learnable part prototypes for occluded person Re-ID and achieves a competitive performance. Wang et al. [8] proposed a feature diffusion model, including a non-pedestrian occlusion augmentation strategy, an occlusion erasing module, and a feature diffusion module, to help the model distinguish diverse occluded situations and precisely perceive target pedestrians. Tan et al. [6] designed a dynamic prototype mask for occluded person Re-ID, which does not rely on extra pre-trained networks, but uses a hierarchical mask generator to enrich the holistic prototype, and simultaneously retains the information from the whole image and achieves automatic alignment. Although those methods achieve satisfactory performance in the holistic or occluded ReID benchmarks, some occlusion-based models will still be affected by unpredictable occlusion and are largely influenced by pedestrians' appearance.

Video Person Re-Identification
Compared to image data, the additional temporal relations in video effectively alleviate many issues, such as occlusion and motion blur. It is also easier to acquire motion information and optical flow. One mainstream approach utilizes temporal attention to measure the importance of each frame and discards the low-quality frames. The other mainstream approach is mutual enhancement, which uses self-attention or GCNs to better model the temporal relations and enhance the dependencies between frames. For instance, Yin et al. [23] proposed a motion information-based network, which utilizes an RNN-mask network to obtain motion information and introduces a pre-trained keypoint detector to obtain four local part features. Kiran et al. [21] proposed a mutual attention network to acquire spatio-temporal video features for ReID using optical flow. Bai et al. [12] designed a salient-to-broad module to leverage the temporal relations from the perspective of difference amplification and obtained more comprehensive and informative representations. However, these methods still have some drawbacks, such as a limited ability to aggregate features over a global range and a high computational cost.

Motion-Guide Segmentation
Siarohin et al. [24] presented a self-supervised deep learning method for co-part segmentation that leverages motion information to obtain human segments. Similar to the previous work [25][26][27], it relies on a reconstruction objective to disentangle the object's semantic and appearance representations. These methods heavily rely on the reconstruction model to complete the entire training stage, which also has high computational requirements. However, the use of motion information in previous works has inspired us. In particular, this work is inspired by the fact that motion information can be used to distinguish human parts and provide latent motion tokens.
In contrast to the above methods, our method can simultaneously deal with image- and video-based ReID tasks with lower computational complexity. At the same time, the motion-aware transformer with the motion consistency task enables it to obtain motion information from a still image.

Methodology
In this section, we introduce our proposed Motion-Aware ReID method in detail. As shown in the left part of Figure 2, it mainly consists of four modules: a visual encoder, a visual adapter, a motion-aware module, and a fusion encoder. Here, we briefly give a general introduction to our ReID process. First, we extract the vision features from the full image context with the visual encoder module. Next, the visual adapter module is devised to integrate these features into visual tokens. Taking visual tokens as input, a motion-aware module is carefully designed and trained to further acquire motion tokens from these visual tokens. Then, we jointly merge visual tokens, motion tokens, and a hybrid class token to feed the fusion encoder module. Finally, we utilize this hybrid class token in the ReID task head to make pedestrian identifications.

Visual Encoder and Visual Adapter
As shown in Figure 2b, the visual encoder module serves as a backbone to extract visual features. As traditional convolution-based neural networks cannot robustly extract features of the target person under different background regions with diverse characteristics [7], we adopt a pre-trained ViT-B/16 as our default visual encoder.
During the training process, the images in each batch are paired, so that no identity appears with only a single image. This sampling strategy prepares the batch for calculating the motion consistency loss in Section 3.2.
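The pairing requirement can be illustrated with a small identity-balanced sampler. This is only a sketch under stated assumptions: the function name `sample_batch`, the dictionary layout `images_by_id`, and the toy file names are hypothetical and do not come from the paper; the only property it demonstrates is that every identity in a batch contributes more than one image.

```python
import random


def sample_batch(images_by_id, num_ids, per_id, rng):
    """Pick num_ids identities and per_id images for each of them.

    With per_id >= 2, every identity in the batch has at least one partner
    image, which is what a source/target pairing needs. (Hypothetical helper.)
    """
    if per_id < 2:
        raise ValueError("per_id must be at least 2 to form pairs")
    eligible = [pid for pid, imgs in images_by_id.items() if len(imgs) >= per_id]
    chosen = rng.sample(eligible, num_ids)
    batch = []
    for pid in chosen:
        for img in rng.sample(images_by_id[pid], per_id):
            batch.append((pid, img))
    return batch


# Toy data: 10 identities with 6 images each (file names are made up).
toy = {pid: [f"id{pid}_view{v}.jpg" for v in range(6)] for pid in range(10)}
batch = sample_batch(toy, num_ids=8, per_id=4, rng=random.Random(0))
print(len(batch))  # 32
```

The call above mirrors the "8 identities, 4 views each" setting reported later for image data, but the sampler itself is generic.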
Similar to ViT, the visual encoder module reshapes the $T$ input frames of 2D images $X \in \mathbb{R}^{T \times H \times W \times C}$ into a sequence of flattened image patches $X_p \in \mathbb{R}^{T \times N \times (P^2 \cdot C)}$. Here, $(H, W, C)$ represent the height, width, and number of channels of the original image, respectively. Each frame yields $N = H \cdot W / P^2$ image patches, and the dimension of each image patch is $P^2 \cdot C$. To keep the latent vector size $D$ constant through all layers in this module, we also apply a linear projection to transform these patches from $P^2 \cdot C$ dimensions to $D$ dimensions. Then, we add the positional embedding to obtain a sequence of extracted visual features $F_N \in \mathbb{R}^{T \times N \times D}$.
The next module, the visual adapter, is in charge of integrating these extracted visual features $F_N$ into visual tokens $VT_L \in \mathbb{R}^{L \times D}$. Note that $L$ denotes the number of output visual tokens. Typically, we set $L$ to be smaller than $N$ to further reduce the subsequent computational complexity. In previous video ReID methods, the input features are $f \in \mathbb{R}^{T \times N \times D}$ and the computational complexity is $O(T \times N \times D)$. When we utilize the visual adapter, the computational complexity becomes $O(L \times D)$, where $L \ll T \times N$, and thus the complexity of our method can be significantly reduced.
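As a quick arithmetic check of the token counts discussed above, the sketch below evaluates $N = H \cdot W / P^2$ for 256 × 128 frames with a patch size of 16 (the values given in the implementation details) and compares the $T \times N$ encoder tokens of a 4-frame clip with a fixed number of adapter tokens. The helper names and the value `L = 64` are illustrative assumptions, not settings from the paper.

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping P x P patches in one H x W frame: N = H*W / P^2."""
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def tokens_before_adapter(frames: int, height: int, width: int, patch: int) -> int:
    """Tokens produced by the visual encoder for a T-frame clip: T * N."""
    return frames * num_patches(height, width, patch)


n = num_patches(256, 128, 16)                          # 16 * 8 = 128 patches per frame
clip_tokens = tokens_before_adapter(4, 256, 128, 16)   # 4 * 128 = 512 tokens
L = 64                                                 # hypothetical adapter length
print(n, clip_tokens, clip_tokens / L)                 # 128 512 8.0
```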
The key point of the visual adapter module is its capability to process both video and image inputs. By changing the number of input frames $T$, we can adapt to different types of input data (i.e., image, video, and hybrid). As shown in Figure 2b, the visual adapter module associates each input frame with the yellow "position" embeddings, according to its order in the frame sequence. Then, the resulting sequence of embedding vectors serves as the input keys and values of the following multi-head self-attention block. Finally, the adapter module generates a fixed number of $L$ visual tokens, and the $L$ adapter queries are trained to learn how to integrate the visual features [28,29].
As several images ($T > 1$) of the same person contain richer spatial-temporal information than a single image input [30], this integration process also helps retain adequate spatial-temporal information and improves the robustness of the training process [24].
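To make the fixed-length behaviour of the adapter concrete, here is a minimal sketch of learnable queries attending over encoder features. It is a single-head simplification written for illustration: it omits the frame position embeddings, linear projections, and multiple heads, uses random vectors in place of trained queries, and the names `adapter_attention` and `random_vectors` are hypothetical. What it shows is that the number of output tokens equals the number of queries $L$, whether the input is one frame or several.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adapter_attention(queries, features):
    """Single-head attention: each query attends over all T*N feature vectors.

    queries:  L vectors of length D (stand-ins for the learnable adapter queries)
    features: T*N vectors of length D (encoder outputs, used as keys and values)
    returns:  L vectors of length D (the visual tokens)
    """
    dim = len(queries[0])
    tokens = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in features]
        weights = softmax(scores)
        tokens.append([sum(w * v[d] for w, v in zip(weights, features)) for d in range(dim)])
    return tokens


def random_vectors(count, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
D, L, N = 8, 4, 6                      # toy sizes, not the paper's settings
queries = random_vectors(L, D, rng)
image_tokens = adapter_attention(queries, random_vectors(1 * N, D, rng))  # T = 1
video_tokens = adapter_attention(queries, random_vectors(4 * N, D, rng))  # T = 4
print(len(image_tokens), len(video_tokens))  # 4 4
```

Because the output length depends only on the queries, downstream modules see the same token count for image and video inputs.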

Motion-Aware Transformer
In Figure 2c, the motion-aware transformer module takes the aforementioned visual tokens $VT_L$ from the visual adapter module to generate corresponding motion tokens $MT_L$. It consists of a standard cross-attention layer [31], a multi-head self-attention layer, and a feed-forward network layer. The cross-attention layer aims to extract foreground human body parts from $VT_L$ with the learnable queries. Next, the self-attention blocks further incorporate the local context of human parts into separate part prototypes. The feed-forward network (FFN) part, consisting of two fully connected layers, introduces non-linearity and produces the attention output $MT_L$.
To learn valid and effective queries, we design a motion consistency task and a corresponding loss function for the training process of the motion-aware transformer. To make the main steps of the motion consistency task easier to explain, we choose two images from different views as input in Figure 3. Note that the motion tokens are generated independently of the consistency task, and consequently, the proposed motion-aware transformer module can be used for a single image at inference time. For the two images (i.e., source and target), MLP-1 utilizes the motion tokens $MT_L$ to obtain the segmentation results of human parts $M_{S,part}$ and $M_{T,part}$. The output of MLP-1 ($M_{part} \in \mathbb{R}^{N \times H_D \times W_D}$) represents different probability distributions for the $N$ different human parts. Here, the number of human parts in the motion-aware transformer is determined by the length $N$ of the learnable queries; $H_D$ and $W_D$ denote the height and width of the segments. In Figure 3, we set $N = 10$ and obtain ten probability distributions for the ten human parts. Formally, let $M^k_{part}$ be the $k$-th segment ($1 \le k \le N$) of human parts $M_{part}$. In the probability distribution $M^k_{part}$, we define the highest-probability point as the key point $p^k_{part}$. The key point represents the location associated with the $k$-th segment of human parts. Hence, for both the source image and the target image, we can extract ten key points of different human parts according to the segmentation results of MLP-1.
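The key-point definition above (the highest-probability location of each part map) can be sketched in a few lines. The function names and the toy 2 × 2 maps are illustrative assumptions; the `(row, col)` coordinate convention and the first-maximum tie-breaking rule are choices made for this sketch, not details given in the paper.

```python
def key_point(prob_map):
    """Return (row, col) of the highest-probability location in a 2D map.

    Ties are resolved by keeping the first maximum in row-major order.
    """
    best = (0, 0)
    best_val = prob_map[0][0]
    for r, row in enumerate(prob_map):
        for c, val in enumerate(row):
            if val > best_val:
                best_val, best = val, (r, c)
    return best


def key_points(part_maps):
    """One key point per part map, in part order."""
    return [key_point(m) for m in part_maps]


# Two toy part maps standing in for two of the N segments.
part_maps = [
    [[0.1, 0.6], [0.2, 0.1]],
    [[0.05, 0.05], [0.1, 0.8]],
]
print(key_points(part_maps))  # [(0, 1), (1, 1)]
```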
We design MLP-2 to describe the motion of all points in the segments of human parts $M_{part}$. The output of MLP-2 represents an affine transformation [24], which is used to approximate the optical flow $F$ of every segment of human parts. Here, we assume that the motion of each segment follows an affine model [24]; this implies that there exist $A \in \mathbb{R}^{2 \times 2}$ and $\beta \in \mathbb{R}^{2}$ such that:
$$F(z) = A z + \beta \qquad (1)$$
Here, $z$ is a location point of $M^k_{part}$. After the output of MLP-2 explicitly approximates the affine parameters $A$ and $\beta$ for the given source image ($S$) and target image ($T$), we can obtain $F$ for each part from the selected key points $p^k_{S,part} \in \mathbb{R}^{2}$ and $p^k_{T,part} \in \mathbb{R}^{2}$ of the source image and the target image, respectively, together with $A^k_S$ and $A^k_T$, the predicted motion descriptions from MLP-2. In other words, the optical flow $F$ can be approximated by an affine transformation corresponding to each segmented part.
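One way the per-part flow could be assembled from the key points and the two predicted matrices is sketched below. The specific form used here, $F(z) = A_S A_T^{-1} (z - p_T) + p_S$, is an assumption borrowed from the first-order affine formulation common in motion-supervised work such as [24]; it is consistent with an affine map $A z + \beta$ with $A = A_S A_T^{-1}$ and $\beta = p_S - A p_T$, but it is not confirmed as the exact equation used in this paper. All function names are hypothetical.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(m, n):
    """Product of two 2x2 matrices."""
    return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def matvec2(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def affine_flow(z, p_s, p_t, a_s, a_t):
    """Map a target-image location z to a source-image location.

    Assumed form (not taken from the paper): F(z) = A (z - p_t) + p_s,
    with A = a_s @ inverse(a_t).
    """
    a = matmul2(a_s, inv2(a_t))
    offset = [z[0] - p_t[0], z[1] - p_t[1]]
    moved = matvec2(a, offset)
    return [moved[0] + p_s[0], moved[1] + p_s[1]]


identity = [[1.0, 0.0], [0.0, 1.0]]
# With identity matrices the flow is a pure translation from p_t to p_s.
print(affine_flow([5, 7], p_s=[2, 3], p_t=[4, 1], a_s=identity, a_t=identity))  # [3.0, 9.0]
```

Under this form the target key point always maps onto the source key point, regardless of the matrices, which is the anchoring behaviour one would expect from a key-point-centred affine model.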
Given the source image and the calculated optical flow field $F^k$ of each segment $M^k_{part}$, we can approximate each segment of human parts in the target image by:
$$\hat{M}^k = M^k_{S,part} \otimes F^k \qquad (3)$$
where $\otimes$ denotes the element-wise product. Taking the shoulder part of the same person in Figure 3 as an example, the segmentation map corresponding to the shoulder part in the source image is multiplied with the corresponding $F^k$ to obtain the prediction $\hat{M}^k$ for the segmentation map of the shoulder part after the motion. Our goal is to make this prediction similar to the segmentation map $M^k_{T,part}$ corresponding to the shoulder part in the target image.
A popular method to compare the similarity of two probability distributions is the KL divergence. The predicted and target distributions should be as consistent as possible, so a KL loss is used here to enforce motion consistency. Finally, the motion consistency loss $L_{mc}$ combines two terms. The first, a summation over the human part segments, is the KL loss between the predicted and target segmentation maps. The second, $L_{eq}$, is the equivariance constraint loss [27]. The equivariance constraint loss is calculated with thin-plate spline deformations, which have been widely used in unsupervised key point detection [25,32] to ensure the robustness and stability of the training process. We adopt $L_{eq}$ mainly to stabilize our training process and keep the human part segmentation discriminative. The motion consistency loss $L_{mc}$ constantly optimizes the learnable queries. Eventually, the implicit motion information, the human part-segment information, and their relationships are integrated into the learnable queries, and we obtain more accurate motion tokens from the motion-aware transformer module.
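A minimal sketch of a per-part KL consistency term is given below. Several details are assumptions made for illustration and are not specified in the text: the direction of the divergence (target relative to prediction), the normalization of each map to sum to one, equal weighting of the two terms, and the treatment of the equivariance loss as a precomputed number passed in by the caller (the thin-plate spline computation is not implemented here). Maps are flattened to plain lists.

```python
import math


def normalize(values):
    """Scale non-negative values so they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as flat lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def motion_consistency(target_maps, predicted_maps, l_eq=0.0):
    """Sum of per-part KL terms plus an equivariance term supplied by the caller."""
    kl_sum = sum(
        kl_divergence(normalize(t), normalize(p))
        for t, p in zip(target_maps, predicted_maps)
    )
    return kl_sum + l_eq


# Identical maps give zero loss; mismatched maps give a positive loss.
same = motion_consistency([[1, 1], [2, 2]], [[1, 1], [2, 2]])
diff = motion_consistency([[0.5, 0.5]], [[0.9, 0.1]])
print(same, diff > 0)  # 0.0 True
```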

Fusion Encoder
The fusion encoder module mainly outputs an additional token $h_{m\_cls}$ for the final re-identification tasks. We apply learnable linear projections over the visual tokens $VT_L$ and the motion tokens $MT_L$. Then, we concatenate them with an additional token $M_{CLS}$ as the input for the fusion encoder transformer. The additional token $M_{CLS}$ allows cross-attention between the projected vision and motion representations and thereby fuses the visual tokens and motion tokens. For retrieval tasks, the final hidden state output $h_{m\_cls}$ is used as the final human feature representation.

Training and Inference
In the training process, we first pre-train the motion-aware transformer (MAT) module with video datasets in order to obtain stable and effective queries in the MAT module. The existence of the visual adapter module enables the training process to support both video datasets and image datasets. Then, we activate the normal training process using the benchmark datasets for comparison. Our proposed method is trained in an end-to-end manner. The objective function consists of the two following parts:
$$L = \lambda_g L_g + \lambda_{mc} L_{mc} \qquad (5)$$
where $\lambda_g$ and $\lambda_{mc}$ are scaling factors. For the final ReID tasks, we calculate the cross-entropy loss and the triplet loss [33] for identification with the ground truth as follows:
$$L_g = L_c(h_{m\_cls}) + L_t(h_{m\_cls}) \qquad (6)$$
where $h_{m\_cls}$ is the output of the fusion encoder, $L_c$ represents the cross-entropy loss, and $L_t$ represents the triplet loss.
In the inference stage, we only use the $h_{m\_cls}$ token from the last layer of the fusion encoder as the representative information of each image for the subsequent retrieval tasks.

Datasets and Evaluation Metrics
Market-1501 [34] is the holistic benchmark used in our experiments.
Partial-iLIDS [35] contains a total of 238 images from 119 people captured by multiple cameras, and their occluded regions are manually cropped.
Partial REID [36] is a specially designed partial person ReID benchmark. It involves 600 images from 60 people. We take the occluded query set and the holistic gallery set for the experiments.
Occluded REID [37] contains 2000 images belonging to 200 identities. Each identity has five full-body person images and five occluded person images with different viewpoints and different types of severe occlusions.
MARS [38] is collected by 6 near-simultaneous cameras. It contains 1261 different pedestrians, each captured by at least 2 cameras.
LS-VID [39] utilizes a 15-camera network and selects 4 days for data recording. It contains 14,943 sequences of 3772 pedestrians, and the average sequence length is 200 frames.
iLIDS-VID [40] is extracted from the iLIDS MCTS dataset with 600 videos of 300 identities. Due to the limitations of the iLIDS MCTS dataset, the occlusion in iLIDS-VID is very severe.
PRID-2011 [41] has 385 videos from camera A and 749 videos from camera B, where only 200 people appear in both cameras at the same time.
Evaluation metrics. We adopt Cumulative Matching Characteristic (CMC) curves and mean average precision (mAP) to evaluate the quality of different Re-ID tasks.
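For readers less familiar with these metrics, the sketch below computes rank-k accuracy (points on the CMC curve) and mAP from ranked match lists. It is a simplified illustration: standard ReID evaluation protocols typically also exclude gallery samples from the same camera as the query, which is omitted here, and the function names and toy data are hypothetical.

```python
def rank_k_hit(matches, k):
    """1 if a correct gallery item appears in the top-k of a ranked list, else 0."""
    return int(any(matches[:k]))


def average_precision(matches):
    """AP for one query; matches[i] is True if the i-th ranked gallery item is correct."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(all_matches, ks=(1, 5)):
    """Return ({k: rank-k accuracy}, mAP) over a list of ranked match lists."""
    cmc = {k: sum(rank_k_hit(m, k) for m in all_matches) / len(all_matches) for k in ks}
    m_ap = sum(average_precision(m) for m in all_matches) / len(all_matches)
    return cmc, m_ap


ranked = [
    [True, False, True],    # AP = (1/1 + 2/3) / 2
    [False, True, False],   # AP = 1/2
]
cmc, m_ap = evaluate(ranked, ks=(1, 2))
print(cmc, round(m_ap, 4))  # {1: 0.5, 2: 1.0} 0.6667
```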

Implementation Details
Our model training is divided into two phases. In the first phase, we pre-train the model using the training sets of all the image and video datasets mentioned above, where an image is considered as a video with $T = 1$. Due to the presence of the visual adapter, our network can accept inputs from both images and videos, and images and videos are sampled at random. In the second phase, fine-tuning is performed on each dataset using its training set.
Images and video frames are all resized to 256 × 128. The patch size is set to 16. For video data, every batch has 32 clips, which correspond to 8 identities. The numbers of layers of the motion-aware transformer and the fusion encoder are both set to 6. For image data, every batch contains 8 identities, each with 4 different perspectives. The network is trained for 120 epochs and optimized by the Adam optimizer with a weight decay of 0.005. We also use random flipping and random erasing with a probability of 0.5 for data augmentation. The number of learnable queries in the motion-aware transformer is set to 10. $\lambda_g$ and $\lambda_{mc}$ in Equation (5) are set to 1 and 0.5, respectively. In the test stage, we use all frames in units of 4-frame clips and obtain the final video feature by averaging all the resulting $h_{m\_cls}$ tokens, and the cosine similarity is used for retrieval. Additionally, the motion consistency task is only activated during the training stage. For a single image, we directly assign the input image as the source image, and the target image is randomly selected from the different perspectives of the same ID in the same batch. For video frames (e.g., a 4-frame video clip), we assign the first frame as the source image, and the target image is randomly selected from the three remaining frames of the same video.
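The test-time procedure described above (average the per-clip features, then rank the gallery by cosine similarity) can be sketched as follows. The two-dimensional toy vectors and the function names are illustrative assumptions; real features would be the $h_{m\_cls}$ outputs of the fusion encoder.

```python
import math


def mean_feature(clip_features):
    """Average per-clip feature vectors into one video-level feature."""
    count = len(clip_features)
    return [sum(col) / count for col in zip(*clip_features)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    sims = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)


video_feature = mean_feature([[1.0, 0.0], [0.8, 0.2]])   # roughly [0.9, 0.1]
gallery = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(rank_gallery(video_feature, gallery))  # [1, 0, 2]
```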

Comparison with State-of-the-Art Methods
Comparisons on Video-based Datasets. Thanks to the design of the visual adapter, our method can deal with video-based ReID tasks. As shown in Table 2, our approach achieves performance comparable to state-of-the-art methods. In particular, on the MARS and LS-VID datasets, we achieve the highest Rank-1 and mAP. The results on the video-based benchmarks demonstrate that our visual adapter enables us to distill the spatio-temporal features and the dependencies among video frame features.
Comparisons on Holistic Datasets. The results on Market-1501 are shown in Table 3. Our method achieves the best performance compared to other state-of-the-art methods, including other part-based and global-based methods. This demonstrates that our approach can obtain more representative features for holistic pedestrians.
Comparisons on Occlusion Datasets. The results on Occluded-REID are shown in Table 4, and the results on the Partial datasets are shown in Table 5. Compared with state-of-the-art methods, we achieve the highest Rank-1 and mAP on both the Occluded-REID and Partial datasets. In particular, on the Partial datasets, we achieve 89.2%/93.2% and 77.5%/89.6% on Rank-1/Rank-3 for Partial-REID and Partial-iLIDS, respectively.
Occluded ReID tasks need more refined part features to perform pedestrian retrieval. Through these experiments, our method has demonstrated a strong ability to cope with occlusions. The reason can be summed up in two ways. On the one hand, the motion consistency task guides the model to extract human part features and aggregate the motion information using learnable queries. On the other hand, the fusion encoder combines the motion information with the original vision features, which helps the model generate discriminative representations.

Ablation Studies
Experiments on motion consistency task. When we remove the motion consistency task from the motion-aware transformer, we directly utilize the learnable queries in the motion-aware transformer to extract human part features. As we can see in Table 6 #1, #2, #7, and #8, without this task our approach performs poorly on occluded ReID tasks and yields similar results on holistic ReID tasks. This indicates that, without the motion consistency task, our method pays more attention to modeling the pedestrian appearance features, which are greatly influenced by any occlusion.
Additionally, unlike previous motion-based methods [21,23] that directly used motion information, such as [24], in an explicit way, our method first obtains implicit motion information through a motion-aware transformer and a motion consistency task. Then, a fusion encoder is utilized to fuse the visual features and the implicit motion tokens in an explicit way. Here, we have demonstrated that the introduction of motion information is the main factor in dealing with occluded situations.
Experiments on visual adapter and motion-aware transformer. To gauge the importance of the visual adapter (VA) and the motion-aware transformer (MAT), we add two ablation experiments and summarize the results in Table 6. As shown in Table 6 #3, our method can handle video-based ReID tasks only when the visual adapter is introduced. As shown in Table 6 #6 and #8, removing VA affects the accuracy of the video-based task (MARS, 1.3% drop) more than that of the image-based tasks. This is because the learnable latent queries of the VA module can aggregate and learn spatio-temporal information from multiple video frames, while this has no effect on still images. The MAT module utilizes motion information to focus more on human part segments. As shown in Table 6 #5 and #8, the absence of MAT largely reduces the accuracy on O-REID (3.2% drop) and P-REID (6.1% drop), as the attention of the human-part segmentation is affected by background noise.
Experiments on the length of learnable queries in visual adapter. We conduct supplementary experiments (Table 7) on different latent query lengths of the visual adapter module. In contrast to previous methods [12], the query length can be adjusted to fit different datasets. If we increase the query length to 256, the accuracy on the video task exceeds that of the method in [12]. However, if the query length is too long (512), the model over-fits, which reduces its accuracy.
Experiments on the length of learnable queries in motion-aware transformer. We change the length of the learnable queries in the motion-aware transformer, since the length of the queries corresponds to the degree of refinement of human parts. In other words, the length of the queries determines how many parts the human body will be split into. We evaluate different query lengths on Market-1501, Occluded-REID, Partial-REID, and MARS. As shown in Table 8, the ablation experiments show the same trend on all datasets: as the length increases, Rank-1 first increases, reaches a peak at 10, and then decreases.
On the occluded and partial datasets, when the length is set greater than 10, our method experiences a sudden decline in Rank-1. This shows that our method is sensitive to the length of the learnable queries. When our model needs to focus on more human parts, it makes many unnecessary distinctions, which directly increases the difficulty of modeling the human body. In contrast, on the holistic ReID task, the change in Rank-1 is not particularly dramatic. This suggests two things. Firstly, occluded situations rely more on refined human part features, whereas holistic situations rely more on global features. A suitable query length helps the model learn the body parts in sufficient detail. Secondly, the fusion encoder is able to successfully aggregate local and global features into the hidden state $h_{m\_cls}$. Reviewing previous human part-based methods, their part-aware masks may not benefit from suppressing the disturbance and further grouping human part features, which may be the main reason why some part-aware masks maintain high confidence scores in the background area. It is worth mentioning that their design achieves great performance but still pays more attention to human appearance features, which shows limitations in occluded situations. In contrast to these methods, by introducing motion information from still images via a motion consistency task, we focus not only on the pedestrian appearance features but also on their representative motion information, which makes our model more robust.
Experiments on pre-training on video datasets. As shown in Table 9, if we do not utilize the video datasets to pre-train the motion-aware transformer, the performance of our method is affected to some extent, with a 0.6-0.7% decrease in Rank-1 and an approximate 0.5% decrease in mAP. A very important part of our work is how to extract the motion information of human body parts from static images. The motivation for pre-training the motion-aware transformer is to give the model some prior knowledge of motion information for the subsequent training on static images, which makes the training process smoother and more stable. Note that the visual adapter is the main factor that helps us acquire the motion prior knowledge from the video data. However, this does not mean that our model cannot be trained from scratch. Without pre-training on the video datasets, our model is still able to achieve competitive performance; it simply requires more training epochs.

Visualization
Overview. Our proposed ReID framework is based on motion information, and after sufficient training the network focuses its attention on pedestrians, as shown in Figure 4a. Figure 4b shows the effect of the motion consistency task, whose main purpose is to learn the features of each part of the human body. The process resembles semantic segmentation: occlusions such as cars and license plates are treated as background noise by the motion consistency task. Figure 5a shows the image-based and video-based attention maps from the visual adapter module, which sense the approximate extent of the human body and suppress background noise. Figure 5b shows the attention map of the latent token introduced in the motion consistency task: the latent token attends to the detailed features of each human body part while also suppressing background noise to some extent. Therefore, the proposed method can effectively reduce the background noise caused by objects that may change position, such as cars and license plates.
The visualization of human part segmentation. We visualize the human part segmentation from the motion consistency task in Figure 4. Figure 4a shows that, after adding motion information, the attention map is more focused on the human body and is strongly resistant to occlusions. In Figure 4b, when the pedestrian rides a bike in various postures or is occluded by a car, the model is still able to distinguish the person from the occluder. It is worth noting that all of these improvements are made possible by introducing motion information.
The visualization of attention map and the CMC curve. Taking one still image or several video frames as input, the visual adapter module first extracts coarse global features (Figure 5a) of the approximate motion area. Then, the MAT module learns refined motion information for each part of the human body from the coarse global features and provides more distinguishable human part features (Figure 5b). As shown in Figure 6, introducing the motion information reduces the sensitivity to human appearance. Finally, as shown in Figure 7, we provide the CMC curve for each dataset.
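For readers unfamiliar with the CMC curve, the following is a minimal sketch of how such a curve can be computed from a query-by-gallery distance matrix. It is an illustration only, not the evaluation code used in this paper: the function name `cmc_curve` and the toy distances are hypothetical, and the common ReID step of discarding gallery samples captured by the same camera as the query is omitted for brevity.

```python
import numpy as np


def cmc_curve(dist, query_ids, gallery_ids, max_rank=5):
    """Return CMC values for ranks 1..max_rank.

    dist[i, j] is the distance between query i and gallery item j
    (smaller means more similar). Hypothetical helper for illustration.
    """
    dist = np.asarray(dist, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    max_rank = min(max_rank, dist.shape[1])

    hits = np.zeros(max_rank)
    valid_queries = 0
    for q in range(dist.shape[0]):
        # Sort the gallery from most to least similar for this query.
        order = np.argsort(dist[q], kind="stable")
        matches = gallery_ids[order] == query_ids[q]
        if not matches.any():
            continue  # this identity never appears in the gallery
        first_hit = int(np.argmax(matches))  # position of the first correct match
        if first_hit < max_rank:
            # A match at position k counts as a hit for every rank >= k + 1.
            hits[first_hit:] += 1
        valid_queries += 1

    if valid_queries == 0:
        raise ValueError("no query identity appears in the gallery")
    return hits / valid_queries


# Toy example: three queries against a three-item gallery.
toy_dist = [[0.1, 0.9, 0.5],
            [0.8, 0.2, 0.3],
            [0.7, 0.6, 0.4]]
toy_cmc = cmc_curve(toy_dist, query_ids=[1, 2, 3], gallery_ids=[1, 3, 2], max_rank=3)
```

The first entry of the returned array is the Rank-1 accuracy, and plotting the array against the rank index gives the CMC curve.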

Limitations
Our method is sensitive to the length of the learnable queries in the motion-aware transformer, as shown in Section 4.4. Because our approach processes video input through a visual adapter, it entails a certain degree of information loss. Meanwhile, there is still room for improvement in the extraction of spatio-temporal information. Additionally, achieving the best performance requires first pre-training the motion-aware transformer on video datasets.

Conclusions and Future Work
In this paper, we rethink the motion information for person re-identification and propose a motion-aware fusion network. In contrast to previous methods, our method can handle image data and video data simultaneously by introducing a visual adapter. It also obtains implicit motion information not only from video data but also from still image data. Moreover, the implicit motion information is fed to a fusion encoder that deeply models the relationship between visual features and the corresponding motion information. As a result, our method achieves new state-of-the-art results on both holistic and occluded ReID datasets. Furthermore, it shows competitive performance on video-based datasets. In the future, considering the limitations of our proposed method, we aim to develop a novel way to use spatio-temporal information effectively without the pre-training stage of the motion-aware transformer module.

Figure 1. The MAF model handles ReID tasks on both image and video inputs and achieves competitive results on both.

Figure 2. An overview of the proposed motion-aware fusion network for person re-identification. In (a), the overall network consists of two branches, one for extracting visual features from video and image inputs and the other for mining implicit motion information. In (b), we show the detail of our visual adapter. In (c), we show the detail of the motion-aware transformer (MAT) module.

Figure 3. The explanation of the motion consistency task. For a more concise presentation of the motion consistency task, we omitted two input images from different views here.

Figure 4. (a) The attention map of MAF: the attention is more focused on the human body region and less affected by background noise. (b) Intermediate results of the segmentation branch in the motion consistency task: the motion consistency task helps MAF better localize more parts of the human body and thus obtain more fine-grained local features.

Figure 5. (a) The attention map in the visual adapter module, showing that the visual adapter can roughly localize the area of the human body. (b) The visualization of the latent token in the motion consistency task: building on the visual adapter, the latent token refines the human body region by focusing on different body parts separately, thus providing local features of the human body.

Figure 6. The difference in retrieval with and without motion information. Introducing motion information effectively alleviates the problem of mismatching similarly dressed pedestrians.

Table 1. A brief summary of classic ReID methods.
contains 12,936 training images of 751 persons, 19,732 query images, and 3,368 gallery images of 750 persons captured from 6 cameras. It is a holistic dataset.

Table 2. Performance comparison with state-of-the-art methods on MARS, LS-VID, iLiDS-VID, and PRID-2011 datasets. Our method achieves competitive performance on all four datasets.

Table 3. Performance comparison with state-of-the-art methods on Market-1501.

Table 4. Performance comparison with state-of-the-art methods on the Occluded-REID dataset. Our method achieves the best performance.

Table 5. Performance comparison with state-of-the-art methods on the Partial-REID and Partial-iLIDS datasets. Our method achieves the best performance on both datasets.

Table 6. Ablations of key modules of MAF on Market-1501, Occluded-REID, Partial-REID, and MARS. Here, O-REID is the abbreviation of Occluded-REID and P-REID is the abbreviation of Partial-REID. MCT refers to the motion consistency task. VA refers to the visual adapter. MAT refers to the motion-aware transformer.

Table 7. Experiments on the latent query length in the visual adapter on MARS, LS-VID, iLiDS-VID, and PRID-2011. Gray denotes the previous SOTA.

Table 8. Experiments on the length of the learnable queries in the motion-aware transformer on Market-1501, Occluded-REID, Partial-REID, and MARS. Here, O-REID is the abbreviation of Occluded-REID and P-REID is the abbreviation of Partial-REID.

Table 9. Experiments with and without pre-training on video datasets. Here, O-REID is the abbreviation of Occluded-REID and P-REID is the abbreviation of Partial-REID. Gray marks the affected values.