Article

Unsupervised Person Re-Identification via Deep Attribute Learning

1 School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710129, China
2 China Railway First Survey and Design Institute Group Co., Ltd., Xi’an 710043, China
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(8), 371; https://doi.org/10.3390/fi17080371
Submission received: 24 June 2025 / Revised: 4 August 2025 / Accepted: 6 August 2025 / Published: 15 August 2025
(This article belongs to the Special Issue Advances in Deep Learning and Next-Generation Internet Technologies)

Abstract

Driven by growing public security demands and the advancement of intelligent surveillance systems, person re-identification (ReID) has emerged as a prominent research focus in computer vision. The task remains challenging because of its high sensitivity to variations in visual appearance caused by factors such as body pose and camera parameters. Although deep learning-based methods have achieved marked progress in ReID, the high cost of annotation remains a challenge that cannot be overlooked. To address this, we propose an unsupervised attribute learning framework that eliminates the need for costly manual annotations while maintaining high accuracy. The framework learns mid-level human attributes (such as clothing type and gender) that are robust to substantial visual appearance variations and can hence boost accuracy while requiring only a small amount of labeled attribute data. To realize our framework, we present a part-based convolutional neural network (CNN) architecture that consists of two components: one learns whole-image and body-attribute representations at a global level, and the other learns upper- and lower-body image and attribute representations at a local level. The proposed architecture is trained to learn attribute-semantic and identity-discriminative feature representations simultaneously. For model learning, we first train our part-based network in a supervised manner on a labeled attribute dataset. We then apply an unsupervised clustering method to assign pseudo-labels to unlabeled images in a target dataset using the trained network. To improve feature compatibility, we introduce an attribute consistency scheme for unsupervised domain adaptation on this unlabeled target data. During training on the target dataset, we alternately perform three steps: extracting features with the updated model, assigning pseudo-labels to unlabeled images, and fine-tuning the model. Through a unified framework that fuses complementary attribute-label and identity-label information, our approach achieves considerable improvements of 10.6% and 3.91% mAP on the Market-1501→DukeMTMC-ReID and DukeMTMC-ReID→Market-1501 unsupervised domain adaptation tasks, respectively.

1. Introduction

Recently, with the rapid development of surveillance devices and growing demands for public security, cameras deployed in public areas such as streets and school campuses generate vast amounts of video footage daily. This video data is used to analyze pedestrians’ activity patterns and behavioral characteristics and is applied in fields such as object detection and multi-camera object tracking. Person re-identification (ReID) seeks to establish correspondences between the underlying identities of person bounding boxes captured across non-overlapping camera views. This capability has garnered increasing research interest due to its considerable role in computer vision, including multi-camera multi-object tracking (MCMT) [1,2,3], multi-camera object activity analysis [2,4], and smart city systems [5], among others. The task remains challenging primarily because visual appearance undergoes dramatic transformations across disjoint camera perspectives. These variations are attributable to covariate factors, including posture diversity, viewpoint disparities, partial occlusions, illumination inconsistencies, low-resolution imaging, and background clutter [6,7]. Most existing ReID approaches adopt the supervised learning paradigm, including distance metric learning [8,9,10,11] and deep learning methods [12,13,14,15], which presume the availability of substantial volumes of manually labeled images with person identities for learning a matching distance function or a feature representation.
In contrast to person identity, which provides a holistic characterization, pedestrian attributes [16,17] constitute intermediate semantic descriptors localized to specific anatomical regions of the human body. For example, upper-body clothing primarily characterizes the torso, hair length pertains to the head region, and shoe type correlates with the feet. These attributes are defined by a domain-specific basis set that is semantically interpretable by humans, exhibiting high intra-class consistency and reduced ambiguity under subjective interpretation. Consequently, they demonstrate enhanced invariance to intra-individual appearance variations induced by posture changes, viewpoint shifts, and other confounding factors, thereby serving as an appropriate depiction for human–machine interaction and intuitive human comprehension. As illustrated in the first row of Figure 1, conventional ReID systems may fail to distinguish the first two individuals wearing visually similar gray clothes and backpacks; however, attribute-based auxiliary cues, such as gender (male), accessory absence (no hat, no bags), can effectively eliminate false matches by leveraging complementary discriminative signals. Although part-level information is notable for robust person ReID, its full potential remains insufficiently exploited, particularly regarding the synergistic integration of attribute semantics and their contextual interdependencies. Extensive research has demonstrated that person ReID algorithms achieve substantial performance gains through the integration of pedestrian attribute supervision [18,19,20]. Nevertheless, manual annotation of attribute labels is considerably more labor-intensive and time-consuming than assigning identity labels, particularly in real-world deployment scenarios. Despite efforts by Lin et al. [19] to annotate several ReID datasets with attribute metadata, a persistent scarcity of large-scale, attribute-annotated ReID benchmarks remains a considerable limitation in the field. As it is prohibitively expensive to obtain sufficient training data for a set of attributes, this scalability issue severely limits the usability of existing attribute-based supervised ReID methods.
Recently, deep convolutional neural networks (CNNs) have adopted an end-to-end and integrated learning framework, which leads to promising performance and generalization ability in various visual tasks like image classification [22,23] and face recognition [24,25,26]. Building upon the complementary strengths of semantic pedestrian attributes and CNN-based feature learning, our research aims to train a discriminative attribute detection model for person ReID. The marked annotation burden associated with manually labeling diverse and fine-grained attributes at scale poses a fundamental constraint. Consequently, the notable challenge addressed herein is the design of a transfer learning paradigm that leverages labeled attribute data from an independent source domain to train a CNN model. This paradigm ensures domain-invariant attribute representation and optimal generalization to unlabeled target ReID data, mitigating domain shift while preserving discriminative power. Confronted by limited attribute-labeled data and cross-domain discrepancies, we propose an unsupervised domain adaptation (UDA) paradigm that leverages attribute semantics from a labeled source dataset to learn domain-invariant representations, thereby enabling robust person ReID on target datasets without annotation dependency.
On the source dataset, we present a multi-task learning framework that trains the attribute CNN model, initialized with pre-trained weights, to simultaneously predict auxiliary attribute and identity labels, i.e., to learn local attribute feature representations together with a global identity representation. We then propose an attribute consistency scheme for unsupervised adaptation of the pre-trained attribute CNN model on the unlabeled target data, further enhancing the discriminative compatibility of the model toward the target domain for person ReID. Specifically, we apply the pre-trained attribute CNN model to the target dataset to extract both local attribute descriptors and global image features, and we further apply an unsupervised clustering algorithm to generate clusters by exploiting the potential similarity of the joint global-local features of unlabeled images. These clusters are assigned cluster labels, which serve as pseudo identities for supervising the training process on the target dataset. During learning on the target dataset, feature extraction with the newly trained model, pseudo-label assignment for the unlabeled images, and fine-tuning of the model are performed alternately until learning converges. To evaluate the learned deep attributes, we validate their performance on four well-known person ReID benchmarks.
This research makes the following key contributions: (a) We propose an unsupervised person ReID approach that integrates a novel multi-task CNN architecture with a self-training algorithm; it is capable of jointly leveraging both attribute and identity data. (b) We introduce an attribute consistency scheme and incorporate it into the unsupervised domain adaptation framework, aiming to ensure discriminative compatibility between person identity features and attribute features during the learning process. (c) Our comparative analysis demonstrates that learned deep attributes exhibit superior cross-task generalizability, achieving competitive performance in both person ReID and fine-grained attribute classification tasks.
The remainder of this paper is organized as follows. Section 2 reviews the most relevant works on person re-identification and attribute-based learning. Section 3 presents our proposed part-based CNN architecture and details the two-stage training pipeline. Section 4 describes the experimental setup, benchmark datasets, and evaluation metrics, followed by extensive comparative results and ablation studies. Section 5 concludes the paper and outlines future research directions.

2. Related Work

2.1. Person Re-Identification

Current person ReID approaches fall into three primary categories: (1) hand-crafted feature design (i.e., designing primitive descriptors of human appearance) [27,28,29,30]; (2) distance metric learning (i.e., learning an embedding space in which feature pairs of the same person are pulled close and pairs of different persons are pushed apart) [8,9,10,11]; (3) CNN-based learning (i.e., integrating feature extraction and metric learning within a unified, end-to-end framework to generate robust and enhanced feature representations).
Recent advancements establish CNN-based models as a widely adopted paradigm in ReID, leveraging their capacity to learn robust and transferable feature representations that enhance cross-camera discriminability. Early approaches automatically extract discriminative pedestrian feature representations from the entire body image, i.e., global feature learning that captures distinctive appearance characteristics to characterize the identities of different pedestrians. Nevertheless, due to the intricacies of visual data acquired in monitoring environments, global feature learning is likely to overlook subtle yet detailed information, which degrades accuracy in large-scale and complex ReID scenarios. To overcome this limitation, part-based approaches for person ReID, which isolate discriminative anatomical regions from images to characterize partial discriminative traits of subjects, have been introduced and proven to be a robust methodology for improving ReID precision [31,32,33,34]. Such works include locating part regions with empirical knowledge about human bodies [31,32], strong learning-based pose information [35], region proposal methods [33], uniformly partitioned regions for local feature extraction, and mid-level attention focused on salient segments [34]. Unlike these previous works, our approach combines global full-body feature extraction with equally divided part-based local feature learning. Our architecture incorporates hierarchical feature learning by fusing mid-level attribute labels and identity supervision in an end-to-end manner. The coupled optimization mechanism leverages attribute-identity correlations to jointly enhance discriminative power for both tasks, eliminating the need for staged training pipelines.
To our knowledge, most existing CNN-based ReID methods are based on a supervised learning strategy that relies on large amounts of manually labeled training data with person identities. Such methods therefore suffer from poor scalability in realistic ReID deployments, as large numbers of labeled samples are unavailable in many practical situations. Moreover, deep neural networks pre-trained on existing source datasets cannot generalize well to new target data due to different appearance characteristics. Consequently, unsupervised person ReID methodologies have emerged to address cross-domain distribution discrepancies in scenarios where annotated data is absent in the target domain.
The general idea behind these unsupervised methods is to make the features extracted by neural networks similar across domains. As a foundational paradigm in domain adaptation, adversarial methods optimize a domain classifier to discriminate feature provenance, enforcing domain-invariant representation learning via gradient reversal mechanisms. For example, Deng et al. [36] proposed a similarity-preserving generative adversarial network (SPGAN) to achieve image translation across domains by enforcing structural consistency, bridging the source and target domains. Self-training-based methods have emerged as an alternative strategy and attain pioneering results on unsupervised person ReID. In general, self-training assigns pseudo-labels to target-domain instances exhibiting elevated confidence scores. Fan et al. [37] introduced a progressive optimization framework that iteratively alternates between K-means clustering and integrated discrimination embedding (IDE) fine-tuning. Liu et al. [38] deployed a reciprocal search mechanism to progressively refine estimated pseudo-labels. Song et al. [39] used a self-training loop that alternates between pseudo-label assignment (via the current encoder) and encoder updates (via the assigned labels). Our work also adopts the self-training scheme to enhance the features contributing to the target classification. To address cross-domain discriminability alignment, we propose an attribute consistency regularization mechanism that enforces attribute-identity compatibility during unsupervised adaptation on unlabeled target data, thereby mitigating domain shift through latent semantic alignment.

2.2. Attribute-Based Learning

Person attributes are a kind of high-level semantic information that is highly relevant to a person’s identity yet invariant to variations in visual appearance. Current research extensively leverages deep learning frameworks for attribute learning to facilitate attribute prediction [16,17] and enhance person ReID [40]. In ReID systems, semantic attributes demonstrate substantial efficacy in preserving intra-class representation consistency while amplifying inter-class discriminability [18,41,42]. Su et al. structured a joint optimization framework employing low-dimensional factorized trait embeddings [43,44] to resolve cross-camera person re-identification challenges. Similarly, Schumann and Stiefelhagen integrated automatically inferred attribute descriptors into convolutional neural network training pipelines [45], enabling joint optimization of identity and attribute representations. TJ-AIDL [46] learns joint identity-attribute discriminative features for the unlabeled target dataset. In our work, we adopt a similar strategy to TJ-AIDL by leveraging person attribute information from a labeled source dataset, but we develop a more effective part-based CNN architecture in which global image and local attribute representations are learned simultaneously. Unlike TJ-AIDL, we also introduce an additional self-training scheme that trains our joint identity-attribute model for robustness.

3. Model Description

3.1. Problem Formulation

Consider a labeled source attribute dataset $\mathcal{I}_s = \{(I_s^i, y_s^i, a_s^i)\}_{i=1}^{N_s}$, where the $N_s$ person bounding-box images $I_s^i$ are annotated with unique identity indices $y_s^i \in \mathcal{Y} = \{1, \ldots, C_s\}$ and attribute vectors $a_s^i \in \mathbb{R}^m$. This source dataset depicts $C_s$ distinct identities captured from disjoint camera perspectives. We denote $a_s^i = [a_s^{i,1}, \ldots, a_s^{i,m}]$ as an attribute annotation vector containing $m$ attributes, where $a_s^{i,j} \in \{0, 1\}$ serves as the binary flag for the occurrence of the $j$-th attribute. We also have a target dataset $\mathcal{I}_t = \{I_t^i\}_{i=1}^{N_t}$, which consists of $N_t$ person images. Note that the identity of each image $I_t^i$ in the target dataset $\mathcal{I}_t$ is unknown. The objective is to adapt a CNN model pre-trained on the labeled source attribute dataset to the target dataset, where neither attribute nor identity label information is available.
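For concreteness, the source and target samples can be represented by lightweight containers such as the following sketch; the class and field names are illustrative only and not part of the paper.

```python
# A minimal sketch of the problem setup; class and field names are illustrative only.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SourceSample:
    image_path: str        # person bounding-box image I_s^i
    identity: int          # identity index y_s^i in {1, ..., C_s}
    attributes: List[int]  # binary attribute vector a_s^i of length m (one 0/1 flag per attribute)

@dataclass
class TargetSample:
    image_path: str                        # person bounding-box image I_t^i
    pseudo_identity: Optional[int] = None  # unknown initially; filled in later by clustering
```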

3.2. Overview

To address the above problem, we present an unsupervised domain adaptation approach for person ReID and attribute prediction. As illustrated in Figure 2, our approach consists of two training stages:
(1)
Pre-training the attribute CNN model on the source dataset: We develop a part-based CNN architecture with two components: one learns whole-image and body-attribute representations at a global level, and the other learns upper- and lower-body image and attribute representations at a local level. The architecture is trained on the labeled source attribute dataset in a supervised manner to learn attribute-semantic and identity-discriminative feature representations simultaneously.
(2)
Self-training attribute CNN model on the target dataset: In the second stage, we adopt the same CNN architecture as in the first stage. At the beginning of the second stage, we use the parameters of the pre-trained attribute CNN model to initialize the attribute CNN model. We then extract the global and local image and attribute features on the unlabeled images in the target dataset and employ an unsupervised clustering approach to assign them pseudo labels. The attribute CNN model is fine-tuned using the inputs of the unlabeled images and pseudo labels. During self-training, we iteratively extract features with the newly trained CNN model, assign pseudo labels, and fine-tune the CNN model until it becomes stable.
Figure 2. Our unsupervised domain adaptation approach for person ReID.
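The two stages can be summarized in the following skeleton; every callable is a placeholder supplied by the surrounding training code, so the function only fixes the order of operations described above and is not the authors' implementation.

```python
# Skeleton of the two-stage pipeline; every callable is supplied by the surrounding
# training code, so this function only fixes the order of operations described above.
def train_unsupervised_reid(model, pretrain_on_source, extract_features,
                            cluster_and_assign, predict_attributes, finetune,
                            source_loader, target_images, num_iters=30):
    # Stage 1: supervised pre-training on the labeled source attribute dataset.
    model = pretrain_on_source(model, source_loader)

    # Stage 2: alternating self-training on the unlabeled target dataset.
    for _ in range(num_iters):
        feats = extract_features(model, target_images)           # global + part + upper/lower features
        pseudo_ids = cluster_and_assign(feats)                   # unsupervised clustering -> pseudo identities
        pseudo_attrs = predict_attributes(model, target_images)  # pseudo attribute labels
        model = finetune(model, target_images, pseudo_ids, pseudo_attrs)
    return model
```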

3.3. Fully Supervised Attribute CNN Pre-Training

As source-trained models acquire transferable person representations beneficial for ReID tasks, we pre-train an attribute CNN model on the source attribute dataset, using its ground-truth attribute annotations in a fully supervised manner. Unlike existing methods that adopt generic CNN backbones (e.g., VGG-Net and ResNet), we develop a dedicated network architecture that accounts for both the visual characteristics of pedestrians and the semantic structure of person attributes. Building on prior work that learns discriminative representations from distinct human body regions [47], we propose a component-aware deep CNN to derive a globally invariant representation of pedestrians and locally invariant representations of human attributes simultaneously. To enhance feature discriminability, semantic attributes are integrated into our CNN model via multi-task learning, generating complementary representations through joint optimization of attribute classification and identity recognition tasks.
Our CNN model has several shared front-end hidden layers that extract low-level image features. We then feed these common features into both the global and part branches. The global branch learns global image and attribute feature representations. The part branch learns both a whole-body representation and localized upper-body and lower-body image and attribute features. Figure 3 illustrates our proposed deep attribute learning architecture for person re-identification. We adopt ResNet-50 as our backbone network, which has been widely validated to deliver state-of-the-art performance across diverse ReID frameworks.
The ResNet-50 architecture is structured into five convolutional stages: re_conv1, re_conv2_x (x = 1, 2, 3), re_conv3_x (x = 1, ..., 4), re_conv4_x (x = 1, ..., 6), and re_conv5_x (x = 1, 2, 3), where x indexes the stacked blocks (the prefix “re_” applies to all blocks; Table 1 in [23] provides more details). The architecture retains the original ResNet-50 configuration through the re_conv4_2 block. After the re_conv4_1 block, the network bifurcates into two branches: a global branch and a part-based branch. The dual-branch CNN architecture is trained for three heterogeneous tasks: person re-identification, semantic attribute classification, and cross-instance verification. The global branch preserves all re_conv4_x and re_conv5_x block parameters from the standard ResNet-50, and the re_conv5_x output feeds into a global max-pooling (GMP) layer for spatial feature aggregation. The global branch then generates two independent feature representations via fully connected layers: one embedding, $f_{s,G}$, is used for joint person identification and attribute recognition based on the global region of the input image, while the other is dedicated to person verification. This global branch thus acquires integrated characteristic embeddings and mid-level attributes from the entire image without spatial partitioning.
The part branch adopts an analogous topological configuration to the global branch. The key architectural modifications are that down-sampling is omitted in the re_conv5_1 block to maintain a receptive field suitable for local feature preservation, and a part-based partitioning strategy splits the re_conv5_1 output feature maps along the height dimension into complementary upper-body and lower-body regions, enabling more precise attribute localization, as detailed in Table 1. The feature $f_{s,P}$ derived after GMP is dedicated to two objectives: joint person identification and attribute recognition, and cross-instance verification. The two bisected feature segments specialize in learning local representations for identity and attribute recognition: $f_{s,UP}$ targets upper-body attribute identification, while $f_{s,LP}$ focuses on lower-body attribute identification. After pre-training, we can predict the attributes of an image with the pre-trained attribute CNN model through the following steps: (1) for an image $I_s^i$ in the source dataset, we extract the attributes $a_{s,G}^i$ and $a_{s,P}^i$ predicted at the global level and the attributes $a_{s,UP}^i$ and $a_{s,LP}^i$ predicted at the local level; (2) the above four outputs vote for the final attribute prediction, and each attribute item has three votes; we select the value with the maximum number of votes as the final attribute prediction $\tilde{a}_s^i$.
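To make the branch layout concrete, the following PyTorch sketch shows one possible way to realize the dual-branch backbone on top of torchvision's ResNet-50. The split point, the stride modification, and the head layout follow our reading of the description above; the class and attribute names (e.g., PartBasedAttributeNet, id_heads) are illustrative assumptions and not the authors' released code.

```python
import copy

import torch.nn as nn
from torchvision.models import resnet50

class PartBasedAttributeNet(nn.Module):
    """Dual-branch ResNet-50 sketch: a global branch plus a part branch split into upper/lower halves."""

    def __init__(self, num_ids, num_attrs, feat_dim=2048):
        super().__init__()
        base = resnet50()  # pretrained ImageNet weights would be loaded here in practice
        # Shared layers: conv1 ... layer2 plus the first block of layer3
        # (roughly re_conv1 .. re_conv4_1 in the paper's naming).
        self.stem = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool,
                                  base.layer1, base.layer2, base.layer3[0])
        # Global branch: remaining layer3 blocks and layer4, unchanged.
        self.global_branch = nn.Sequential(base.layer3[1:], base.layer4)
        # Part branch: an independent copy with the last stage's stride removed, so the
        # feature map stays tall enough to split into upper- and lower-body halves.
        part_layer3 = copy.deepcopy(base.layer3[1:])
        part_layer4 = copy.deepcopy(base.layer4)
        part_layer4[0].conv2.stride = (1, 1)
        part_layer4[0].downsample[0].stride = (1, 1)
        self.part_branch = nn.Sequential(part_layer3, part_layer4)

        self.gmp = nn.AdaptiveMaxPool2d(1)  # global max pooling
        # One identity classifier and one attribute classifier per feature (G, P, UP, LP).
        self.id_heads = nn.ModuleDict({k: nn.Linear(feat_dim, num_ids) for k in ("G", "P", "UP", "LP")})
        self.attr_heads = nn.ModuleDict({k: nn.Linear(feat_dim, num_attrs) for k in ("G", "P", "UP", "LP")})

    def forward(self, x):
        shared = self.stem(x)
        g_map = self.global_branch(shared)
        p_map = self.part_branch(shared)
        h = p_map.size(2)
        feats = {
            "G": self.gmp(g_map).flatten(1),                   # global feature f_G
            "P": self.gmp(p_map).flatten(1),                   # part-branch whole-body feature f_P
            "UP": self.gmp(p_map[:, :, : h // 2]).flatten(1),  # upper-body feature f_UP
            "LP": self.gmp(p_map[:, :, h // 2 :]).flatten(1),  # lower-body feature f_LP
        }
        id_logits = {k: self.id_heads[k](v) for k, v in feats.items()}
        attr_logits = {k: self.attr_heads[k](v) for k, v in feats.items()}
        return feats, id_logits, attr_logits
```

With a 384 × 128 input, the part branch's feature map keeps enough height (24 rows here) to be split evenly into upper and lower halves before pooling.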

3.4. Unsupervised Adaptive Attribute Learning

The pre-trained CNN model performs poorly if it is directly applied to the target dataset because of domain shift, where the distributions of the source and target visual domains may be appreciably different. To address this problem, most classical unsupervised person ReID approaches [37,48] feed each unlabeled image in the target dataset into the pre-trained CNN model for feature extraction, and then use clustering algorithms to group the unlabeled images into several clusters. Each unlabeled sample is assigned a pseudo-label derived from the clustering outcome, facilitating subsequent semi-supervised model training. These labels reconfigure the target dataset for iterative model refinement through fine-tuning. While we adopt this idea of clustering unlabeled images in the target domain and then using the resultant groupings for domain shift adaptation, our framework implements hierarchical feature comparison: global full-body representations are augmented with fine-grained attribute features extracted from anatomically partitioned upper- and lower-body regions.
To implement the proposed algorithm, we first feed each unlabeled image $I_t^i$ from the target dataset into the pre-trained model detailed in Section 3.3 for feature extraction, obtaining the global features $f_{t,G}^i$ and $f_{t,P}^i$, the upper-body feature $f_{t,UP}^i$, and the lower-body feature $f_{t,LP}^i$. We concatenate these four features to form the final feature $f_{t,C}^i$. We then apply an unsupervised clustering algorithm to the features $f_{t,C}^i$ to partition the person images into distinct clusters, assigning each image a pseudo-label $y_{t,C}^i$ according to its cluster assignment. Consequently, we construct a new target dataset in which each image is assigned a pseudo-label derived from the clustering of the final feature vectors $f_{t,C}^i$, formally defined as follows:
$S_{id} = \{ (I_t^i, y_{t,C}^i) \; ; \; 1 \le i \le N_t \}.$
The pseudo-labels serve as self-supervision to optimize the pre-trained model for cross-dataset adaptation, adhering to the cluster assumption that features within the same cluster should converge while those across different clusters diverge.
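The clustering step could be sketched as follows; k-means stands in for the unspecified unsupervised clustering algorithm, the loader is assumed to yield plain image batches, and the model interface (a dictionary of G/P/UP/LP features) follows the earlier architecture sketch rather than the authors' code.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

@torch.no_grad()
def assign_pseudo_labels(model, target_loader, num_clusters, device="cuda"):
    """Extract concatenated G/P/UP/LP features for every target image and cluster them."""
    model.eval()
    all_feats = []
    for images in target_loader:                 # unlabeled target images only
        feats, _, _ = model(images.to(device))   # dict of per-branch features
        f_c = torch.cat([feats["G"], feats["P"], feats["UP"], feats["LP"]], dim=1)
        all_feats.append(F.normalize(f_c, dim=1).cpu())
    all_feats = torch.cat(all_feats).numpy()
    # Each cluster index becomes a pseudo identity y_{t,C}.
    pseudo_ids = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(all_feats)
    return np.asarray(pseudo_ids)
```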
In the unsupervised adaptation, we aim to fine-tune the pre-trained CNN model on the target dataset with not only the global person identity information but also the semantic attribute information. Images of the same identity should exhibit similar features, while those of different identities should possess dissimilar features. Motivated by this observation, we present an attribute consistency scheme to adapt the pre-trained CNN model to produce similar attribute labels for the same person and vice versa. We employ the pre-trained model to predict the attribute label $\tilde{a}_t^i$ for each image $I_t^i$ at the beginning of unsupervised domain adaptation. As a result, we can add the attribute labels to the new dataset:
$S_{attr} = \{ (I_t^i, y_{t,C}^i, \tilde{a}_t^i) \; ; \; 1 \le i \le N_t \}.$
Our attribute consistency scheme includes two parts: (a) We jointly optimize attribute prediction by leveraging both global and local feature representations, so that the model learns to weigh and correct noisy supervision through multi-level consistency. Specifically, we extract the global-level attribute predictions $\tilde{a}_{s,G}^i$ and $\tilde{a}_{s,P}^i$ as well as the local-level predictions $\tilde{a}_{s,UP}^i$ (upper part) and $\tilde{a}_{s,LP}^i$ (lower part); these four sets of predicted attributes participate in a voting mechanism (denoted as $L_{attr}$ in Equation (11)). The final attribute prediction $\tilde{a}_s^i$ is determined by selecting the attribute value with the highest count of votes. Here, we treat $\tilde{a}_t^i$ as the pseudo attribute label for $I_t^i$; that is, we trust the attribute labels predicted by the pre-trained model. (b) We minimize the triplet loss of attribute vectors (denoted as $L_{attr}^{trip}$ in Equation (11)) during updating, i.e., we minimize the distance between the attributes of two images with the same pseudo-label and maximize the distance between the attributes of two images with different pseudo-labels. The attribute consistency scheme ensures that the same person has similar attributes and avoids meaningless attributes. The former part of the scheme is formulated as attribute identification with the image $I_t^i$ and its attribute prediction $\tilde{a}_t^i$ as inputs, while the latter is formulated by our attribute triplet loss $L_{attr}^{trip}$. We describe the details in Section 3.5.
Our self-training algorithm for unsupervised adaptation follows the work in [37]. We extract features with the adapted CNN model at the beginning and generate clusters with the unsupervised clustering algorithm. The pseudo cluster labels are assigned to the unlabeled images in the target dataset. The unlabeled images with the pseudo identity labels and pseudo attribute labels are input to the adapted CNN model for fine-tuning as shown in Figure 4. The optimization process induces a redistribution of data instances within the feature space, thereby progressively aligning images representing identical individuals toward their respective cluster centroids. Consequently, increasingly informative samples are iteratively incorporated into the training set until the inclusion of additional reliable data instances becomes infeasible.

3.5. Loss Function

This subsection delineates the formalization of loss functions deployed for supervised pre-training and unsupervised domain adaptation.

3.5.1. Supervised Pre-Training

The pre-trained network is trained end-to-end in a supervised manner using our multi-task learning architecture (Figure 3), which incorporates ID Loss, attribute identification loss, and Batch Hard Triplet Loss:
$L_{sup} = L_{id} + \lambda_1 L_{attr} + \lambda_2 L_{trip}^{bh},$
where $L_{id}$ and $L_{attr}$ denote the ID loss and the attribute identification loss, respectively; $L_{trip}^{bh}$ represents the triplet-based verification loss; and $\lambda_1$, $\lambda_2$ balance the contributions of these losses.
Person identification loss. For the ID loss, we employ the cross-entropy loss function to optimize the identity classification task, using the identity training labels. Formally, the probability of image $I_i$ being predicted as identity $y_i$ is given by:
$p_{id}(I_i, y_i) = \dfrac{\exp\!\big(w_{y_i}^{T} f(I_i)\big)}{\sum_{k=1}^{C} \exp\!\big(w_k^{T} f(I_i)\big)},$
where $f(I_i)$ denotes the embedding representation ($f_G$ or $f_P$ in Figure 3) of $I_i$, and $w_k$ is the classifier weight corresponding to identity class $k$. For a batch of $n_{bs}$ images, the training loss is:
$L_{id} = -\dfrac{1}{n_{bs}} \sum_{i=1}^{n_{bs}} \log\!\big(p_{id}(I_i, y_i)\big).$
By promoting highly discriminative identity features (i.e., characterized by maximal inter-class variance), this loss facilitates accurate multi-class classification. The person identification loss is applied to all four local and global features ($f_G$, $f_P$, $f_{UP}$, and $f_{LP}$).
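As a concrete reference, here is a minimal PyTorch sketch of the ID loss applied to the four classifier heads; the `id_logits` dictionary follows the interface of the architecture sketch in Section 3.3 and is an assumption rather than the authors' code.

```python
import torch.nn.functional as F

def identity_loss(id_logits, labels):
    """Cross-entropy ID loss summed over the G, P, UP, and LP classifier heads."""
    return sum(F.cross_entropy(logits, labels) for logits in id_logits.values())
```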
Attribute identification loss. The attribute loss function consists of four components: $L_{attr}^{G}$ (global attributes), $L_{attr}^{P}$ (part-based attributes), $L_{attr}^{UP}$ (upper-body attributes), and $L_{attr}^{LP}$ (lower-body attributes):
$L_{attr} = L_{attr}^{G} + L_{attr}^{P} + L_{attr}^{UP} + L_{attr}^{LP}.$
We address attribute prediction through multi-label classification, applying a sigmoid activation to the outputs of the last fully connected layer to constrain confidence values within $[0, 1]$. The global attribute loss $L_{attr}^{G}$ applies binary cross-entropy (BCE) across all $m$ attribute classes:
$L_{attr}^{G} = -\dfrac{1}{n_{bs}} \sum_{i=1}^{n_{bs}} \sum_{j=1}^{m} \Big( a_{i,j} \log\!\big(p_{att}(I_i, a_{i,j})\big) + (1 - a_{i,j}) \log\!\big(1 - p_{att}(I_i, a_{i,j})\big) \Big),$
where $a_{i,j}$ denotes the ground-truth attribute label, and $p_{att}(I_i, a_{i,j})$ represents the predicted classification probability for the $j$-th attribute class of training image $I_i$. Similarly, the part-based ($L_{attr}^{P}$), upper-body ($L_{attr}^{UP}$), and lower-body ($L_{attr}^{LP}$) attribute losses compute the binary cross-entropy between the ground-truth labels (Table 1) and the predicted classification probabilities.
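A corresponding sketch of the multi-label attribute loss is shown below; `binary_cross_entropy_with_logits` is used in place of an explicit sigmoid followed by BCE, which is mathematically equivalent but numerically more stable, and the per-head summation mirrors the decomposition above.

```python
import torch.nn.functional as F

def attribute_loss(attr_logits, attr_targets):
    """Multi-label BCE attribute loss summed over the G, P, UP, and LP heads.

    attr_targets: float tensor of shape (batch, m) holding the 0/1 attribute labels.
    """
    return sum(F.binary_cross_entropy_with_logits(logits, attr_targets)
               for logits in attr_logits.values())
```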
Batch Hard Triplet Loss. A triplet typically comprises an anchor sample ($I_{s_o}$), a positive sample ($I_{s_+}$) with the same identity, and a negative sample ($I_{s_-}$) from a different identity. Batch Hard Triplet Loss integrates triplet mining into the training loop by selecting the hardest positive and hardest negative samples for each anchor within a mini-batch to form hard triplets. Each mini-batch comprises $P$ randomly selected identities with $K$ instances per identity, resulting in a fixed batch size of $P \times K$. In mini-batch training, every image functions as an anchor sample in sequential order. The batch-hard triplet loss selects the most challenging positive and negative sample per anchor:
$L_{trip}^{bh} = \sum_{i=1}^{P} \sum_{s_o=1}^{K} \left[ D_{s_o,s_+}^{*} - D_{s_o,s_-}^{*} + \alpha \right]_{+},$
where $D_{s_o,s_+}^{*}$ and $D_{s_o,s_-}^{*}$ denote the hardest-positive and hardest-negative sample distances relative to anchor $I_{s_o}$, defined as
$D_{s_o,s_+}^{*} = \max_{s_+ = 1, \ldots, K} D\!\big(f(I_{s_o}^i), f(I_{s_+}^i)\big),$
$D_{s_o,s_-}^{*} = \min_{\substack{s_- = 1, \ldots, K \\ j = 1, \ldots, P,\; j \ne i}} D\!\big(f(I_{s_o}^i), f(I_{s_-}^j)\big),$
where $f(I_i)$ refers to the feature vector ($f_G$, $f_P$, $f_{UP}$, and $f_{LP}$ in Figure 3) of $I_i$.
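A compact sketch of the batch-hard selection (after Hermans et al. [15]) is given below; distances are Euclidean, and the loss is averaged over anchors here rather than summed, which only rescales the formulation above.

```python
import torch

def batch_hard_triplet_loss(features, labels, margin=1.2):
    """Batch-hard triplet loss over a P x K mini-batch (hardest positive/negative per anchor)."""
    dist = torch.cdist(features, features, p=2)            # pairwise Euclidean distances
    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)
    # Hardest positive: largest distance among same-identity pairs
    # (including the self-distance of 0, which does not affect the max).
    hardest_pos = (dist * same_id.float()).max(dim=1).values
    # Hardest negative: smallest distance among different-identity pairs.
    hardest_neg = torch.where(same_id, torch.full_like(dist, float("inf")), dist).min(dim=1).values
    return torch.clamp(hardest_pos - hardest_neg + margin, min=0).mean()
```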

3.5.2. Unsupervised Adaptation

We employ the pseudo-labels $y_{t,C}^i$ and attribute labels $\tilde{a}_t^i$ as the supervision to fine-tune the CNN model (as shown in Figure 4) for cross-dataset adaptation. The total objective function of the adaptive training is formulated as follows:
$L_{unsup} = L_v + \lambda_3 L_{attr} + \lambda_4 L_{attr}^{trip},$
where $L_v$ is the batch-hard triplet loss for person identity information:
$L_v = L_{trip}^{bh}(f_{t,G}^i, y_{t,G}^i) + L_{trip}^{bh}(f_{t,P}^i, y_{t,P}^i) + L_{trip}^{bh}(f_{t,UP}^i, y_{t,UP}^i) + L_{trip}^{bh}(f_{t,LP}^i, y_{t,LP}^i).$
$L_{attr}$ is the attribute identification loss (Equation (7)) computed with the pseudo attribute labels $\tilde{a}_t^i$, and $L_{attr}^{trip}$ is our attribute triplet loss.
Attribute Triplet Loss. We compute the attribute triplet loss $L_{attr}^{trip}$ during training. Fine-tuning minimizes the triplet loss by concurrently reducing the feature distance between the anchor $I_{s_o}$ and the positive sample $I_{s_+}$ and increasing the feature distance between the anchor $I_{s_o}$ and the negative sample $I_{s_-}$. The corresponding loss function is formulated as follows:
$L_{attr}^{trip} = \sum_{\substack{s_o, s_+, s_- \\ y_{s_o} = y_{s_+} \ne y_{s_-}}} \left[ D(a_t^{s_o}, a_t^{s_+}) - D(a_t^{s_o}, a_t^{s_-}) + \alpha \right]_{+}.$
Here, if $D(a_t^{s_o}, a_t^{s_-}) - D(a_t^{s_o}, a_t^{s_+})$ is larger than $\alpha$, the loss is 0.
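The following sketch computes the attribute triplet loss over all valid (anchor, positive, negative) triplets in a mini-batch, using predicted attribute probability vectors; the mean over valid triplets is taken here, whereas the equation above writes a plain sum, so the two differ only by a constant factor.

```python
import torch

def attribute_triplet_loss(attr_probs, pseudo_labels, margin=1.2):
    """Triplet loss over predicted attribute vectors for all valid (anchor, positive, negative) triplets."""
    dist = torch.cdist(attr_probs, attr_probs, p=2)                 # distances between attribute vectors
    same = pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)
    eye = torch.eye(len(attr_probs), dtype=torch.bool, device=attr_probs.device)
    pos_mask = same & ~eye                                          # (anchor, positive) pairs
    neg_mask = ~same                                                # (anchor, negative) pairs
    # hinge[a, p, n] = D(a, p) - D(a, n) + margin
    hinge = dist.unsqueeze(2) - dist.unsqueeze(1) + margin
    valid = pos_mask.unsqueeze(2) & neg_mask.unsqueeze(1)
    losses = torch.clamp(hinge, min=0)[valid]
    return losses.mean() if losses.numel() > 0 else attr_probs.new_zeros(())
```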

4. Experimental Results

Our approach is evaluated on four benchmark datasets: Market1501 [49], DukeMTMC-ReID [50], VIPeR [51], and PRID [52]; the results are detailed below. We first delineate the implementation protocols and dataset statistics, then conduct comprehensive benchmarking against state-of-the-art unsupervised ReID methods on three datasets: Market1501 [49], DukeMTMC-ReID [50], and VIPeR [51]. Next, we evaluate the accuracy of predicted attributes on two datasets: VIPeR [51] and PRID [52]. An ablation study quantifies the contribution of component-level building blocks to our proposed CNN architecture. Finally, qualitative results demonstrate ranking and attribute recognition performance through visual examples.

4.1. Implementation Details

Supervised pre-training: We initialize ResNet-50 with ImageNet pre-trained parameters. During both training and evaluation, input images are resized to $384 \times 128$ pixels and standardized using mean and standard deviation normalization. Mini-batches consist of 32 samples organized as 32 distinct identities with 4 instances per identity. Optimization employs stochastic gradient descent (SGD) over 320 training epochs. The learning rate is initialized to $2 \times 10^{-4}$ and undergoes exponential decay starting at epoch 160. The margin parameter $\alpha$ is empirically set to 1.2.
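For reference, the optimizer and schedule described above might be set up as in the following sketch; the decay factor, momentum, and weight decay are not stated in the text and are placeholder assumptions.

```python
import torch

def build_optimizer(model, base_lr=2e-4, decay_start=160, gamma=0.95):
    """SGD with a constant learning rate up to `decay_start`, then exponential decay per epoch."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-4)  # momentum/weight decay are assumptions
    lr_lambda = lambda epoch: 1.0 if epoch < decay_start else gamma ** (epoch - decay_start)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```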
Unsupervised domain adaptation: We apply an unsupervised clustering algorithm to obtain pseudo-labels and then train the whole adaptation framework for 30 iterations until the model is stable. The learning rate decreases from an initial $6 \times 10^{-4}$ to $6 \times 10^{-5}$, and each self-training iteration runs for 70 epochs. For the weight parameters of the loss functions, $\lambda_1$ is set to 2, and $\lambda_2$ and $\lambda_3$ are both set to 0.1.
The proposed architecture was developed in the PyTorch [53] deep learning environment, utilizing two NVIDIA Titan GPUs.
Each GPU is equipped with 3584 CUDA cores, 12 GB of GDDR5X memory, and a memory bandwidth of 10 Gbps. Consistent with these hardware specifications, identical experimental settings were applied across all dataset evaluations. For the typical task “Market-1501→DukeMTMC-ReID”, the total training time is approximately 10 h, including supervised learning on the source dataset Market-1501 and unsupervised domain adaptation on the target dataset DukeMTMC-ReID, which includes feature extraction, clustering, pseudo-label generation, and self-training learning.
The detailed training configurations are summarized in Table 2.

4.2. Datasets

Our approach is assessed across four publicly available datasets. Key dataset statistics, such as the number of training/test identities, probe/gallery images, and attribute information, are compiled in Table 3. Note that we use the labeled source dataset for pre-training our CNN model, but only the training images without any label information (identity or attribute labels) for self-training on the target dataset.

4.2.1. Market1501

The Market-1501 dataset [49] comprises 32,668 automatically detected bounding-box images of 1501 unique identities, captured across six cameras. The training set includes 12,936 images of 751 identities, while the test set consists of 19,732 images of the remaining 750 identities. Specifically, the dataset contains 3368 query images and a gallery of 19,732 images. Unlike hand-labeled datasets, Market-1501 uses the deformable part model (DPM) for pedestrian detection, making it more representative of real-world scenarios. Market-1501 provides 27 annotated attributes, including gender, age, bag-carrying status, and clothing colors (specified separately for upper- and lower-body garments). The dataset is pre-partitioned into dedicated training and test subsets, and evaluation follows the single-query protocol, selecting one query image at a time to retrieve matching identities from the gallery.

4.2.2. DukeMTMC-ReID

DukeMTMC-ReID dataset [50] is a person ReID benchmark derived from the larger DukeMTMC multi-camera tracking dataset. It is constructed from 85 min of high-resolution videos captured by 8 cameras, with person bounding boxes manually annotated. Structurally similar to Market-1501, DukeMTMC-ReID includes 36,411 bounding boxes spanning 1404 identities that appear in at least two cameras, along with 408 distractor identities visible in only one camera. The dataset is split into 702 identities for training and 702 for testing. For evaluation, a single query image per identity per camera is used, and all other relevant images are placed in the gallery. The final composition includes a gallery of 17,661 images, 2228 queries, and 16,522 training images. The dataset is annotated with 23 attribute categories, including shoe type and color, presence of hats, and whether the person is carrying a backpack, handbag, or other types of bags. As with Market-1501, the dataset splits are predefined, and evaluation follows the standard Single-Query protocol.

4.2.3. VIPeR and PRID

VIPeR and PRID datasets utilize attribute annotations sourced from PETA [54]. Each image contains labeling for 61 binary attributes alongside 4 multi-class attributes (e.g., gender, hair length, backpack presence, upper/lower body clothing, and footwear). The multi-class attributes are expanded into 44 binary equivalents, yielding comprehensive 105-dimensional binary attribute representations. During cross-dataset evaluation, the test dataset is systematically excluded from the PETA training subset. For instance, when evaluating on VIPeR, none of its images participate in CNN training.
VIPeR [51] presents substantial challenges due to its low-resolution imagery, maintaining its status as a benchmark despite extensive research. This dataset comprises 632 unique identities, each represented by two images captured from distinct viewpoints under varying illumination conditions. Images are standardized to 128 × 48 resolution and randomly partitioned into equal training and testing subsets.
PRID [52] features dual-camera viewpoints: Camera A captures 385 individuals while Camera B records 749 individuals, sharing 200 common identities across both views. Adopting the single-query protocol and data partitioning from [55], our experiments use 100 randomly selected common identities (with one image per view) for training. The remaining Camera A images form the probe set (100 identities), while Camera B’s 649 non-overlapping identities constitute the gallery. All images are rescaled to 128 × 64 pixels.

4.2.4. Evaluation Metrics

For person ReID evaluation, we employ the CMC curve and mAP. The CMC curve quantifies the probability of finding the correct identity match within the top-$k$ ranked gallery candidates. For each query, gallery images are ranked by ascending distance, and Top-$k$ accuracy is defined as $Acc_k = 1$ if the correct identity appears in the top-$k$ results, and $Acc_k = 0$ otherwise. The final CMC curve is derived by averaging $Acc_k$ across all probes [56], quantifying how regularly the true match appears within the top-ranked results.
The mAP metric, on the other hand, evaluates retrieval performance more comprehensively by considering all correct matches. We compute the mAP by averaging the area under the Precision-Recall curve (AP) per query. For evaluation, we utilize publicly available tools from [49,57].
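The two metrics can be sketched as follows; for brevity, the standard exclusion of same-camera, same-identity gallery samples is omitted, and `dist`, `q_ids`, and `g_ids` are assumed inputs (a query-by-gallery distance matrix and the corresponding identity arrays).

```python
import numpy as np

def evaluate_cmc_map(dist, q_ids, g_ids, topk=(1, 5, 10)):
    """Compute CMC Top-k and mAP from a (num_query, num_gallery) distance matrix."""
    cmc_hits = np.zeros(max(topk))
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                       # gallery ranked by ascending distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float64)
        if matches.sum() == 0:
            continue                                      # skip queries with no ground truth
        first_hit = int(np.argmax(matches))               # rank index of the first correct match
        if first_hit < len(cmc_hits):
            cmc_hits[first_hit:] += 1                     # counts toward every Top-k with k > rank
        # Average precision: mean precision at the rank of each correct match.
        cum_correct = np.cumsum(matches)
        ranks = np.flatnonzero(matches) + 1
        aps.append(np.mean(cum_correct[matches == 1] / ranks))
    cmc = cmc_hits / len(aps)
    return {f"Top-{k}": cmc[k - 1] for k in topk}, float(np.mean(aps))
```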
Attribute recognition performance is evaluated via per-attribute classification accuracy. Following standard protocols, the gallery set serves as the test set, with background and junk images systematically excluded due to missing annotations. We report both per-attribute classification accuracy and the overall mean accuracy across all attributes as the final performance measure.

4.3. Comparative Study on Unsupervised Person ReID

Given the unsupervised nature of our model, we benchmark it against state-of-the-art unsupervised ReID methods on the Market-1501, DukeMTMC-ReID, and VIPeR datasets under the single-query setting. The compared methods cover leading unsupervised ReID techniques in two categories: (1) unsupervised person ReID methods based on source identity label information, such as LOMO [58], UMDL [59], SPGAN [36], PUL [37], Theory [39], MAR [60], and ENC [61]; (2) unsupervised person ReID methods based on source identity and attribute label information, such as TJ-AIDL [46], SSDAL [62], UDAPR [63], and MMFA [64]. Below, we denote the source labels (identity and/or attribute labels) used in each work for training by ID and/or Attr, and tabulate all comparative results in Table 4, Table 5 and Table 6.
On the DukeMTMC-ReID dataset, Theory (ID), which employs only identity labels and also uses a self-training algorithm, performs best in the ID category, achieving 49.0% mAP, 68.4% Top-1, 80.1% Top-5, and 83.5% Top-10. TJ-AIDL (ID + Attr), supervised by both identity and attribute label information, yields the best performance in the ID + Attr category, i.e., 23.0% mAP, 44.3% Top-1, 58.6% Top-5, and 65.0% Top-10. In contrast, it is evident from Table 4 that our proposed deep CNN architecture (ID + Attr) substantially outperforms all the methods used in our comparative study on all metrics, increasing mAP to 54.2%, Top-1 to 73.1%, Top-5 to 81.3%, and Top-10 to 83.8%.
On Market-1501, Theory (ID), which employs only identity labels and also uses a self-training algorithm, performs best in the ID category, achieving 53.7% mAP, 75.8% Top-1, 89.5% Top-5, and 93.2% Top-10. TJ-AIDL (ID + Attr), supervised by both identity and attribute label information, yields the best performance in the ID + Attr category, i.e., 26.5% mAP, 58.2% Top-1, 74.8% Top-5, and 81.1% Top-10. Conversely, our approach (ID + Attr) outperforms all the methods used in our comparative study in terms of mAP (55.8%) and Top-1 (78.2%). These empirical results demonstrate that jointly exploiting ID and attribute information enables learning appreciably more discriminative representations by synergistically combining low-level image features and mid-level attribute semantics.
On the VIPeR dataset, the DLLR (ID) learns global image features with only person identity, and it performs the best in the ID category by achieving 29.6% in Top-1, 54.8% in Top-5, and 64.8% in Top-10, respectively. JSLAM (ID + Attr) integrates identity and attribute labels to address unsupervised ReID, yielding 34.6% Top-1, 60.1% Top-5, and 69.5% Top-10 accuracy. Our method (ID + Attr) outperforms it across all metrics, attaining 39.6% Top-1, 62.2% Top-5, and 70.3% Top-10 accuracy. On PRID, TJ-AIDL (ID + Attr) reaches 34.8% Top-1, while our method achieves state-of-the-art performance (35.5% Top-1) by simultaneously learning attribute-semantic and identity-discriminative features.

4.4. Results on Unsupervised Attribute Recognition

We evaluate person attribute recognition performance on the gallery sets of the VIPeR and GRID datasets, with quantitative results presented in Table 7. Our approach achieves state-of-the-art attribute recognition accuracy, surpassing prior methods [62]. On the VIPeR dataset, we raise the average attribute recognition accuracy to 63.2%, and on the GRID dataset, the average attribute recognition accuracy is 65.1%. In our unified architecture, identity labels serve as complementary cues for attribute recognition, generating richer discriminative representations than single-task learning.

4.5. Ablation Study

Our person ReID approach learns discriminative features by jointly leveraging person identity labels and attribute annotations on the source dataset (i.e., supervised person ReID learning) and self-adapts the global image and attribute features on the unlabeled images in a target domain (i.e., unsupervised person ReID learning). We perform module-level ablation studies to quantify the contribution of individual components to ReID performance, systematically removing each module to isolate its impact:
(1)
Ours-woID: without the person identification loss $L_{id}$;
(2)
Ours-woAttr: without the attribute identification loss $L_{attr}$;
(3)
Ours-woTrip: without the batch hard triplet loss $L_{trip}^{bh}$;
(4)
Ours-woAdapt: without the pre-trained model adaptation on the target dataset;
(5)
Ours-woAttrTrip: without the attribute triplet loss $L_{attr}^{trip}$;
(6)
Ours-woST: without self-training iterations;
(7)
Ours-woGF: without the global branch in the deep CNN architecture;
(8)
Ours-woLF: without the part branch in the deep CNN architecture.
All experiments are first pre-trained on the source dataset Market-1501, and then evaluated on the target dataset DukeMTMC-ReID using mAP and Top-1, Top-5, and Top-10 accuracy as metrics. To ensure fair comparison and isolate the effect of the model itself, no re-ranking techniques are applied during evaluation.
Effectiveness of identity recognition: To validate identity recognition efficacy, we contrast our full model with an identity-supervision-ablated variant (Ours-woID) under identical experimental conditions. As evidenced in Table 8, Ours-woID exhibits universal performance degradation across all metrics in both supervised and unsupervised ReID scenarios. For example, on the supervised ReID performance, mAP and Top-1 of Ours-woID drop from 81.5% to 62.8% and 92.7% to 82.3%, respectively; on the unsupervised ReID, mAP and Top-1 of Ours-woID drop from 54.2% to 47.7% and 73.1% to 67.0%, respectively. Experimental outcomes confirm that jointly leveraging identity and attribute labels enhances re-identification accuracy in our approach.
Effectiveness of attribute recognition: In order to quantify contributions of attribute supervision, we benchmark the full framework against its ablation variant without attribute supervision (Ours-woAttr). Table 8 documents consistent performance degradation across all metrics for the attribute-ablated configuration. Specifically, Ours-woAttr reduces mAP by 13.7 percentage points (from 91.5% to 77.8%) in supervised ReID evaluation. This ablation study confirms that incorporating attribute recognition appreciably boosts re-identification efficacy.
Effectiveness of identity verification: To assess the efficacy of identity verification, we benchmark our model against an ablated variant without the batch-hard triplet loss for identity learning (Ours-woTrip). Table 8 demonstrates that removing identity verification reduces the mAP of the Ours-woTrip model to 71.4% on supervised ReID and 35.0% on unsupervised ReID, which shows that identity verification also contributes considerably to person re-identification performance.
Effectiveness of domain adaptation: To quantify the contribution of domain adaptation, we benchmark our model against its ablated variant without CNN adaptation (Ours-woAdapt). Table 8 shows that the Ours-woAdapt model, which directly applies the model pre-trained on the source dataset, achieves the lowest performance on the target dataset: 18.4% mAP, 27.9% Top-1, 38.0% Top-5, and 44.0% Top-10. As the two domains are appreciably different, the pre-trained CNN model cannot distinguish the people in the target dataset.
Effectiveness of self-training algorithm: To quantify the contribution of self-training, we benchmark our model against its ablated variant without self-training iterations (Ours-woST). Table 8 shows that the Ours-woST model is fine-tuned with pseudo-labels that contain substantial noise; as a result, person ReID performance drops to 28.0% mAP.
Effectiveness of multi-part attribute triplet loss: To quantify the contribution of the multi-part attribute triplet loss, we benchmark the full model against its ablated variant without this loss (Ours-woAttrTrip). As shown in Table 8, removing the multi-part attribute triplet loss degrades performance across all metrics; specifically, Ours-woAttrTrip decreases mAP to 51.9%, Top-1 to 71.4%, and Top-5 to 79.6% on unsupervised ReID. These results demonstrate that our multi-part attribute triplet loss enhances person ReID performance.
Effect of unsupervised parameter analysis: For the loss function in Equation (11), the weighting parameters $\lambda_3$ and $\lambda_4$ control the contributions of $L_{attr}$ and $L_{attr}^{trip}$, respectively. The influence of $\lambda_3$ and $\lambda_4$ on model performance is analyzed in Figure 5 and Figure 6. When $\lambda_3 = 0$, $L_{attr}$ is discarded, and likewise $\lambda_4 = 0$ discards $L_{attr}^{trip}$. With the injection of $L_{attr}$ and $L_{attr}^{trip}$, mAP improves appreciably. As $\lambda_3$ varies, mAP peaks at $\lambda_3 = 0.1$, with both smaller and larger values resulting in decreased performance. A similar trend is observed for $\lambda_4$, with the highest mAP achieved at $\lambda_4 = 0.2$.
Effectiveness of global/local features: To evaluate the contribution of global/local level representation, we benchmark our full model against an ablated variant that discards the global branch (Ours-woGF) and the part branch (Ours-woLF). As shown in Table 8, removing global features (Ours-woGF) reduces the mAP from 81.5% to 75.6% on supervised ReID and from 54.2% to 53.2% on unsupervised ReID, confirming the critical role of global context in achieving robust identity matching. Discarding local features results in consistent performance degradation across all evaluation metrics: for supervised ReID, the mAP drops from 81.5% to 74.7%; similarly, for unsupervised ReID, the mAP decreases from 54.2% to 51.7%. These uniform declines demonstrate that local fine-grained details are indispensable for maintaining high accuracy.
Effectiveness of iterative pseudo-label refinement: To quantify the contribution of successive iterations, we benchmark our model against ablated variants trained with multiple different self-training iterations. Figure 7 shows that limiting the refinement to a single iteration reduces the mAP on unsupervised ReID from 54.2% (30 iterations) to 48.5%. Extending to 3 iterations recovers the mAP to 50.8%, while 5 iterations further lift it to 52.5%—still below the 30-iteration baseline. These progressive gains demonstrate that iterative pseudo-label refinement is crucial for achieving the final performance.

4.6. Visual Inspection

Here, we present some typical results on how our model behaves via visual inspection. To demonstrate the effective retrieval made by our proposed approach, we illustrate typical sample ranking results on the Market-1501 dataset in Figure 8. The ranking results demonstrate the proposed method’s strong retrieval capability, exhibiting higher true positive recall rates while effectively suppressing false positives. Notably, our approach successfully retrieves target persons despite substantial pose discrepancies relative to the query images. As illustrated in Figure 8 (Row 2), using a frontal-view query image generates Top-10 rankings containing the same identity in diverse poses, validating our model’s robustness to viewpoint variations.
Furthermore, Figure 9 presents attribute recognition results on the evaluation datasets. We observe model limitations in predicting pedestrian-carried object attributes (e.g., backpacks), where occlusion challenges appreciably impact prediction accuracy for such features.

5. Conclusions

We propose an unsupervised person ReID framework that jointly leverages identity and attribute labels within a unified network to learn discriminative re-identification features. By effectively fusing complementary cues from both label types during training, our model achieves competitive performance across four widely used benchmark datasets. The work presented in this paper provides further evidence that properly exploiting existing information sources improves person ReID performance without the need for additional laborious and time-consuming data annotation, thereby taking a large step toward real applications.
In future work, we plan to explore the correlations and dependencies among semantic attributes to further refine the learning process and improve performance in unsupervised settings. Additionally, we aim to address remaining challenges such as handling occlusion and leveraging auxiliary information for more robust inference. To strengthen the validity of our findings, we will also incorporate statistical significance tests and confidence interval analysis in our evaluation protocols to better establish the robustness and reliability of the observed improvements.

Author Contributions

Conceptualization, S.Z. and Y.X.; methodology, Y.X. and X.Z.; validation, X.Z., B.C. and K.W.; writing—original draft preparation, S.Z. and Y.X.; supervision, S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The work is supported by the National Natural Science Foundation of China (no. 62271409) and the Shaanxi Province Key Research Program (no. 2024QY2-GJHX-10).

Data Availability Statement

Data are available upon request.

Conflicts of Interest

Author Ke Wang is employed by China Railway First Survey and Design Institute Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. He, Y.; Wei, X.; Hong, X.; Shi, W.; Gong, Y. Multi-target multi-camera tracking by tracklet-to-target assignment. IEEE Trans. Image Process. 2020, 29, 5191–5205. [Google Scholar] [CrossRef]
  2. Xie, Z.; Ni, Z.; Yang, W.; Zhang, Y.; Chen, Y.; Zhang, Y.; Ma, X. A robust online multi-camera people tracking system with geometric consistency and state-aware re-id correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 7007–7016. [Google Scholar]
  3. Amini-Omam, M.; Torkamani-Azar, F.; Ghorashi, S.A. Maximum Likelihood Estimation for Multiple Camera Target Tracking on Grassmann Tangent Subspace. IEEE Trans. Cybern. 2018, 48, 77–89. [Google Scholar] [CrossRef] [PubMed]
  4. Loy, C.C.; Xiang, T.; Gong, S. Time-delayed correlation analysis for multi-camera activity understanding. Int. J. Comput. Vis. 2010, 90, 106–129. [Google Scholar] [CrossRef]
  5. Vitello, P.; Capponi, A.; Fiandrino, C.; Giaccone, P.; Kliazovich, D.; Bouvry, P. High-precision design of pedestrian mobility for smart city simulators. In Proceedings of the IEEE International Conference on Communications, Kansas City, MO, USA, 20–24 May 2018; pp. 1–6. [Google Scholar]
  6. Zhang, S.; Zhang, Q.; Wei, X.; Zhang, Y.; Xia, Y. Person Re-Identification With Triplet Focal Loss. IEEE Access 2018, 6, 78092–78099. [Google Scholar] [CrossRef]
  7. Niculescu-Mizil, A.; Patel, D.; Melvin, I. MCTR: Multi Camera Tracking Transformer. In Proceedings of the Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 28 February–4 March 2025; pp. 874–884. [Google Scholar]
  8. Ma, A.J.; Yuen, P.C.; Li, J. Domain transfer support vector ranking for person re-identification without target camera label information. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3567–3574. [Google Scholar]
  9. Dikmen, M.; Akbas, E.; Huang, T.S.; Ahuja, N. Pedestrian recognition with a learned metric. In Proceedings of the Asian Conference on Computer Vision, Queenstown, New Zealand, 8–12 November 2010; pp. 501–512. [Google Scholar]
  10. Xiong, F.; Gou, M.; Camps, O.; Sznaier, M. Person re-identification using kernel-based metric learning methods. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 1–16. [Google Scholar]
  11. Zheng, W.S.; Gong, S.; Xiang, T. Reidentification by relative distance comparison. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 653–668. [Google Scholar] [CrossRef] [PubMed]
  12. Zheng, L.; Bie, Z.; Sun, Y.; Wang, J.; Su, C.; Wang, S.; Tian, Q. Mars: A video benchmark for large-scale person re-identification. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 868–884. [Google Scholar]
  13. Xiao, T.; Li, H.; Ouyang, W.; Wang, X. Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1249–1258. [Google Scholar]
  14. Herzog, F.; Chen, J.; Teepe, T.; Gilg, J.; Hörmann, S.; Rigoll, G. Synthehicle: Multi-vehicle multi-camera tracking in virtual cities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 1–11. [Google Scholar]
  15. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar] [CrossRef]
  16. Shankar, S.; Garg, V.K.; Cipolla, R. Deep-carving: Discovering visual attributes by carving deep neural nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3403–3412. [Google Scholar]
  17. Chen, Q.; Huang, J.; Feris, R.; Brown, L.M.; Dong, J.; Yan, S. Deep domain adaptation for describing people based on fine-grained clothing attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5315–5324. [Google Scholar]
  18. Layne, R.; Hospedales, T.M.; Gong, S.; Mary, Q. Person re-identification by attributes. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012; Volume 2, p. 8. [Google Scholar]
  19. Lin, Y.; Zheng, L.; Zheng, Z.; Wu, Y.; Hu, Z.; Yan, C.; Yang, Y. Improving person re-identification by attribute and identity learning. Pattern Recognit. 2019, 95, 151–161. [Google Scholar] [CrossRef]
  20. Matsukawa, T.; Suzuki, E. Person re-identification using cnn features learned from combination of attributes. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancún, Mexico, 4–8 December 2016; IEEE: Cancun, Mexico, 2016; pp. 2428–2433. [Google Scholar]
  21. Zhang, S.; He, Y.; Wei, J.; Mei, S.; Wan, S.; Chen, K. Person re-identification with joint verification and identification of identity-attribute labels. IEEE Access 2019, 7, 126116–126126. [Google Scholar] [CrossRef]
  22. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Sun, Y.; Chen, Y.; Wang, X.; Tang, X. Deep learning face representation by joint identification-verification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1988–1996. [Google Scholar]
  25. Sun, Y.; Wang, X.; Tang, X. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2892–2900. [Google Scholar]
  26. Sun, Y.; Liang, D.; Wang, X.; Tang, X. Deepid3: Face recognition with very deep neural networks. arXiv 2015, arXiv:1502.00873. [Google Scholar] [CrossRef]
  27. Ma, B.; Su, Y.; Jurie, F. Bicov: A novel image representation for person re-identification and face verification. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012; pp. 1–11. [Google Scholar]
  28. Liu, C.; Gong, S.; Loy, C.C.; Lin, X. Person re-identification: What features are important? In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 391–401. [Google Scholar]
  29. Klare, B.F.; Jain, A.K. Heterogeneous face recognition using kernel prototype similarities. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1410–1422. [Google Scholar] [CrossRef] [PubMed]
  30. An, L.; Kafai, M.; Yang, S.; Bhanu, B. Reference-based person re-identification. In Proceedings of the International Conference on Advanced Video and Signal Based Surveillance, Krakow, Poland, 27–30 August 2013; pp. 244–249. [Google Scholar]
  31. Cheng, D.; Gong, Y.; Zhou, S.; Wang, J.; Zheng, N. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1335–1344. [Google Scholar]
  32. Li, W.; Zhu, X.; Gong, S. Person re-identification by deep joint learning of multi-loss classification. arXiv 2017, arXiv:1705.04724. [Google Scholar]
  33. Li, D.; Chen, X.; Zhang, Z.; Huang, K. Learning deep context-aware features over body and latent parts for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 384–393. [Google Scholar]
  34. Li, W.; Zhu, X.; Gong, S. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2285–2294. [Google Scholar]
  35. Zhao, H.; Tian, M.; Sun, S.; Shao, J.; Yan, J.; Yi, S.; Wang, X.; Tang, X. Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1077–1085. [Google Scholar]
  36. Deng, W.; Zheng, L.; Ye, Q.; Kang, G.; Yang, Y.; Jiao, J. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 994–1003. [Google Scholar]
  37. Fan, H.; Zheng, L.; Yan, C.; Yang, Y. Unsupervised person re-identification: Clustering and fine-tuning. ACM Trans. Multimed. Comput. Commun. Appl. 2018, 14, 83. [Google Scholar] [CrossRef]
  38. Liu, Z.; Wang, D.; Lu, H. Stepwise metric promotion for unsupervised video person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2429–2438. [Google Scholar]
  39. Song, L.; Wang, C.; Zhang, L.; Du, B.; Zhang, Q.; Huang, C.; Wang, X. Unsupervised domain adaptive re-identification: Theory and practice. Pattern Recognit. 2020, 102, 107173. [Google Scholar] [CrossRef]
  40. Shi, Z.; Hospedales, T.M.; Xiang, T. Transferring a semantic representation for person re-identification and search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4184–4193. [Google Scholar]
  41. Layne, R.; Hospedales, T.M.; Gong, S. Towards person identification and re-identification with attributes. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 402–412. [Google Scholar]
  42. Layne, R.; Hospedales, T.M.; Gong, S. Attributes-based re-identification. In Person Re-Identification; Springer: Berlin/Heidelberg, Germany, 2014; pp. 93–117. [Google Scholar]
  43. Su, C.; Yang, F.; Zhang, S.; Tian, Q.; Davis, L.S.; Gao, W. Multi-task learning with low rank attribute embedding for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3739–3747. [Google Scholar]
  44. Su, C.; Yang, F.; Zhang, S.; Tian, Q.; Davis, L.S.; Gao, W. Multi-task learning with low rank attribute embedding for multi-camera person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1167–1181. [Google Scholar] [CrossRef] [PubMed]
  45. Schumann, A.; Stiefelhagen, R. Person re-identification by deep learning attribute-complementary information. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 20–28. [Google Scholar]
  46. Wang, J.; Zhu, X.; Gong, S.; Li, W. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2275–2284. [Google Scholar]
  47. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282. [Google Scholar]
  48. Fu, Y.; Wei, Y.; Wang, G.; Zhou, Y.; Shi, H.; Huang, T.S. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6112–6121. [Google Scholar]
  49. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar]
  50. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 17–35. [Google Scholar]
  51. Gray, D.; Brennan, S.; Tao, H. Evaluating appearance models for recognition, reacquisition, and tracking. In Proceedings of the IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), Rio de Janeiro, Brazil, 14–18 December 2007; Volume 3, pp. 1–7. [Google Scholar]
  52. Hirzer, M.; Beleznai, C.; Roth, P.M.; Bischof, H. Person re-identification by descriptive and discriminative classification. In Proceedings of the 17th Scandinavian Conference on Image Analysis (SCIA 2011), Ystad, Sweden, May 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 91–102. [Google Scholar]
  53. PyTorch. Available online: https://pytorch.org/ (accessed on 5 August 2025).
  54. Deng, Y.; Luo, P.; Loy, C.C.; Tang, X. Pedestrian attribute recognition at far distance. In Proceedings of the ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 789–792. [Google Scholar]
  55. Peng, P.; Tian, Y.; Xiang, T.; Wang, Y.; Pontil, M.; Huang, T. Joint semantic and latent attribute modelling for cross-class transfer learning. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1625–1638. [Google Scholar] [CrossRef] [PubMed]
  56. Zheng, L.; Yang, Y.; Hauptmann, A.G. Person re-identification: Past, present and future. arXiv 2016, arXiv:1610.02984. [Google Scholar] [CrossRef]
  57. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3754–3762. [Google Scholar]
  58. Liao, S.; Hu, Y.; Zhu, X.; Li, S.Z. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2197–2206. [Google Scholar]
  59. Peng, P.; Xiang, T.; Wang, Y.; Pontil, M.; Gong, S.; Huang, T.; Tian, Y. Unsupervised cross-dataset transfer learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1306–1315. [Google Scholar]
  60. Yu, H.X.; Zheng, W.S.; Wu, A.; Guo, X.; Gong, S.; Lai, J.H. Unsupervised Person Re-identification by Soft Multilabel Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2148–2157. [Google Scholar]
  61. Zhong, Z.; Zheng, L.; Luo, Z.; Li, S.; Yang, Y. Invariance matters: Exemplar memory for domain adaptive person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 598–607. [Google Scholar]
  62. Su, C.; Zhang, S.; Xing, J.; Gao, W.; Tian, Q. Deep attributes driven multi-camera person re-identification. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 475–491. [Google Scholar]
  63. Zhu, X.; Morerio, P.; Murino, V. Unsupervised Domain-Adaptive Person Re-Identification Based on Attributes. In Proceedings of the IEEE International Conference on Image Processing, Taipei, Taiwan, 22–25 September 2019; pp. 4110–4114. [Google Scholar]
  64. Lin, S.; Li, H.; Li, C.T.; Kot, A.C. Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018. [Google Scholar]
  65. Zhong, Z.; Zheng, L.; Li, S.; Yang, Y. Generalizing a person retrieval model hetero-and homogeneously. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 172–188. [Google Scholar]
  66. Qi, L.; Wang, L.; Huo, J.; Zhou, L.; Shi, Y.; Gao, Y. A novel unsupervised camera-aware domain adaptation framework for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8080–8089. [Google Scholar]
  67. Li, Y.J.; Yang, F.E.; Liu, Y.C.; Yeh, Y.Y.; Du, X.; Frank Wang, Y.C. Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–178. [Google Scholar]
  68. Yu, H.X.; Wu, A.; Zheng, W.S. Cross-view asymmetric metric learning for unsupervised person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 994–1002. [Google Scholar]
  69. Farenzena, M.; Bazzani, L.; Perina, A.; Murino, V.; Cristani, M. Person re-identification by symmetry-driven accumulation of local features. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; IEEE: San Francisco, CA, USA, 2010; pp. 2360–2367. [Google Scholar]
  70. Zhao, R.; Ouyang, W.; Wang, X. Unsupervised salience learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 3586–3593. [Google Scholar]
  71. Wang, H.; Gong, S.; Xiang, T. Unsupervised learning of generative topic saliency for person re-identification. In Proceedings of the British Machine Vision Conference, Nottingham, UK, 1–5 September 2014. [Google Scholar]
  72. Lisanti, G.; Masi, I.; Bagdanov, A.D.; Del Bimbo, A. Person re-identification by iterative re-weighted sparse ranking. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1629–1642. [Google Scholar] [CrossRef] [PubMed]
  73. Kodirov, E.; Xiang, T.; Gong, S. Dictionary learning with iterative laplacian regularisation for unsupervised person re-identification. In Proceedings of the 26th British Machine Vision Conference, Swansea, UK, 7–10 September 2015; Volume 3, p. 8. [Google Scholar]
  74. Fernando, B.; Habrard, A.; Sebban, M.; Tuytelaars, T. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2960–2967. [Google Scholar]
  75. Ma, A.J.; Li, J.; Yuen, P.C.; Li, P. Cross-domain person reidentification using domain adaptation ranking svms. IEEE Trans. Image Process. 2015, 24, 1599–1613. [Google Scholar] [CrossRef] [PubMed]
  76. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
Figure 1. Instances of some person images, each paired with attribute labels and matched with a second image depicting the same person [21].
Figure 3. The deep CNN model for pre-training on the source dataset [21].
Figure 4. Our deep CNN backbones for unsupervised domain adaptation on the target dataset.
Figure 5. Effect of weight parameter λ3 in Equation (11).
Figure 6. Effect of weight parameter λ4 in Equation (11).
Figure 7. Effectiveness of iterations on unsupervised DukeMTMC-ReID.
Figure 8. Visualization of retrieved matches on Market-1501: Example query images (left column) and their corresponding top-10 ranked gallery images. Green borders indicate true positives; red borders denote false positives [21].
Figure 9. Examples of pedestrian attribute recognition. The two tables present predicted attributes and classification scores, with red borders highlighting erroneous predictions.
Table 1. The attribute hierarchy of the global, upper-body, and lower-body parts.
Global Part: age, gender, hair length, sleeve length, carrying bag, carrying backpack
Upper-body Part: colors of upper-body clothing, wearing hat, carrying handbag
Lower-body Part: length of lower-body, type of lower-body clothing, colors of lower-body clothing, shoe type, color of shoes
Table 2. Training setups.
Infra Details
GPU | NVIDIA Titan × 2
Training Hyperparameters (Supervised Pre-Training)
Input images | 384 × 128 pixels
Batch size | 32
Training epochs | 320
Initial learning rate | 2 × 10⁻⁴
Iterations | 30
Optimizer | SGD
λ1 (for L_attr in supervised pre-training) | 2
λ2 (for L_trip^bh in supervised pre-training) | 0.1
Training Hyperparameters (Unsupervised Domain Adaptation)
Input images | 384 × 128 pixels
Batch size | 32
Training epochs | 70
Initial learning rate | from 6 × 10⁻⁴ to 6 × 10⁻⁵
Iterations | 30
Optimizer | SGD
λ3 (for L_attr in unsupervised adaptation) | 0.1
λ4 (for L_attr-trip in unsupervised adaptation) | 0.2
Model Weights
ResNet-50 | pre-trained on ImageNet
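To make the supervised pre-training settings in Table 2 concrete, the sketch below shows how they might be wired up in PyTorch [53]. Only the hyperparameter values are taken from Table 2; the plain ResNet-50 backbone, the optimizer momentum, and the weighted-sum form of the objective are illustrative assumptions rather than the paper’s exact part-based architecture and loss.

```python
import torch
import torchvision
from torchvision import transforms

# Values below are taken from Table 2 (supervised pre-training stage).
BATCH_SIZE = 32
EPOCHS = 320
LR = 2e-4
LAMBDA_ATTR = 2.0   # lambda_1, weight of the attribute loss L_attr
LAMBDA_TRIP = 0.1   # lambda_2, weight of the batch-hard triplet loss L_trip^bh

# Input pre-processing: person crops resized to 384 x 128 pixels.
preprocess = transforms.Compose([
    transforms.Resize((384, 128)),
    transforms.ToTensor(),
])

# ImageNet-pretrained ResNet-50 backbone (Table 2, "Model Weights");
# the part-based global/local heads of the paper are omitted here.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
optimizer = torch.optim.SGD(backbone.parameters(), lr=LR, momentum=0.9)

def total_loss(id_loss, attr_loss, trip_loss):
    # Assumed weighted combination of identity, attribute, and triplet terms.
    return id_loss + LAMBDA_ATTR * attr_loss + LAMBDA_TRIP * trip_loss
```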
Table 3. The specifications of the four evaluated ReID datasets.
Datasets | VIPeR | PRID | Market-1501 | DukeMTMC-ReID
#identities | 632 | 934 | 1501 | 1812
#images | 1264 | 1134 | 32,643 | 36,411
#cameras | 2 | 2 | 6 | 8
#training IDs | 316 | 100 | 750 | 702
#test IDs | 316 | 100 | 751 | 702
#probe images | 316 | 100 | 3368 | 2228
#gallery images | 316 | 649 | 19,732 | 17,661
#attributes | 105 | 105 | 27 | 23
Table 4. Experimental findings of the proposed method and comparison methods on the DukeMTMC-ReID dataset, trained with the Market-1501 source dataset. CMC (Top-1, Top-5, Top-10) and mAP accuracies are reported. “ID” and “Attr” denote supervision with identity and attribute labels in that order. Bold indicates the best performance among all methods.
Market-1501→DukeMTMC-ReID
Methods | Source Label | mAP (%) | Top-1 (%) | Top-5 (%) | Top-10 (%)
LOMO [58] | ID | 4.8 | 12.3 | 21.3 | 26.6
Bow [49] | ID | 8.3 | 17.1 | 28.8 | 34.9
UDML [59] | ID | 7.3 | 18.5 | 31.4 | 37.6
SPGAN [36] | ID | 26.4 | 46.9 | 62.6 | 68.5
HHL [65] | ID | 27.2 | 46.9 | 61.0 | 66.7
PUL [37] | ID | 16.4 | 30.4 | 44.5 | 50.7
UCDA-CCE [66] | ID | 36.7 | 55.4 | - | -
Theory [39] | ID | 49.0 | 68.4 | 80.1 | 83.5
ARN [67] | ID | 33.4 | 60.2 | 73.9 | 79.5
MAR [60] | ID | 48.0 | 67.1 | 79.8 | -
ENC [61] | ID | 40.4 | 63.3 | 75.8 | 80.4
TJ-AIDL [46] | ID + Attr | 23.0 | 44.3 | 59.6 | 65.0
MMFA [64] | ID + Attr | 24.7 | 45.3 | 59.8 | 66.3
Present | ID + Attr | 54.2 | 73.1 | 81.3 | 83.8
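For reference, the CMC Top-k and mAP scores reported in Tables 4–6 can be computed from a query–gallery distance matrix roughly as sketched below. The same-camera junk filtering used by the standard Market-1501 and DukeMTMC-ReID protocols is omitted for brevity, so this is a simplified illustration rather than the official evaluation code.

```python
import numpy as np

def cmc_and_map(dist, query_ids, gallery_ids, topk=(1, 5, 10)):
    """Compute CMC Top-k and mAP from a (num_query, num_gallery) distance
    matrix; smaller distance means a better match."""
    max_rank = max(topk)
    cmc_hits = np.zeros(max_rank)
    aps, valid = [], 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                        # best matches first
        matches = (gallery_ids[order] == query_ids[i]).astype(float)
        if matches.sum() == 0:
            continue                                       # identity absent from gallery
        valid += 1
        first_hit = int(np.argmax(matches))                # rank of first correct match
        if first_hit < max_rank:
            cmc_hits[first_hit:] += 1                      # counts toward all ranks >= first_hit
        precision = np.cumsum(matches) / (np.arange(matches.size) + 1)
        aps.append((precision * matches).sum() / matches.sum())  # average precision
    cmc = cmc_hits / valid
    return {f"Top-{k}": float(cmc[k - 1]) for k in topk}, float(np.mean(aps))
```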
Table 5. Experimental findings of the proposed method and comparison methods on the target Market-1501 dataset, trained with the source DukeMTMC-ReID dataset. CMC (Top-1, Top-5, Top-10) and mAP accuracies are reported. “ID” and “Attr” denote supervision with identity and attribute labels in that order. Bold indicates the best performance among all approaches.
DukeMTMC-ReID→Market-1501
Methods | Source Label | mAP (%) | Top-1 (%) | Top-5 (%) | Top-10 (%)
LOMO [58] | ID | 8.0 | 27.2 | 41.6 | 49.1
Bow [49] | ID | 14.8 | 35.8 | 52.4 | 60.3
UDML [59] | ID | 12.4 | 34.5 | 52.6 | 59.6
CAMEL [68] | ID | 26.3 | 54.5 | - | -
SPGAN [36] | ID | 26.9 | 58.1 | 76.0 | 82.7
HHL [65] | ID | 31.4 | 62.2 | 78.8 | 84.0
PUL [37] | ID | 20.1 | 44.7 | 59.1 | 65.6
UCDA-CCE [66] | ID | 34.5 | 64.3 | - | -
Theory [39] | ID | 53.7 | 75.8 | 89.5 | 93.2
ARN [67] | ID | 39.4 | 70.3 | 80.4 | 86.3
MAR [60] | ID | 40.0 | 67.7 | 81.9 | -
ENC [61] | ID | 43.0 | 75.1 | 87.6 | 91.6
TJ-AIDL [46] | ID + Attr | 26.5 | 58.2 | 74.8 | 81.1
SSDAL [62] | ID + Attr | 19.6 | - | - | 39.4
MMFA [64] | ID + Attr | 27.4 | 56.7 | 75.0 | 81.8
Present | ID + Attr | 55.8 | 78.2 | 88.2 | 90.8
Table 6. Experimental findings of the proposed method and comparison methods on target datasets VIPeR [51] and PRID, trained on the source Market-1501 dataset. CMC (Rank-1, Rank-5, Rank-10) accuracies are reported. “ID” and “Attr” denote supervision with identity and attribute labels in that order. Bold indicates best performance among all approaches.
Methods | Source Label | VIPeR Top-1 (%) | VIPeR Top-5 (%) | VIPeR Top-10 (%) | PRID Top-1 (%) | PRID Top-5 (%) | PRID Top-10 (%)
SDALF [69] | ID | 19.9 | 38.9 | 49.4 | 16.3 | 29.6 | 38.0
eSDC [70] | ID | 26.7 | 50.7 | 62.4 | - | - | -
GTS [71] | ID | 25.1 | 50.0 | 62.5 | - | - | -
ISR [72] | ID | 27.0 | 49.8 | 61.2 | 17.0 | 34.4 | 42.0
DLLR [73] | ID | 29.6 | 54.8 | 64.8 | 21.1 | 43.7 | 55.8
kLFDA_N [10] | ID + Attr | 15.9 | 42.4 | 50.0 | 9.1 | 27.3 | 35.0
SADA [74] + kLFDA [10] | ID + Attr | 15.2 | 41.4 | 49.8 | 8.7 | 26.4 | 34.8
AdaRSVM [75] | ID + Attr | 10.9 | 23.7 | 33.1 | 4.9 | 13.1 | 18.4
Adversarial [76] | ID + Attr | 22.8 | 38.6 | 50.3 | - | - | -
JSLAM [55] | ID + Attr | 34.6 | 60.1 | 69.5 | 25.6 | 47.9 | 58.5
SSDAL [62] | ID + Attr | 37.9 | - | - | 20.1 | - | -
TJ-AIDL [46] | ID + Attr | 38.5 | - | - | 34.8 | - | -
Present | ID + Attr | 40.2 | 62.2 | 71.3 | 35.5 | 48.1 | 60.6
Table 7. Average attribute recognition accuracy (%) on VIPeR and GRID datasets. Bold denotes best performance among all methods.
Methods | VIPeR (%) | GRID (%)
SSDAL-Stage1 [62] | 57.2 | 60.7
SSDAL-Stage1 and 3 [62] | 57.1 | 61.1
SSDAL-Stage1 and 2 [62] | 56.9 | 60.6
SSDAL [62] | 58.6 | 62.7
Present | 63.2 | 65.1
Table 8. Ablation study on DukeMTMC-ReID (target dataset), trained on Market-1501 (source dataset). Bold denotes best performance across all module combinations.
Methods | mAP (%) | Top-1 (%) | Top-5 (%) | Top-10 (%)
Supervised Person ReID on Market-1501
Ours-woID | 62.8 | 82.3 | 91.7 | 94.7
Ours-woAttr | 77.8 | 91.5 | 96.7 | 98.0
Ours-woTrip | 71.4 | 88.7 | 95.7 | 97.2
Ours-woGF | 75.6 | 85.4 | 93.8 | 96.1
Ours-woLF | 74.7 | 85.1 | 92.7 | 95.8
Present | 81.5 | 92.7 | 96.8 | 98.1
Unsupervised Person ReID on DukeMTMC-ReID
Ours-woAdapt | 18.4 | 27.9 | 38.0 | 44.0
Ours-woST | 28.0 | 42.7 | 55.9 | 61.2
Ours-woID | 47.7 | 67.0 | 78.1 | 82.2
Ours-woTrip | 35.0 | 51.8 | 65.0 | 69.6
Ours-woAttrT | 51.9 | 71.4 | 79.6 | 82.8
Ours-woGF | 53.2 | 69.5 | 76.6 | 80.4
Ours-woLF | 51.7 | 62.2 | 75.1 | 75.8
Present | 54.2 | 73.1 | 81.3 | 83.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
