CORE‑ReID: Comprehensive Optimization and Refinement through Ensemble Fusion in Domain Adaptation for Person Re‑Identification

Abstract: This study introduces a novel framework, “Comprehensive Optimization and Refinement through Ensemble Fusion in Domain Adaptation for Person Re‑identification (CORE‑ReID)”, to address Unsupervised Domain Adaptation (UDA) for Person Re‑identification (ReID). The framework utilizes CycleGAN to generate diverse data that harmonize differences in image characteristics from different camera sources in the pre‑training stage. In the fine‑tuning stage, based on a pair of teacher–student networks, the framework integrates multi‑view features for multi‑level clustering to derive diverse pseudo‑labels. A learnable Ensemble Fusion component that focuses on fine‑grained local information within global features is introduced to enhance learning comprehensiveness and avoid the ambiguity associated with multiple pseudo‑labels. Experimental results on three common UDA benchmarks for Person ReID demonstrated significant performance gains over state‑of‑the‑art approaches. Additional enhancements, such as the Efficient Channel Attention Block and Bidirectional Mean Feature Normalization, mitigate deviation effects, and the adaptive fusion of global and local features in the ResNet‑based model further strengthens the framework. The proposed framework ensures clarity in fusion features, avoids ambiguity, and achieves high accuracy in terms of Mean Average Precision, Top‑1, Top‑5, and Top‑10, positioning it as an advanced and effective solution for UDA in Person ReID.


Introduction
In the context of Person Re-identification (ReID) [1], where the objective is to match images of individuals (e.g., a pedestrian or suspect) across non-overlapping camera views, efficient and accurate identification has significant implications for applications in smart cities and large-scale surveillance systems [2][3][4]. Recent advancements in deep learning techniques have shown promising improvements in ReID performance [5,6]. However, these techniques often require a substantial amount of labeled data for effective training, which limits their applicability in real-world settings. The reliance on labeled data for training poses constraints, particularly in scenarios where manual labeling is resource-intensive and expensive.
The inherent limitations of supervised strategies stem from the need for manually labeled cross-view training data, a resource-intensive process that incurs significant expenses [7,8]. In Person ReID, these limitations become particularly marked due to two primary reasons: (1) the reliability of manual labeling diminishes when dealing with a large number of images across multiple camera views, and (2) the exorbitant cost in terms of both time and money poses a formidable barrier to labeling the vast amount of data spanning disjoint camera views [9,10]. Consequently, in practical scenarios, the applicability of supervised methods is limited, especially when confronted with a substantial amount of unlabeled data in a new context.
A viable solution is to adapt a model trained on a labeled source domain to an unlabeled target domain, a problem known as Unsupervised Domain Adaptation (UDA) in Person ReID. However, this remains a formidable task due to the data distribution gap and the presence of non-overlapping identities between the source and target domains. Notably, prevalent UDA Person ReID methods [10][11][12][13], often referred to as "fine-tuning" approaches, first pre-train the model on the labeled source domain and then apply clustering algorithms and similarity measurements to generate pseudo-labels on the unlabeled target domain to further refine the model. Despite their effectiveness in improving performance on the target domain, these methods tend to overlook the influence of camera variations in the source domain, which significantly affect the performance of the pre-trained model before the fine-tuning stage. Our approach is consistent with the "fine-tuning" concept, but we emphasize increasing camera awareness in the initial stage when training the model on the labeled data. Our motivation stems primarily from the necessity for a large amount of data in deep learning-based Person ReID: annotating large-scale datasets to develop reliable features that can handle camera variations is beneficial, but it is also prohibitively expensive.
This study introduces the Comprehensive Optimization and Refinement through Ensemble Fusion in Domain Adaptation for Person Re-identification (CORE-ReID) framework to refine the model on the target-domain dataset during the fine-tuning stage. While Self-Similarity Grouping (SSG) [14] and Learning Feature Fusion (LF²) [15] explore the use of both local and global features for UDA in Person ReID, they face certain challenges. First, SSG uses a single network for feature extraction in clustering, which is susceptible to generating numerous noisy pseudo-labels. In addition, it performs clustering based on global and local features independently, so unlabeled samples acquire multiple different pseudo-labels, leading to ambiguity in identity classification during training. Second, LF² adopts an approach similar to the channel attention module of the Convolutional Block Attention Module (CBAM) [16] in its fusion module; however, the simplicity of the CBAM design may not be optimal. Moreover, LF² does not optimize the final features using horizontally flipped images, which may result in suboptimal attention maps for distinguishing identity-related features from background features. In contrast, CORE-ReID addresses these limitations by incorporating horizontally flipped images and employing an Efficient Channel Attention Block (ECAB). The advantage of the ECAB is that it enhances feature representation through attention mechanisms that emphasize important and deterministic features for re-identification. Our model derives a channel attention map by exploiting inter-channel relationships within features, which serves as a feature detector. In addition, Bidirectional Mean Feature Normalization (BMFN) is used to fuse features from the original and flipped images. These enhancements make CORE-ReID a promising approach to bridging the gap between supervised and unsupervised methods in Person ReID and provide valuable insights into UDA for this domain, potentially paving the way for future advancements.
Experimental results conducted on three widely used UDA Person ReID datasets demonstrate that our method outperforms state-of-the-art approaches. To summarize, our study makes the following major contributions:

• Novel Dynamic Fine-Tuning Approach with Camera-Aware Style Transfer: We introduce a pioneering fine-tuning strategy that employs a camera-aware style transfer model for ReID data augmentation. This novel approach not only addresses disparities in images captured by different cameras but also mitigates the impact of Convolutional Neural Network (CNN) overfitting on the source domain;

Related Work
This section reviews related research, including work in the field of Unsupervised Domain Adaptation (UDA) for Person ReID and knowledge transfer through methods such as knowledge distillation, highlighting approaches that aim to transfer expertise from well-trained models to enhance learning and adaptation in challenging domain scenarios. UDA methods can be categorized into three main groups: style-transferred source-domain images, clustering-based approaches, and feature alignment methods. Each category presents unique strategies and challenges in adapting models to different domains, showcasing innovative techniques such as style transfer for domain-invariant features, iterative clustering for refined representations, and attribute alignment for knowledge transfer. Despite their successes, these methods face obstacles such as image quality dependencies, noise in pseudo-labels, and limited adaptability to domain shift.

Unsupervised Domain Adaptation for Person ReID
UDA has attracted significant interest due to its ability to reduce the need for costly manual annotation. It effectively leverages labeled data from a source domain to improve performance on a target domain without requiring target-specific annotations. Generally, UDA methods fall into three main categories: style-transferred source-domain images, clustering-based methods, and feature alignment methods.
Style-transferred source-domain images: This category focuses on learning domain-invariant features using style-transferred source-domain images. The main idea is to transfer low- and mid-level target-domain characteristics, such as background, illumination, resolution, and clothing, to the images in the source domain. Techniques such as SPGAN [17], PTGAN [18], and PDA-Net [19] operate by transferring source-domain images to mimic the visual style of the target domain while preserving the underlying person identities. These style-transferred images, along with their corresponding identity labels, are then used to fine-tune the model for improved performance on the target domain. Another notable approach within this category is Hetero-Homogeneous Learning (HHL) [20], which focuses on learning camera-invariant features through the use of camera style-transferred images. By mitigating the influence of camera-specific variations, HHL aims to increase the model's ability to handle domain shifts. However, despite their effectiveness in matching visual styles across domains, these methods have limitations. Their retrieval performance is highly dependent on the quality of the generated images. Additionally, they often overlook the intricate relationships between different samples within the target domain, limiting their ability to capture the complex dynamics present in real-world scenarios.
Clustering-based methods: The second category, clustering-based approaches, continues to maintain state-of-the-art performance in the field. In particular, Fan et al. [12] introduced a method that alternately assigns labels to unlabeled training samples and optimizes the network using the generated targets. This iterative process facilitates effective domain adaptation by progressively refining the model's representations to align with the target domain. Building on this foundation, Lin et al. [21] proposed a bottom-up clustering framework supplemented by a repelled loss mechanism, which aims to improve the discriminative power of the learned representations while mitigating the effects of intra-cluster variation. Similarly, SSG [14] and LF² [15] contributed to this category by introducing techniques that assign pseudo-labels to both global and local features. Ge et al. [22] introduced the Mutual Mean-Teaching (MMT) method, which uses off-line refined hard pseudo-labels and on-line refined soft pseudo-labels in an alternative training approach to learn enhanced features from the target domain. This method enhances the model's ability to adapt to domain shifts by iteratively refining pseudo-labels and feature representations during training. In addition, Zheng et al. [23] established the Uncertainty-Guided Noise-Resilient Network (UNRN), which explores the credibility of predicted pseudo-labels of target-domain samples. By considering uncertainty estimates during training, the UNRN improves the model's robustness to noisy annotations and its performance in domain adaptation scenarios. By using information from both levels of abstraction, these methods achieve improved performance in capturing fine-grained distinctions within the target domain. However, despite their success, clustering-based methods face challenges related to the noise inherent in the hard pseudo-labels generated by clustering algorithms. This noise can significantly hinder the training of neural networks and is often not addressed by existing methods.
Feature alignment: The third category of domain adaptation methods aims to align common attributes in both the source and target domains to facilitate knowledge transfer. These attributes may include clothing items and other soft-biometric characteristics shared by both domains. By aligning mid-level features associated with these attributes, such methods enable the learning of higher-level semantic features in the target domain. For instance, TJ-AIDL [24] considers a fixed set of attributes for alignment. To enhance generalization, Lin et al. [25] proposed the Multi-task Mid-level Feature Alignment (MMFA) technique, which learns attributes from both domains and aligns them for improved generalization on the target domain. Furthermore, UCDA [26] and CASCL [27] aim to align attributes by considering images from different cameras within the target dataset. Wang et al. [24] proposed a model capable of simultaneously learning an attribute-semantic and identity-discriminative feature representation space that is transferable to the target domain, effectively aligning attribute-level features and improving the transferability of learned representations between domains. However, challenges arise due to differences in pedestrian classes between the two domains, making it difficult for the model to learn a common feature representation space.

Knowledge Transfer
Knowledge distillation, the process of transferring knowledge from a well-trained neural network (often referred to as the teacher model) to another model or network (referred to as the student model), has received significant attention in recent years [28][29][30]. The fundamental concept involves creating consistent training supervision for labeled and unlabeled data through the predictions of different models. For instance, the mean-teacher model introduced in [31] innovatively averages model weights across training iterations to generate supervision for unlabeled samples. In contrast, Deep Mutual Learning, proposed by Zhang et al. [32], diverges from the traditional teacher-student paradigm by employing a pool of student models that are trained collaboratively and supervise each other, promoting mutual learning and the exploration of different representations. Ge et al. proposed MMT [22], which adopts an alternative training approach utilizing both off-line refined hard pseudo-labels and on-line refined soft pseudo-labels. MEB-Net [33] utilizes three networks (six models) to perform mutual mean-teacher training to generate the pseudo-labels. However, despite their effectiveness, these methods face challenges. They rely heavily on pseudo-labels generated by the teacher network, which may be inaccurate or noisy, leading to suboptimal performance. Additionally, they may struggle to adapt effectively to significant domain shifts, especially when domains exhibit significant differences in lighting conditions, camera viewpoints, or background clutter, resulting in degraded performance.

Materials and Methods
We adopt the clustering-based method by separating the process into two stages: pre-training the model on the source domain in a fully supervised manner and fine-tuning the model on the target domain using an unsupervised learning approach (Figure 1). Our algorithm leverages a pair of teacher-student networks [34]. After training the model on a customized source-domain dataset, the parameters of this pre-trained model are copied to the student and teacher networks as initialization for the next stage. At the fine-tuning stage, we train the student model with Nesterov momentum and update the teacher model as a temporal average of the student's parameters. To reduce computation cost, only the teacher model is used for inference.

Camera-Aware Image-to-Image Translation on Source Domain Dataset
For any two unordered image collections X and Y, comprising training samples {x_i}_{i=1}^{N} with x_i ∈ X and {y_j}_{j=1}^{M} with y_j ∈ Y, the respective data distributions are denoted x ∼ p_data(x) and y ∼ p_data(y). When learning to translate images from a source domain X to a target domain Y, the objective of CycleGAN [35] is to acquire a mapping G : X → Y that renders the distribution of images produced by G indistinguishable from the distribution of Y, which is achieved through an adversarial loss. Because this mapping is inherently under-constrained, CycleGAN [35] introduces an inverse mapping F : Y → X, incorporating a cycle consistency loss to enforce F(G(X)) ≈ X and vice versa. CycleGAN further employs two adversarial discriminators, D_X and D_Y, where D_X discerns between images {x} and translated images {F(y)}, and D_Y distinguishes {y} from {G(x)}. By leveraging the GAN framework, CycleGAN jointly trains the generative and discriminative models. The comprehensive CycleGAN loss function is expressed as:

L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F),    (1)

where L_GAN(G, D_Y, X, Y) and L_GAN(F, D_X, Y, X) are the two adversarial losses corresponding to the mapping functions G and F and the discriminators D_Y and D_X. Additionally, L_cyc(G, F) represents the cycle consistency loss, compelling the reconstructions F(G(X)) ≈ X and G(F(Y)) ≈ Y for each image after a cycle mapping. The parameter λ controls the relative importance of L_cyc with respect to L_GAN, ensuring a balanced consideration of the adversarial and cycle consistency aspects. The method aims to solve:

G*, F* = arg min_{G,F} max_{D_X,D_Y} L(G, F, D_X, D_Y).

Further details of the CycleGAN framework can be found in [35].
Inspired by CamStyle [36], we incorporate CycleGAN to generate new training samples, treating the styles of different cameras as different domains. This involves learning image-to-image translation models with CycleGAN for images from the C different camera views in the Person ReID dataset. To ensure color consistency between input and output during style transfer, similar to the painting → photo application, we add the identity mapping loss proposed in [37] to the CycleGAN loss function (Equation (1)). This additional loss term compels the generators to approximate an identity mapping when real images from the target domain are used as input:

L_identity(G, F) = E_{y∼p_data(y)}[||G(y) − y||_1] + E_{x∼p_data(x)}[||F(x) − x||_1].

Without L_identity, the generators G and F would be free to alter the tint of input images unnecessarily. As a result, the total loss used for training is:

L_total = L(G, F, D_X, D_Y) + λ_identity L_identity(G, F).

Our approach differs from CamStyle, which addresses Person ReID by generating additional data on the training dataset and evaluating the model within the same dataset (e.g., Market-1501, CUHK03). We instead train on a source domain S and evaluate the algorithm during the fine-tuning stage on a different target domain T. This allows us to leverage the entirety of the data in S by incorporating test data into the training set, similar to DGNet++ [38].
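As a rough illustration of how the identity term enters the total objective, the assembly of the loss can be sketched in pure Python. The weights `lam_cyc = 10` and `lam_id = 5` and all component loss values below are illustrative assumptions, not values reported for this framework:

```python
def l1(a, b):
    """Mean absolute error between two flat feature/pixel vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cyclegan_total_loss(loss_gan_g, loss_gan_f, loss_cyc, loss_identity,
                        lam_cyc=10.0, lam_id=5.0):
    """Total objective: both adversarial terms plus weighted
    cycle-consistency and identity-mapping terms."""
    return loss_gan_g + loss_gan_f + lam_cyc * loss_cyc + lam_id * loss_identity

# Identity mapping: feeding a target-domain image y through G should
# approximately return y itself, so L_identity penalizes the deviation.
y = [0.2, 0.5, 0.9]
g_of_y = [0.25, 0.45, 0.9]          # hypothetical generator output
loss_id = l1(y, g_of_y)
total = cyclegan_total_loss(0.7, 0.6, 0.12, loss_id)
print(round(total, 4))
```

The point of the sketch is only the bookkeeping: the adversarial terms enter unweighted, while the cycle and identity terms are scaled by their balance parameters.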
For a source-domain dataset containing images from C different cameras, the number of generative models used to produce data for both X → Y and Y → X is C(C − 1). Consequently, the final training set comprises a blend of the original real images and the style-transferred images from both the training and test sets within the source-domain dataset (Figure 2). These style-transferred images seamlessly adopt the labels of the original real images. Figure 3 shows two examples each from the training and test data in the Market-1501 dataset. The styles of these instances have been altered based on the camera view, classifying the method as a data augmentation scheme. This approach serves the dual purpose of mitigating disparities in camera styles and diminishing the impact of overfitting in Convolutional Neural Networks (CNNs). Moreover, the incorporation of camera information helps the model learn pedestrian features with a camera-invariant property.
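The C(C − 1) count of ordered camera pairs (and hence generators) can be sanity-checked with a one-line helper; the camera counts correspond to the datasets discussed later in the paper:

```python
def num_generators(num_cameras: int) -> int:
    """Each ordered camera pair (c_i -> c_j, i != j) needs its own
    style-transfer generator, giving C(C - 1) models in total."""
    return num_cameras * (num_cameras - 1)

print(num_generators(6))   # Market-1501: 6 camera views -> 30 generators
print(num_generators(2))   # CUHK03 setting with 2 camera views -> 2
```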

Source-Domain Pre-Training

Fully Supervised Pre-Training
As many existing UDA approaches build on a model pre-trained on a source dataset, our pre-training adopts a similar setup to that described in [14,15,20,40]. We use ResNet101 trained on ImageNet as the backbone network (Figure 4). The last fully connected (FC) layer is removed, and two additional layers are introduced: a batch normalization layer with 2048 features, followed by an FC layer with M_S dimensions, where M_S is the number of identities (classes) in the source dataset S. In our case, when training:

M_S = M_S,train^original + M_S,test^original,

where M_S,train^original and M_S,test^original are the numbers of identities in the original training and test sets of S. For each labeled image x_S,i and its ground-truth identity y_S,i in the source domain D_S = {x_S,i, y_S,i}_{i=1}^{N_S}, with N_S the number of images, we train the model using the identity classification (cross-entropy) loss L_S,ID and the triplet loss L_S,triplet. The identity classification loss is applied to the last FC layer, treating the training process as a classification problem. The triplet loss is employed on the output features after batch normalization, treating the training process as a verification problem (refer to Figure 4). The loss functions are defined as follows:

L_S,ID = L_ce(C_S(f(x_S,i)), y_S,i),
L_S,triplet = max(0, m + ||f(x_S,i) − f(x_S,i^+)||_2 − ||f(x_S,i) − f(x_S,i^−)||_2),

where f(x_S,i) is the feature of the source image x_S,i, L_ce is the cross-entropy loss, C_S is a learnable source-domain classifier f(x_S,i) → {1, 2, ..., M_S}, ||·||_2 indicates the L2-norm distance, and x_S,i^+ and x_S,i^− denote the hardest positive and hardest negative samples for x_S,i in each mini-batch. The triplet distance margin is represented as m. With the balance parameter κ, the total loss used in source-domain pre-training is:

L_S = L_S,ID + κ L_S,triplet.

The model demonstrates good performance when trained with fully labeled data in the source domain. However, its direct application to the unlabeled target domain results in a significant drop in performance.
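A minimal sketch of the batch-hard triplet term on toy 2-D features may clarify the hardest-positive/hardest-negative mining; `margin` plays the role of m, and all feature values are made up for illustration:

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet(features, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, take the farthest positive
    and the closest negative within the mini-batch, then hinge at `margin`."""
    losses = []
    for i, (f, y) in enumerate(zip(features, labels)):
        pos = max(l2(f, g) for j, (g, z) in enumerate(zip(features, labels))
                  if z == y and j != i)          # hardest positive
        neg = min(l2(f, g) for g, z in zip(features, labels) if z != y)
        losses.append(max(0.0, margin + pos - neg))
    return sum(losses) / len(losses)

# Two identities, two images each, laid out on a line for easy checking.
feats = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
labels = [0, 0, 1, 1]
print(round(batch_hard_triplet(feats, labels), 4))
```

Here every anchor's hardest positive is distance 2 away and its closest negative distance 1 away, so each anchor contributes margin + 2 − 1.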
Prior to inputting the image into the network, we perform preprocessing by resizing the image to a specific size and applying various data augmentation techniques, including random horizontal flipping, random cropping, and edge padding. Additionally, we incorporate random color dropouts (random grayscale patch replacement) [41] to mitigate color deviation while preserving information, thereby reducing overfitting and enhancing the model's generalization capability.

Implementation Details
For the camera-aware image-to-image translation used to generate synthetic data, we train 30 and 2 generative models for Market-1501 and CUHK03, respectively, following C(C − 1): 6 × (6 − 1) = 30 and 2 × (2 − 1) = 2. Throughout training, we resize all input images to 286 × 286 and then crop them to 256 × 256. We employ the Adam optimizer [42] to train the models from scratch in all experiments, with a batch size of 8. The learning rate is initialized at 0.0002 for the Generator and 0.0001 for the Discriminator for the first 30 epochs and is linearly reduced to near zero over the remaining 20 epochs according to the lambda learning-rate schedule. In the camera-aware style transfer step, we generate C − 1 additional fake training images (5 for Market-1501 and 1 for CUHK03) while preserving their original identity, thus augmenting the training data.
As the backbone, ResNet101 is adopted; the initial learning rate is set to 0.00035 and decayed by a factor of 0.1 at the 40th and 70th epochs. There are 120 training epochs in total, with 10 warmup epochs. We randomly sample 32 identities and 4 images per person to form a training batch, giving a final batch size of 128. In the pre-processing step, we resize each image to 256 × 128 pixels, pad the resized image by 10 pixels using edge padding, and then randomly crop it back to a 256 × 128 rectangular image. Augmentation is also applied, including random horizontal flipping and random color dropouts [41] with probabilities of 0.5 and 0.4, respectively. Each image is decoded into 32-bit floating-point raw pixel values in [0, 1]. We then normalize the RGB channels by subtracting 0.485, 0.456, 0.406 and dividing by 0.229, 0.224, 0.225, respectively. The balance parameter κ is set to 1.
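The per-channel normalization step can be sketched as follows; the mean/std values are the ImageNet statistics quoted above, and `normalize_pixel` is a hypothetical helper name, not part of the framework:

```python
def normalize_pixel(rgb, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Per-channel normalization applied after scaling raw pixels to [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, mean, std))

# A mid-gray pixel (128/255 per channel) after normalization:
gray = tuple(128 / 255 for _ in range(3))
print([round(c, 4) for c in normalize_pixel(gray)])
```

A pixel exactly at the channel mean maps to 0 on that channel, which is the point of the centering.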

Target-Domain Fine-Tuning
In this phase, we use the pre-trained model to perform comprehensive optimization. We present our CORE-ReID framework (Figure 5) along with the Efficient Channel Attention Block (ECAB) in the Ensemble Fusion module and Bidirectional Mean Feature Normalization (BMFN). Inspired by the techniques of SSG [14] and LF² [15], our aim is to enable the model to dynamically fuse global and local (top and bottom) features, resulting in feature representations that capture both global and local information. Additionally, by constructing multiple clusters based on global and fused features, we aim to generate more consistent pseudo-labels, thus preventing ambiguous learning. To refine the noisy pseudo-labels, we adopt a pair of teacher-student networks based on the mean-teacher approach [34]. We input the same unlabeled image from the target domain to both the teacher and student networks. In the current iteration i, the parameters ρ_ς of the student network are adjusted through back-propagation with Nesterov momentum during training on the target domain, while the parameters ρ_τ of the teacher network are computed as the temporal average of ρ_ς, which can be expressed as:

ρ_τ^(i) = η ρ_τ^(i−1) + (1 − η) ρ_ς^(i),

where η ∈ [0, 1) denotes the temporal ensemble momentum.
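The temporal-average (mean-teacher) update above can be sketched directly; the parameter vectors and the η value are toy assumptions:

```python
def ema_update(teacher, student, eta=0.999):
    """Temporal-average (mean-teacher) update: each teacher parameter is an
    exponential moving average of the corresponding student parameter."""
    return [eta * t + (1.0 - eta) * s for t, s in zip(teacher, student)]

teacher = [0.0, 1.0]
student = [1.0, 0.0]
teacher = ema_update(teacher, student, eta=0.9)
print([round(t, 4) for t in teacher])  # → [0.1, 0.9]
```

With η close to 1 (0.999 in the implementation details), the teacher changes slowly, smoothing out noise in the student's updates.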

Ensemble Fusion Module and Overall Algorithm
To derive the fusion features, we horizontally split the last global feature map of the student network into two parts (top and bottom), yielding ς_top and ς_bottom after global average pooling. The last global feature map τ_global of the teacher network remains intact without any partition. These features, namely ς_top and ς_bottom from the student network and τ_global from the teacher network, are selected for adaptive learning feature fusion through the Ensemble Fusion module, which incorporates learnable parameters. In line with the approach of LF², we design the Ensemble Fusion module (Figure 6), wherein we initially utilize ς_top and ς_bottom along with the spatial information of the student network's local features. The inputs ς_top and ς_bottom are forwarded to the ECAB for adaptive learning fusion. Each enhanced attention map (ψ_top and ψ_bottom) output from the ECAB is merged with τ_global through element-wise multiplication to generate the Ensemble Fusion feature maps τ_global^top and τ_global^bottom. Subsequently, after applying Global Average Pooling (GAP) and batch normalization (BN), we obtain the fusion features θ_top and θ_bottom, which are input to BMFN for predicting pseudo-labels using clustering algorithms in subsequent steps. The overall process in Ensemble Fusion can be expressed as:

ψ_l = ECAB(ς_l),  τ_global^l = ψ_l ⊗ τ_global,  θ_l = BN(GAP(τ_global^l)),  l ∈ {top, bottom},

where ⊗ denotes element-wise multiplication. We apply the mini-batch K-means algorithm in clustering to predict the pseudo-labels. Consequently, each x_T,i has three pseudo-labels (global, top, and bottom, respectively).
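A simplified sketch of the splitting and pooling steps follows, with the attention reduced to a per-channel vector applied to the pooled global feature (a deliberate simplification of ψ ⊗ τ_global, which in the framework operates on full feature maps):

```python
def gap(fmap):
    """Global average pooling over a [C][H][W] feature map -> C-dim vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def split_top_bottom(fmap):
    """Horizontally split each channel into top and bottom halves."""
    h = len(fmap[0]) // 2
    return [ch[:h] for ch in fmap], [ch[h:] for ch in fmap]

def fuse(attention, global_feat):
    """Channel-wise product of an attention vector with the teacher's pooled
    global feature, standing in for psi ⊗ tau_global."""
    return [a * g for a, g in zip(attention, global_feat)]

# Toy student feature map: 2 channels, 4x2 spatial grid.
student = [[[1, 1], [1, 1], [3, 3], [3, 3]],
           [[2, 2], [2, 2], [0, 0], [0, 0]]]
top, bottom = split_top_bottom(student)
print(gap(top), gap(bottom))  # → [1.0, 2.0] [3.0, 0.0]
```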
We denote the target-domain data as D_T = {x_T,i, ŷ_T,i,j}_{i=1}^{N_T}, with j ∈ {global, top, bottom} and N_T the number of images in the target dataset T. Here ŷ_T,i,j ∈ {1, 2, ..., M_T,j} is the pseudo-label of the target-domain image x_T,i obtained from the clustering result Ŷ_j = {ŷ_T,i,j | i = 1, 2, ..., N_T} of the feature combined with its flipped image x′_T,i output from BMFN (φ_l, l ∈ {top, bottom}). M_T,j denotes the number of identities (classes) in Ŷ_j.
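Pseudo-label assignment can be illustrated with a plain K-means toy; the framework uses mini-batch K-means, so this pure-Python version with naive spread initialization is only a sketch:

```python
def kmeans(points, k=2, iters=10):
    """Minimal K-means for assigning cluster IDs (pseudo-labels) to pooled
    feature vectors. Naive init: evenly spaced points as initial centers."""
    centers = [list(points[i * (len(points) - 1) // max(1, k - 1)])
               for i in range(k)]
    for _ in range(iters):
        # Assign each point to its nearest center (squared L2 distance).
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centers[c])))
                  for pt in points]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

feats = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
print(kmeans(feats))  # → [0, 0, 1, 1]
```

In the framework this assignment is run three times per image, once each on the global, top, and bottom features, yielding the three pseudo-labels.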
Before calculating the loss functions, we apply BMFN to obtain the optimized features from the networks, f_ς^j and f_τ^j with j ∈ {global, top, bottom}, and φ_l with l ∈ {top, bottom} from Ensemble Fusion. After acquiring the multiple pseudo-labels, we obtain three new target-domain datasets for training the student network. The pseudo-labels generated from the local fusion features φ_l, l ∈ {top, bottom}, are used to compute the soft-max triplet loss for the corresponding local features f_ς^l of the student network, where ρ_τ and ρ_ς are the parameters of the teacher and student networks, and the identity loss is computed through C_T, the fully connected classification layer of the student network. With weighting parameters α, β, γ, and δ, the total loss can be calculated as:

L_T = α L_T,ID + β L_T,tri^global + γ L_T,tri^top + δ L_T,tri^bottom.

During the inference phase, the Ensemble Fusion process is omitted, and only the optimized teacher network is used, saving computation cost. In detail, the global feature map from the teacher network is segmented into two parts, top and bottom (the teacher thus also serves the role of the student network). Following global average pooling, the two resulting local features and the global feature are concatenated. Subsequently, L2 normalization and BMFN are applied to obtain the final optimal feature for inference.
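The inference-time descriptor construction (concatenate the global and two local features, then L2-normalize) might look like the sketch below; feature dimensions are toy-sized and `inference_feature` is a hypothetical helper name:

```python
import math

def inference_feature(global_feat, top_feat, bottom_feat):
    """Inference-time descriptor: concatenate the global and the two pooled
    local features, then L2-normalize the result."""
    f = list(global_feat) + list(top_feat) + list(bottom_feat)
    norm = math.sqrt(sum(v * v for v in f)) or 1.0  # guard the zero vector
    return [v / norm for v in f]

feat = inference_feature([3.0], [0.0], [4.0])
print(feat)  # → [0.6, 0.0, 0.8]
```

After normalization, cosine similarity between two descriptors reduces to a dot product, which is the usual reason for the L2 step before retrieval.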

Efficient Channel Attention Block (ECAB)
The importance of attention has been extensively explored in previous literature [43]. Attention not only guides where to focus but also enhances the representation of relevant features. Inspired by the CBAM, we introduce the ECAB, a straightforward yet impactful attention module for feed-forward Convolutional Neural Networks. This module enhances representation power through attention mechanisms that emphasize crucial features while suppressing unnecessary ones. We generate a channel attention map by leveraging inter-channel relationships within features. Each channel of a feature map serves as a feature detector, and channel attention directs focus towards the most meaningful aspects of an input image. To compute channel attention efficiently, we compress the spatial dimension of the input feature map.
While average-pooling has been commonly used to aggregate spatial information, Zhou et al. [44] suggest its effectiveness in learning the extent of objects. We therefore utilize both average-pooling and max-pooling features simultaneously. Figure 7 shows the design of the ECAB. Given an intermediate input feature map ς ∈ R^{C×W×H}, where C, W, and H denote the number of channels, width, and height, respectively, we perform max-pooling and average-pooling and feed the outputs ς_max and ς_avg into a Shared Multilayer Perceptron (SMLP), obtaining the refined features ς_SMLP^max and ς_SMLP^avg. The SMLP has multiple hidden layers with reduction rate r and the same expansion rate, using the ReLU activation function: the first (h − 1)/2 layers reduce the dimension with reduction rate r, and the last (h − 1)/2 layers expand it with the same rate r. The enhanced attention map ψ ∈ R^{C×1×1} is calculated as:

ψ = ς_σ = σ(ς_SMLP^max + ς_SMLP^avg),

where σ denotes the sigmoid function.
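A toy version of the ECAB channel attention with a single reduce layer and a single expand layer (the actual block uses (h − 1)/2 of each, and the weights below are illustrative, not learned values):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ecab_attention(ch_max, ch_avg, w_reduce, w_expand):
    """Toy ECAB channel attention, no biases:
    psi = sigmoid(SMLP(max-pooled) + SMLP(avg-pooled)), weights shared."""
    def smlp(v):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v)))   # ReLU
                  for row in w_reduce]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w_expand]
    s = [a + b for a, b in zip(smlp(ch_max), smlp(ch_avg))]
    return [sigmoid(v) for v in s]

# 2 channels reduced to 1 hidden unit and expanded back.
w_reduce = [[0.5, 0.5]]
w_expand = [[1.0], [1.0]]
psi = ecab_attention([1.0, 1.0], [0.5, 0.5], w_reduce, w_expand)
print([round(p, 4) for p in psi])
```

The sigmoid bounds each channel weight in (0, 1), so multiplying ψ into a feature map can only attenuate or pass channels, never amplify them unboundedly.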

Bidirectional Mean Feature Normalization (BMFN)
The use of horizontally flipped images has been studied in [45][46][47]. We assume that an image can be captured from the opposite direction (left and right). By using the flipped image in training, the model can focus on identity-related features and ignore background features.
Given an image x_T,i in the target-domain dataset and its flipped image x′_T,i, we obtain the feature map F_j^m and its paired flipped image's feature map F′_j^m, with j ∈ {global, top, bottom} and m ∈ {ς, τ}. The output of BMFN is the mean of the paired features:

F_BMFN,j^m = (F_j^m + F′_j^m) / 2.

The optimal feature maps on the Ensemble Fusion branch, θ_l with l ∈ {top, bottom}, are obtained by applying BMFN in the same manner.
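Under the assumption that BMFN averages the paired features, as the name suggests (any subsequent normalization is omitted here), a minimal sketch is:

```python
def bmfn(feat, feat_flipped):
    """Bidirectional mean feature: average the features of an image and its
    horizontal flip, so left/right-asymmetric background cues cancel out."""
    return [(a + b) / 2.0 for a, b in zip(feat, feat_flipped)]

print(bmfn([1.0, 3.0], [3.0, 1.0]))  # → [2.0, 2.0]
```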

Detailed Implementation
Our training regimen spans 80 epochs, with each epoch consisting of 400 iterations. Throughout training, the learning rate remains fixed at 0.00035, and we employ the Adam optimizer with a weight decay of 0.0005 to facilitate stable convergence. The K-means algorithm is used to initialize cluster centers effectively. For temporal ensemble regularization, we set the momentum parameter η to 0.999. To balance the various components of our loss function, we assign the weights α = 1, β = 1, γ = 0.5, and δ = 0.5. In our Ensemble Fusion block, we use a reduction/expansion rate r of 4, and the number of hidden layers h is set to 5. As in the pre-training stage, the same preprocessing is applied. In addition, random erasing with a probability of 0.5 is conducted in the fine-tuning stage.

Results
In this section, we present experimental results compared to state-of-the-art (SOTA) methods on popular datasets for the task of UDA for Person ReID.
Market-1501 comprises 32,668 photos of 1501 individuals captured from six different camera views. The training set encompasses 12,936 images representing 751 identities. The testing set includes 3368 query images and 19,732 gallery images covering the remaining 750 identities.
CUHK03 consists of 14,097 images depicting 1467 unique identities, captured by six campus cameras, with each identity recorded by two cameras. This dataset offers two types of annotations: manually labeled bounding boxes and those generated by an automatic detector. We employed the manually annotated bounding boxes for both training and testing. Additionally, we adopted the more rigorous testing protocol proposed in [40] for CUHK03. This protocol splits the dataset into 767 identities (7365 images) for training and 700 identities (5332 gallery images and 1400 query images) for testing.
MSMT17, a large-scale dataset, consists of 126,441 bounding boxes representing 4101 identities captured by 12 outdoor and 3 indoor cameras (15 cameras in total) during three periods of the day (morning, noon, and afternoon) on four different days. The training set incorporates 32,621 images showcasing 1041 identities, while the testing set consists of 93,820 images featuring 3060 identities used for evaluation. The testing set is divided into 11,659 images for the query set and 82,161 images for the gallery set. Notably, MSMT17 surpasses both Market-1501 and CUHK03 in scale.
A comprehensive overview of the open datasets utilized in this paper is presented in Table 1. Experimental results can be found in Tables 2 and 3.

Benchmark
Our study initially compared CORE-ReID against SOTA methods on two domain adaptation tasks: Market → CUHK and CUHK → Market (Table 2). We then expanded our evaluation to two additional tasks: Market → MSMT and CUHK → MSMT (Table 3). "Baseline" denotes the variant that uses only the global feature, without ECAB and BMFN; CORE-ReID denotes our full framework. The evaluation metrics are mAP (%) and rank-k accuracy (R-k, %).
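The rank-k metric above can be sketched as follows; note that the full ReID evaluation protocol additionally filters out gallery matches from the same camera as the query, which this sketch omits:

```python
import numpy as np

def rank_k_accuracy(dist, query_ids, gallery_ids, k):
    """Fraction of queries whose k nearest gallery images contain a correct ID.
    dist: (num_query, num_gallery) distance matrix, smaller = more similar."""
    order = np.argsort(dist, axis=1)  # nearest gallery images first
    hits = sum(
        int(np.any(gallery_ids[order[q, :k]] == query_ids[q]))
        for q in range(dist.shape[0])
    )
    return hits / dist.shape[0]

# Toy example: two queries, two gallery images.
dist = np.array([[0.1, 0.9],
                 [0.8, 0.2]])
r1 = rank_k_accuracy(dist, np.array([1, 2]), np.array([1, 2]), 1)
```

mAP additionally averages the precision over all correct matches per query rather than checking only the top-k.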
The results demonstrate that our framework, by integrating the Ensemble Fusion component, ECAB, and BMFN, significantly outperforms existing SOTA methods, validating the effectiveness of our approach. In particular, we observed significant improvements over PAOA+, with margins of 12.6% and 6.0% mAP on the Market → CUHK and CUHK → Market tasks, respectively, despite PAOA+ incorporating additional training data.

Ablation Study
Feature Map Visualization: To verify our method, we visualized Grad-CAM [73] feature maps at the global feature level. Important features of each person are represented as heatmaps, as shown in Figure 8. The rainbow color scale ranges from the least important (blue) to the most important (red) regions used for Person Re-identification. In the Market → CUHK and CUHK → Market scenarios (Figure 8a,b), important features are concentrated on the target person's body, and the heatmaps show identical distributions for the original and flipped images. This observation is consistent with the accuracy of our method shown in Table 2. In the Market → MSMT and CUHK → MSMT scenarios, on the other hand, the Market → MSMT model extracts important features slightly better, with the heatmap distributed over the middle and lower body regions in both the original and flipped images. This could explain the higher accuracy achieved by the Market → MSMT model over the CUHK → MSMT model, as shown in Table 3.
K-means Clustering Settings: We used the K-means approach for clustering to generate pseudo-labels on the target domain. The settings varied depending on the datasets.
As shown in Table 4, our framework performs best on Market → CUHK, CUHK → Market, Market → MSMT, and CUHK → MSMT with settings of 900, 900, 2500, and 2500, respectively. Figure 9 shows that the performance of our method varies depending on the dataset pairs and the clustering parameter values (M T,j) used.
ECAB and BMFN Settings: ECAB improves representation power by using attention mechanisms to highlight important features and suppress irrelevant ones. To validate its effectiveness, we performed an experiment in which it is removed from our network, as shown in Table 5. As mentioned earlier, we used BMFN to merge features from the original image and its flipped counterpart, allowing the model to concentrate on ID-related features while disregarding background features. Table 5 demonstrates that incorporating BMFN enhances accuracy.
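A minimal sketch of pseudo-label generation with plain k-means (the deterministic initialization here is illustrative only; the framework's actual cluster-center initialization may differ):

```python
import numpy as np

def kmeans_pseudo_labels(features, num_clusters, iters=20):
    """Assign pseudo-labels to target-domain features with plain k-means."""
    # Illustrative deterministic initialization: evenly spaced samples.
    idx = np.linspace(0, len(features) - 1, num_clusters).astype(int)
    centers = features[idx].astype(float)
    for _ in range(iters):
        # Distance of every feature to every center -> (N, num_clusters).
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(num_clusters):
            members = features[labels == c]
            if len(members):          # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels

# Two well-separated toy blobs should receive two distinct pseudo-labels.
features = np.vstack([np.zeros((5, 2)), 10.0 * np.ones((5, 2))])
labels = kmeans_pseudo_labels(features, 2)
```

In the framework, the number of clusters corresponds to the pseudo-identity setting (e.g., 900 or 2500) studied in Table 4.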
Figure 10 shows that utilizing ECAB and BMFN in our framework leads to performance improvements.
Backbone Configurations: We also evaluated the performance of different backbone architectures (ResNet50, ResNet101, and ResNet152) for modeling the network for Unsupervised Domain Adaptation in Person ReID. By systematically comparing these models, we aimed to identify the most effective backbone architecture for our task. Through extensive experimentation and analysis, we gained valuable insights into the impact of the backbone architecture on the overall performance of the framework, as shown in Table 6. The ResNet101 setting gives the best performance in both the Market → CUHK and CUHK → Market scenarios. All of these experiments were performed on two machines, each with dual Quadro RTX 8000 GPUs (Nvidia Corporation, California, US).

Conclusions
In this paper, we present a multifaceted approach to solving the problem of UDA for Person ReID. First, we propose a dynamic fine-tuning strategy that employs a camera-aware style transfer model to augment ReID data. This not only mitigates disparities in camera styles but also combats CNN overfitting on the source domain. Additionally, we introduce an Efficient Channel Attention Block (ECAB) that leverages inter-channel relationships to prioritize meaningful structures in input images, improving feature extraction. Furthermore, we establish the Comprehensive Optimization and Refinement through Ensemble Fusion (CORE-ReID) framework, which utilizes a pair of teacher-student networks to fuse global and local features adaptively, generating diverse pseudo-labels for multi-level clustering. Finally, we incorporate a Bidirectional Mean Feature Normalization (BMFN) module to enhance feature-level discriminability.
In addition to achieving SOTA performance, our method notably narrows the gap between supervised and unsupervised performance in Person ReID. We expect our approach to offer valuable insights into UDA for Person ReID, potentially paving the way for further advances in the field.
However, our approach has limitations. A major challenge is the dependence on the quality of the camera-aware style transfer model, which can affect overall performance if not properly optimized. Additionally, the complexity of our CORE-ReID framework may lead to increased computational cost and training time. Future work will focus on optimizing the efficiency of the style transfer model and simplifying the framework without sacrificing performance. We also plan to explore more advanced techniques for noise reduction in pseudo-labels to further enhance the robustness of our model.

Figure 1 .
Figure 1. The model proposed in this study. First, the model is trained on a customized source-domain dataset; subsequently, the parameters of this pre-trained model are transferred to both the student and teacher networks as an initialization step for the next stage. During fine-tuning, we train the student model and then update the teacher model using momentum updates. To optimize computational resources, only the teacher model is used for inference.

Figure 2 .
Figure 2. Our pipeline for creating the full training set for the source domain. Initially, we combine both the training set (green boxes) and the test set (dark green boxes) within the source dataset to form the total training set of real images. This combined set is then used to train the camera-aware style transfer model. For each real image, the trained transfer model is applied to generate images (blue boxes for the training set and dark blue boxes for the test set) that align with the stylistic characteristics of the target cameras. Subsequently, the real images (green and dark green boxes) and the style-transferred images (blue and dark blue boxes) are merged to produce the final training set for the source domain.

Figure 3 .
Figure 3. Some style-transferred samples from Market-1501 [39]. Each image, originally taken by a specific camera, is transformed to align with the styles of the other five cameras, in both the training and test data. The real images are shown on the left, while their corresponding style-transferred counterparts are shown on the right.

Figure 4 .
Figure 4. The overall training process in the fully supervised pre-training stage. ResNet101 is used as the backbone in our training process.

Figure 5 .
Figure 5. An overview of our CORE-ReID framework. We combine local and global features using Ensemble Fusion. The ECAB in Ensemble Fusion promotes the enhancement of the features. Using BMFN, the framework merges the features of the original image x T,i and its paired flipped image x′ T,i, then produces the fusion feature φ l, l ∈ {top, bottom}. The student network is optimized using pseudo-labels in a supervised manner, while the teacher network is updated by computing the temporal average of the student network via the update momentum. The orange rounded rectangles indicate the steps where the features of the flipped image are used in the same way as those of the original image, up until the application of BMFN.

Figure 6 .
Figure 6. The Ensemble Fusion component. The ς top and ς bottom features are passed through the ECAB to produce channel attention maps by exploiting the inter-channel relationship of features, which helps to enhance the features.

Figure 7 .
Figure 7. The structure of our ECAB. The Shared Multilayer Perceptron has an odd number h of hidden layers, where the first (h−1)/2 layers are reduced in size with reduction rate r, and the last (h−1)/2 layers are expanded at the same rate r.

Figure 11 shows that the performance does not vary much across different backbone configurations, indicating the stability of our framework regardless of the settings used.

Figure 11 .
Figure 11. Impact of the backbone configurations. Results on (a) Market → CUHK and (b) CUHK → Market show that the ResNet101 backbone gives the best overall results. The evaluation metrics are mAP (%) and rank-k accuracy (R-k, %).

• CORE Framework with Ensemble Fusion of Global and Local Features:
We establish the CORE (Comprehensive Optimization and Refinement through Ensemble Fusion) framework, which utilizes a novel pair of teacher-student networks to perform an adaptive fusion of global and local (top and bottom) features for multi-level clustering, with the objective of generating diverse pseudo-labels. By proposing Bidirectional Mean Feature Normalization (BMFN), the model increases its discriminability at the feature level and addresses key limitations of existing methods.
l ∈ {top, bottom}. x+ T,i and x− T,i are the hardest positive and negative samples of the anchor target-domain image x T,i, respectively. As in the supervised learning approach, we use the clustering result Ŷ global of the global feature ς global.
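A sketch of the hardest-positive/hardest-negative selection described above (the margin value is illustrative, not taken from the paper):

```python
import numpy as np

def hard_triplet_loss(anchor, positives, negatives, margin=0.3):
    """Triplet loss using the hardest positive (farthest same-ID sample) and
    hardest negative (closest different-ID sample) of the anchor.
    The margin value is an illustrative assumption."""
    d_pos = np.linalg.norm(positives - anchor, axis=1).max()  # hardest positive
    d_neg = np.linalg.norm(negatives - anchor, axis=1).min()  # hardest negative
    return max(0.0, margin + d_pos - d_neg)

# Toy usage: a well-separated negative gives zero loss.
anchor = np.array([0.0, 0.0])
loss_easy = hard_triplet_loss(anchor, np.array([[0.0, 0.1]]), np.array([[5.0, 0.0]]))
loss_hard = hard_triplet_loss(anchor, np.array([[0.0, 0.1]]), np.array([[0.2, 0.0]]))
```

Here positives and negatives would be drawn from the same and different pseudo-label clusters, respectively.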

Table 1 .
Details of datasets used in this manuscript.

Table 2 .
Experimental results of the proposed CORE-ReID framework and SOTA methods (Acc %) on the Market-1501 and CUHK03 datasets. Bold denotes the best and underline indicates the second-best results. a indicates that the method uses multiple source datasets.

Table 3 .
Experimental results of the proposed CORE-ReID framework and SOTA methods (Acc %) from the Market-1501 and CUHK03 source datasets to the target MSMT17 dataset. Bold denotes the best and underline indicates the second-best results. a indicates that the method uses multiple source datasets; b denotes that the implementation is based on the authors' code.

Table 4 .
Experimental results for different settings of the number of pseudo-identities in the K-means clustering algorithm. Bold denotes the best results.

Table 5 .
Experimental results validating the effectiveness of ECAB and BMFN in our proposed framework. The clustering parameter values (M T,j) are taken from the study of K-means clustering settings. Bold denotes the best results.

Table 6 .
Experimental results for different ResNet backbone settings in the Market → CUHK and CUHK → Market scenarios. Bold denotes the best results.