A Multi-Attention Approach for Person Re-Identification Using Deep Learning

Person re-identification (Re-ID) is the task of identifying the same individual across several non-overlapping cameras. It has been applied successfully in a range of computer vision applications. With the emergence of deep learning algorithms, person Re-ID techniques, which often involve an attention module, have achieved remarkable success. However, people's traits are often similar, which makes distinguishing between them difficult. This paper presents a novel approach to person Re-ID: a multi-part feature network that combines the position attention module (PAM) and efficient channel attention (ECA). The goal is to enhance the accuracy and robustness of person Re-ID methods through attention mechanisms. The proposed multi-part feature network employs the PAM to extract robust and discriminative features by utilizing channel, spatial, and temporal context information. The PAM learns the spatial interdependencies of features and extracts a greater variety of contextual information from local elements, enhancing their representational capacity. The ECA captures local cross-channel interaction and reduces the model's complexity while maintaining accuracy. Comprehensive experiments were conducted on three publicly available person Re-ID datasets: Market-1501, DukeMTMC-reID, and CUHK03. The results show that the proposed method outperforms existing state-of-the-art methods in rank-1 accuracy, reaching 95.93%, 89.77%, and 73.21% on Market-1501, DukeMTMC-reID, and CUHK03, respectively, and 96.41%, 94.08%, and 91.21% after re-ranking. The proposed method demonstrates a high generalization capability and improves both quantitative and qualitative performance. By combining the benefits of temporal, spatial, and channel information, the multi-part feature network with PAM and ECA offers a promising solution for person Re-ID, and the results of this study demonstrate its effectiveness and potential in computer vision applications.


Introduction
Person re-identification (Re-ID) is a computer vision task that aims to match a target individual across many camera views. It has become an increasingly significant field of research in recent years, particularly in surveillance and security. The main motivation for person Re-ID is to enable effective tracking of individuals in complex and crowded environments, such as airports, train stations, and public places [1,2]. However, person Re-ID faces several challenges that make high accuracy difficult to achieve. These challenges include variations in lighting conditions. Convolutional neural networks (CNNs) have proven to be an effective tool for addressing person Re-ID. They are capable of learning and capturing the discriminative features of the input images, and can be trained end-to-end on large datasets [5]. Additionally, CNNs can be fine-tuned for specific datasets, making it possible to improve their performance in challenging scenarios [8]. By leveraging the ability of CNNs to learn and extract features automatically, person Re-ID algorithms have achieved significant improvements in accuracy.
In the past decade, person Re-ID has attracted a great deal of interest, due to its utility in a range of computer vision applications, such as video surveillance and person tracking [6]. Re-ID attempts to identify a person of interest across numerous non-overlapping cameras. Recently proposed methods perform well when they use an attention mechanism to focus on the more relevant characteristics [2,6]. In addition, most Re-ID methods depend on global features, which capture the overall information in the image of a person but ignore the spatial structure of that person. For this reason, many recent Re-ID methods extract local features to improve the learned representation [9,10].
Despite the success of person recognition methods, identifying the same person in different cameras remains a difficult task, particularly in scenarios where the features of a person change repeatedly. To tackle this challenge, we propose a new person Re-ID method that combines attention learning with a pre-trained model. The pre-trained model is a deep CNN that has already been trained to find informative and strong features in images, which makes the Re-ID process easier and faster than with models trained from scratch. Our system employs an attention mechanism that combines the position attention module (PAM) and efficient channel attention (ECA). The PAM captures spatial, temporal, and channel context information, which improves the representation capability of the local features. The ECA reduces the model's complexity while maintaining accuracy, by capturing local cross-channel interactions.
Our contributions in this paper are twofold: (1) we introduce attention learning combined with a pre-trained model, for person Re-ID, which outperforms existing methods, and (2) we present an attention mechanism that combines the PAM and ECA, which improves the representation capability and decreases the complexity of the model, while preserving accuracy.
The remainder of the article is structured as follows: Section 2 covers relevant research in person Re-ID. Section 3 introduces the proposed method. Section 4 presents the experimental results. Section 5 presents the analysis study. Finally, Section 6 concludes the paper and suggests future directions.

Related Work
In recent years, person Re-ID has become a crucial task in video surveillance, and has gained significant attention in computer vision. Several approaches, including metric learning, hand-crafted features, and deep learning, have been proposed for this problem. In this section, we summarize the most recent and relevant research in this area, with a focus on deep learning methods.

Hand-Crafted Feature-Based Person Re-ID
Manual feature extraction and metric learning design are the traditional methods of person Re-ID. They rely on detecting low-level appearance features from image characteristics such as shapes, colors, and textures [11]. Support vector machines (SVMs), neural networks (NN), k-nearest neighbors (KNN), and others are used in the metric learning stage, which minimizes the distance between traits of the same person. Feature description and metric learning are two independent stages. Liao et al. [9] presented a method that incorporates effective feature detection with metric learning. They proposed local maximal occurrence (LOMO) as a feature descriptor, which represents the image by extracting color histograms and texture histograms, the latter based on the scale-invariant local ternary pattern, over sliding windows. They also used cross-view quadratic discriminant analysis (XQDA) for matching between features. Yang et al. [11] presented a color-based feature extraction method called the salient color names-based color descriptor (SCNCD), and used the KISSME technique for metric learning. SCNCD divides the image equally into six parts and then computes histograms in different color spaces on all parts, to make the final extracted features robust to changes in illumination.

Hybrid Feature-Based Person Re-ID
The hybrid method combines deep learning with metric learning: features are extracted by a convolutional neural network, and metric learning is used for classification. Saber et al. [6] used VGG-Net as a person representation, which provides a deep learning mechanism for person identification, and selected the layers estimated to give the most useful feature description of the person. Subsequently, for person matching, a support vector classifier (SVC) was used, which eliminated the issue of using a small dataset. Jayapriya et al. [10] used a CNN to extract traits from sequential information. This strategy combined the prioritized chromatic texture image (PCTimg) with the original images, and then fed them into the CNN to detect the traits. XQDA was employed for the classification. Wang et al. [12] developed a Siamese model that employed XQDA to learn a discriminant metric, and extracted traits from deep networks to obtain spatiotemporal information about the person.

Deep Learned Feature-Based Person Re-ID
Deep learning is based on neural network algorithms and has become a prominent branch of machine learning [13]. Deep learning algorithms employ multiple transformation layers with intricate structures, in an effort to represent high-level characteristics of the data. In contrast to traditional methods, deep learning methods incorporate feature description and similarity measurement into a single model. There are different kinds of architectures for deep learning-based methods, such as attention-based methods and part-based methods.
Attention-based methods aim to select high-interest regions of the input data while disregarding regions with weak or no discriminative features. Attention modules concentrate on extracting regions with highly distinguishing characteristics. Guodong et al. [14] proposed a hybrid CNN architecture, called the feature mask network (FMN), that allows the network to concentrate on both global and local discriminative features of a person's image. Wei et al. [15] established the global-local-alignment descriptor (GLAD) network, which estimates the skeleton and splits the image into parts using DeeperCut. GLAD is intended to detect both local traits from the separated parts and global traits from the whole body. The masked graph attention network (MGAT), designed by Bao et al. [16], concentrates on the relationship between individual images and their labels, while ignoring the global mutual information present in the full sample set. The MGAT depends on a complete network that extracts features, in which nodes can attend to the characteristics of other nodes in a directed manner through a mask matrix, guided by label information.
Part-based Re-ID approaches extract image regions to discover distinctive part-level features, relying on accurate part-level cues that are often neglected when retrieving global traits. The part-based convolutional baseline (PCB) network was proposed in [17]. It applies uniform partitioning on the convolutional feature map to capture part-level information, dividing the entire body into six horizontal stripes. Each part's feature vector is supplied to a classifier, which generates an ID-prediction loss that is independent for each part. Tian et al. [18] proposed a joint learning network that focuses on learning more distinctive and powerful features. They applied a global branch to learn the most distinctive global-level traits, and divided the extracted feature map into N parts, which are fed into a separate network that learns the local-level features. They then generate a local loss by combining the N part losses, and obtain the total loss by combining the local and global losses. A Siamese multiple granularity network (SMGN) with two major branches was proposed by Li et al. [19], for learning the local and global characteristics of a person independently. The features retrieved by the two branches are combined into a multi-granularity representation of the person image, and multiple loss functions are employed to enhance performance.
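The uniform partitioning used by PCB-style methods can be illustrated with a short sketch. It assumes NumPy, a single feature map of shape (C, H, W), and average pooling within each stripe; the function name stripe_features and the requirement that H be divisible by the number of parts are illustrative choices, not details taken from [17].

```python
import numpy as np

def stripe_features(feature_map, num_parts=6):
    """Split a (C, H, W) feature map into horizontal stripes and average-pool each.

    Returns an array of shape (num_parts, C): one feature vector per stripe.
    Assumption: H is divisible by num_parts (hypothetical simplification).
    """
    c, h, w = feature_map.shape
    if h % num_parts != 0:
        raise ValueError("feature map height must be divisible by num_parts")
    # Group consecutive rows into stripes: (C, num_parts, H / num_parts, W).
    stripes = feature_map.reshape(c, num_parts, h // num_parts, w)
    # Average over the rows and columns inside each stripe.
    return stripes.mean(axis=(2, 3)).T

# Example: a 24 x 8 map with 512 channels gives six 512-dimensional part vectors.
parts = stripe_features(np.zeros((512, 24, 8)), num_parts=6)
```

Each row of the result would then be passed to its own classifier, which is what makes the per-part ID losses independent.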
The above discussion shows that previous studies have tried to enhance person Re-ID performance using different methods. However, most of these methods have limitations and do not perform well on large datasets. Our proposed method addresses these limitations by combining attention learning with a pre-trained model. The main difference between the proposed work and the related work is that the proposed method combines the PAM and ECA to extract features from temporal, spatial, and channel contexts, a combination that has not been explored in previous studies. Because it draws on all three kinds of context, the proposed multi-part feature network has a high potential to achieve better results than existing methods. Table 1 summarizes the main differences between the proposed work and related work in the field of person Re-ID.

Methodology
In this section, we describe the overall structure of a multi-part feature network for the person Re-ID task. The network independently learns extensive information from different parts of the features, and the features from these parts are merged for prediction. We then describe the two attention modules that are used to reduce the impact of irrelevant background while concentrating on discriminative features of a person's appearance. Finally, we describe the loss functions. OSNet [20] acts as the foundation for our network structure, as shown in Figure 2.

Baseline Configuration
We utilized OSNet [20] as a feature extractor because it combines heterogeneous and homogeneous features and is a relatively lightweight network, capable of improving performance while avoiding over-fitting. OSNet [20] is built by stacking bottlenecks layer by layer, which decreases the number of parameters and thereby lowers the computational cost.

Position Attention Module
In person Re-ID, distinctive feature representations are fundamental, and they can be achieved by understanding contextual information.
To extract contextual information from local traits, we utilized a position attention module (PAM), which aggregates information derived from local characteristics, thereby improving their representational ability.
The structure of the position attention module (PAM) [21], which is designed to detect and aggregate the relevant pixels in the spatial domain, is depicted in Figure 3. The input feature is $F \in \mathbb{R}^{C \times H \times W}$, where C is the number of channels, H is the height, and W is the width of the input tensor. In the first branch, we feed the feature maps into a convolution layer to produce the new feature maps $F_1 \in \mathbb{R}^{C/16 \times H \times W}$, and then reshape $F_1$ to $\mathbb{R}^{C/16 \times N}$, where $N = H \times W$ is the number of pixels. We obtain $F_2$ in the second branch by the same mechanism. Following that, we perform matrix multiplication between the transpose of $F_2$ and $F_1$, and then apply a softmax layer to compute the attention map $S \in \mathbb{R}^{N \times N}$. Then, we perform matrix multiplication between S and the reshaped input feature, and reshape the result back to $\mathbb{R}^{C \times H \times W}$. Finally, the output, also in $\mathbb{R}^{C \times H \times W}$, is obtained by applying batch normalization and then an element-wise sum with the input features. In the original PAM, the third branch began with a 2D convolution layer; we removed this layer to decrease the training time and increase the accuracy of our Re-ID method.
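The steps described for the PAM can be sketched in NumPy for a single feature map. This is a minimal illustration under stated assumptions, not the authors' implementation: the two convolutions are taken to be 1 × 1 (so each reduces to a matrix product over channels), biases are omitted, the softmax is applied row-wise, and batch normalization is replaced by a per-channel standardization stand-in because only one sample is processed. The names position_attention, w1, and w2 are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(feat, w1, w2, eps=1e-5):
    """Sketch of the modified PAM for one (C, H, W) feature map.

    w1, w2: (C_r, C) weights standing in for the two 1x1 convolutions
    (assumption: C_r = C / 16 in the paper's setting; biases omitted).
    Returns the output feature map and the (N, N) attention map.
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)          # input reshaped to (C, N)
    f1 = w1 @ flat                         # first branch:  (C_r, N)
    f2 = w2 @ flat                         # second branch: (C_r, N)
    attn = softmax(f2.T @ f1, axis=-1)     # S in R^{N x N}; each row sums to 1
    # Third branch without a convolution: aggregate the input itself.
    out = flat @ attn.T                    # (C, N)
    # Stand-in for batch normalization: per-channel standardization.
    mean = out.mean(axis=1, keepdims=True)
    var = out.var(axis=1, keepdims=True)
    out = (out - mean) / np.sqrt(var + eps)
    # Residual connection: element-wise sum with the input feature.
    return out.reshape(c, h, w) + feat, attn

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4, 2))        # C = 32, so C / 16 = 2
y, s = position_attention(x, rng.standard_normal((2, 32)), rng.standard_normal((2, 32)))
```

Because the attention map has N × N entries, memory grows quadratically with the spatial size of the feature map, which is one reason such modules are usually applied to late, low-resolution layers.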

Efficient Channel Attention (ECA)
The channel attention module has shown significant potential to enhance the effectiveness of deep CNNs. Channel attention is used to refine the features of different channels by modeling the significance of each channel in the feature. One such module is efficient channel attention (ECA). ECA captures local cross-channel interactions by considering each channel together with its neighbors. It minimizes the number of parameters and reduces the model's complexity, while maintaining accuracy.
ECA's structure was proposed in [22]. To begin, as illustrated in Figure 4, global average pooling (GAP) is used to reduce the spatial dimensions of the input feature. After that, the channel weights are derived by a 1D convolution with a kernel size of three. Lastly, a sigmoid function is used to obtain the final attention weights. In this manner, the local interaction information between channels is preserved.
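The three ECA steps (GAP, a 1D convolution of kernel size three across channels, and a sigmoid) can be sketched as follows. The sketch assumes NumPy, a single (C, H, W) feature map, zero padding at the channel boundaries, and no bias; the function name eca and the example kernel values are hypothetical.

```python
import numpy as np

def eca(feat, kernel):
    """Sketch of efficient channel attention for one (C, H, W) feature map.

    kernel: 1D convolution weights of odd length (three in the text),
    shared by all channels. Returns the reweighted map and the weights.
    """
    kernel = np.asarray(kernel, dtype=float)
    k = kernel.size
    pooled = feat.mean(axis=(1, 2))             # global average pooling -> (C,)
    padded = np.pad(pooled, k // 2)             # zero padding keeps length C
    # Each channel's logit depends only on itself and its k - 1 neighbors.
    logits = np.array([padded[i:i + k] @ kernel for i in range(pooled.size)])
    weights = 1.0 / (1.0 + np.exp(-logits))     # sigmoid -> attention weights
    return feat * weights[:, None, None], weights

x = np.ones((4, 2, 2))
out, w = eca(x, [0.0, 1.0, 0.0])                # identity kernel: logit = pooled value
```

The module adds only as many parameters as the kernel has entries (three here), regardless of the number of channels, which is why it leaves the model's complexity almost unchanged.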

Loss Functions
As the final descriptor for the person Re-ID features, we concatenate the feature vectors from the GAP and feature selection. Our loss function combines an ID loss (softmax loss) [23] for the six parts of the selected feature with a hard-margin triplet loss [24] and a center loss [25] for the concatenated feature. As demonstrated in Figure 2, each classifier predicts the identity of the input image, and the individual losses are combined into a total loss in which β and α are weighting factors.
The cross-entropy loss (softmax loss) for the learned features $f_i$, with label smoothing [23], is defined over the following quantities: N is the batch size, C is the number of identity classes, $f_i$ is the extracted feature, $w_i$ and $b_i$ are the weight and bias for class i, respectively, and $q_{y_i}$ is the ground truth of the labels. By maintaining several centers for each identity class, the hard triplet loss [24] outperforms the softmax loss. However, it requires a max function to find the closest center for each identity class. Because the max function is not smooth, the loss can be sensitive when several centers are close. Smoothing the max function can therefore be used to enhance robustness. In the hard triplet loss for the learned feature $f_i$, λ is the factor used to optimize the smoothed triplet loss, δ is a predefined margin, and $S_{(i,j)}$ is the similarity between feature $f_i$ and class j.
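As an illustration of the first term, the following pure-Python sketch computes a label-smoothed cross-entropy over a batch. It is a sketch under stated assumptions rather than the paper's exact formula: the smoothed target follows the common convention of 1 - ε + ε/C for the true class and ε/C otherwise, the default ε = 0.1 is a hypothetical value, and the inputs are taken to be the per-class scores (the weight-feature product plus bias) already computed by the classifier.

```python
import math

def smoothed_cross_entropy(logits, labels, epsilon=0.1):
    """Mean cross-entropy over a batch with label-smoothed targets.

    logits: list of per-sample score lists, one score per identity class
    labels: list of ground-truth class indices
    epsilon: smoothing strength (hypothetical default, not from the paper)
    """
    total = 0.0
    for scores, y in zip(logits, labels):
        num_classes = len(scores)
        # Log of the softmax normalizer, computed stably.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        for j, s in enumerate(scores):
            # Smoothed target distribution q_j.
            q = epsilon / num_classes
            if j == y:
                q += 1.0 - epsilon
            total -= q * (s - log_z)
    return total / len(labels)

loss = smoothed_cross_entropy([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]], [0, 1])
```

Setting epsilon to zero recovers the ordinary softmax loss, so the smoothing can be read as a regularizer that discourages over-confident identity predictions on the training identities.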
The center loss [25] is used to decrease the intra-class variance of the samples in the mini-batch, while keeping the features of different classes separate. By reducing the distances within each class, it makes the samples of that class more compact. In the center loss function, $f_i$ is the extracted feature, and $c_{y_i}$ is the updated deep feature center of class $y_i$.
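For reference, center loss is commonly written in the form below. This is offered as a sketch consistent with the symbols defined above, not as a reproduction of the paper's own equation; the factor of 1/2 and the sum over a mini-batch of size N are assumptions taken from the usual formulation.

```latex
\mathcal{L}_{C} = \frac{1}{2} \sum_{i=1}^{N} \left\lVert f_i - c_{y_i} \right\rVert_2^2
```

Under this form, the loss is zero only when every feature coincides with the center of its own class, which is the compactness property described in the text.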

Experimental Results and Discussion
In this section, we carry out comprehensive experiments to confirm the viability of the proposed method. The section is arranged as follows: first, we describe three common datasets; second, we explain the implementation details; third, we describe the protocols employed to evaluate our approach; and fourth, we compare the proposed approach with competing approaches on these datasets.

The Utilized Datasets
To develop and test the proposed model, we employed three common datasets, shown in Table 2, which are the fundamental datasets for the person Re-ID task.
CUHK03 [26]: the first sizeable dataset for the person Re-ID task. The persons in its images are detected both by manual labeling and by deformable part models (DPM). It contains 1467 identities, captured by two non-overlapping cameras.
Market-1501 [27]: gathered by six separate cameras at Tsinghua University. It contains 1501 identities, and the persons in its images are detected by manual labeling and deformable part models (DPM). It also has 2793 false images produced by the DPM detector.
DukeMTMC [7]: one of the large-scale datasets. Eight cameras were used in the DukeMTMC dataset to track multiple targets. It contains 1812 persons, and the person in each image is manually labeled.

Specifics of Implementation
Our network was tested on a PC with an NVIDIA RTX 3060 GPU with 12 GB of memory. OSNet [20] was pre-trained on ImageNet [28], and we omitted its GAP layer and fully connected layers. All images were resized to 384 × 128 before being fed into the network. For training, the network was optimized with the stochastic gradient descent (SGD) algorithm, with a learning rate of 0.04, a decay rate of 0.1, and a momentum of 0.9. The CUHK03 training set has 767 individuals and 7368 images, while the testing set has another 700 individuals and 6732 images. Market-1501 has 751 persons in the training set, with 12,936 images, and another 750 persons in the testing set, with 16,482 images. DukeMTMC has 702 identities, with 16,522 images in the training set, and another 702 identities in the testing set, with 16,426 images.

Metric Protocol
We employed the single-shot setting in our experiments, which allows a thorough comparison. The cumulative matching characteristic (CMC) [29] and mean average precision (mAP) [30] were used to evaluate the person Re-ID performance. To improve performance even further, we added the re-ranking method [31], based on k-reciprocal encoding, to our method. Re-ranking was applied in the testing phase.
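The two metrics can be illustrated with a small pure-Python sketch. It assumes that each query already comes with a gallery ranking sorted from most to least similar, and it omits the usual Re-ID filtering of gallery images taken by the same camera as the query; the function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_ids, query_id):
    """Average precision of one ranked gallery list for one query identity."""
    hits, precision_sum = 0, 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / position   # precision at this correct match
    return precision_sum / hits if hits else 0.0

def evaluate(rankings, ks=(1, 5, 10, 20)):
    """Compute CMC rank-k accuracies and mAP.

    rankings: list of (query_id, ranked_gallery_ids) pairs.
    Returns ({k: rank-k accuracy}, mAP).
    """
    n = len(rankings)
    # Rank-k: fraction of queries with a correct match among the top k results.
    cmc = {k: sum(q in ranked[:k] for q, ranked in rankings) / n for k in ks}
    mean_ap = sum(average_precision(ranked, q) for q, ranked in rankings) / n
    return cmc, mean_ap

toy = [(1, [1, 2, 1]), (2, [1, 2, 3])]
cmc, mean_ap = evaluate(toy, ks=(1, 5))
```

Rank-1 rewards only the first returned match, whereas mAP depends on the positions of all correct matches, which is why a method can lead on one metric and trail on the other.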

Evaluation on the Used Datasets
The proposed method achieves excellent results compared to preceding methods. We evaluate it against the state-of-the-art methods on each of the three datasets as follows.
Market-1501 database: Table 3 compares the results of the proposed technique with those of other person Re-ID methods on the Market-1501 dataset. The proposed method achieves a rank-1 accuracy of 95.93%, compared to 95.61% for DM-OSNet [8], the best competing score. However, DRL-Net [32] achieves the highest mean average precision (mAP), 89.9%, while the proposed method achieves 87.57%. With re-ranking [31], the proposed method achieves even higher results, with a rank-1 accuracy of 96.41% and an mAP of 94.15%. The results demonstrate that the proposed method performs well compared to other state-of-the-art techniques, although some methods remain better in certain respects, such as DRL-Net for mAP. The findings of this study could have important implications for the development of more accurate person re-identification systems in real-world applications.

DukeMTMC-reID database: This dataset is more challenging than Market-1501, because its greater number of camera views and noisy backgrounds produce more variation within classes. Table 3 also presents the results of various person re-identification methods on the DukeMTMC dataset, including our proposed method with and without re-ranking. In terms of rank-1 accuracy, our method achieves 89.77%, compared to 89.5% for AET-Net [33], the best competing score. However, AET-Net [33] achieves the highest mean average precision (mAP), 80.1%, while the proposed method achieves 78.62%. With re-ranking, our proposed method improves to 94.08% and 92.22% in rank-1 and mAP, respectively.
CUHK03 database: Table 4 presents the results of various person re-identification methods on the CUHK03 dataset, including our proposed method with and without re-ranking. Our proposed method performs well on both the labeled and detected types, outperforming other methods by 3.01% and 2.65%, respectively. Moreover, re-ranking yields a substantial improvement, with an increase of 18% for the labeled type and 17.51% for the detected type. Compared with other state-of-the-art methods, our approach shows several strengths. For instance, it outperforms the widely used PCB [17] method by 10.55% in rank-1 accuracy for the detected type, and it outperforms the HAN [34] method by 26.71% in mAP for the labeled type. However, our method does have some limitations, such as being computationally expensive, due to the high-dimensional feature extraction required. Despite these limitations, the results indicate that the proposed method has the potential to be a useful tool for person re-identification in real-world scenarios.

Research Analysis
In this section, we analyze how several parameters affect performance, primarily on the Market-1501 dataset, including the image size, the number of image parts, batch size, loss type, attention module type, and number of epochs.

Comparison of Loss Function Change
In the training stage, our loss function combines the cross-entropy loss, the hard triplet loss, and the center loss. To inspect the effect of the loss function and confirm the benefit of employing multiple losses, we performed experiments in which the cross-entropy loss was combined with the triplet loss, the center loss, or both. Table 5 shows the performance of these loss combinations on the three datasets: Market-1501, DukeMTMC, and CUHK03 (both labeled and detected). As the experimental results show, using multiple losses improves accuracy to varying degrees on the three datasets, compared to using only the softmax loss. For instance, the full combination of losses outperformed the alternatives by 0.99% and 1.11% in rank-1 and mAP, respectively, on the DukeMTMC dataset. Similarly, results on the CUHK03 dataset increased by 1.14% and 1.64% for the labeled and detected sets, respectively. When combined, the loss functions interact, and the network converges towards better performance at the cost of speed. Overall, the results demonstrate that combining the losses is effective in enhancing the accuracy of the network.

Comparison of Attention Change
Many cutting-edge methods for person Re-ID make use of attention modules. To extract global features, we added an attention module consisting of PAM and ECA to the network. To investigate the effectiveness of this attention module in our framework, we conducted experiments on the Market-1501 dataset. Six structures are compared: only the network without the attention, the network with only one attention module (PAM or ECA), the network with changing the order of the attention modules (PAM after or before ECA), the network with the average of the attention modules (PAM and ECA), and the complete network. Table 6 shows the experimental results for the Market-1501 dataset, comparing the different attention configurations against a baseline without attention. The configurations of the attention module are denoted by the numbers in parentheses. As seen in the table, the use of attention mechanisms improves the network's performance, achieving higher rank-1 and mAP scores than the baseline. Specifically, the best-performing configuration is (1) (2), which uses both PAM and ECA and achieves a rank-1 score of 95.93% and an mAP of 87.57%, an improvement of 2.42% and 6.92%, respectively, over the baseline. Additionally, the results show that PAM contributes more to the improvement than ECA: configuration (1), which uses only PAM, achieved a higher rank-1 score than configuration (2), which uses only ECA. This suggests that PAM is more effective in capturing long-range dependencies between features. Overall, the experimental results demonstrate that PAM and ECA, in an appropriate configuration, can significantly improve the rank-1 and mAP scores, which are important performance metrics for person re-identification systems.

Comparison of Using Different Pre-Trained Models
To investigate the usefulness of the baseline that we chose, we compare the results of several baselines on different datasets. Our baselines for comparison are various versions of OSNet [20] and VGG16. Table 7 presents the experimental results obtained for the Market-1501, DukeMTMC, CUHK03-labeled, and CUHK03-detected datasets. Each row of the table corresponds to a different baseline network, while each column shows the rank-1 and mAP scores for a specific dataset. Our results demonstrate that adding the proposed branches improves the performance of all baseline networks, with the most significant gains observed for the OSNet ×1 network. In particular, our method achieved a rank-1 accuracy of 95.93% and an mAP of 87.57% on the Market-1501 dataset, outperforming all other baseline networks. Our results also show that the VGG16 baseline network performed relatively poorly, with a rank-1 accuracy of only 90.25% and an mAP of 74.86%. When comparing each baseline network to the proposed method, our method outperformed all baseline networks on all datasets, except for DukeMTMC, where OSNet ×1 with our proposed addition achieved the best performance. We also observed some trends in the data. For example, OSNet ×1 performed significantly better than the other OSNet baselines after adding all the branches, except in rank-1 for DukeMTMC. Additionally, the OSNet ×0.75 and OSNet ×1 networks achieved higher performance than OSNet ×0.5 and OSNet ×0.25, respectively, suggesting that larger networks may better capture complex features in person re-identification.

Comparison of Network Architectural Change
Table 8 shows that applying the attention module after layer 4 provided the best performance in terms of rank-1, rank-5, rank-10, rank-20, and mAP scores. This indicates that the retrieved feature should include both coarse and fine information about a person's representation for the attention module to be most effective. The results also show a clear trend of increasing performance with deeper layers, as adding the attention module after layers 3 and 4 improves performance compared to adding it after layer 2. Moreover, the rank-1 score of 95.93%, achieved by applying the attention module after layer 4, is the highest among the positions tested. These results support the proposed strategy for integrating attention mechanisms into person re-identification models and suggest that future work should explore the use of attention modules in conjunction with deeper network architectures.

Comparison of Image Size Change
To better understand the impact of image size on the performance of the proposed method, we conducted experiments using different image sizes, and evaluated the results in terms of rank-1, rank-5, rank-10, rank-20, and mAP scores, as shown in Table 9. Resizing the image to 384 × 128 provided the best rank-1 accuracy, 95.93%. The other image sizes had rank-1 scores ranging from 95.24% to 95.86%. This suggests that a larger image size can capture more detailed information about the person's appearance, leading to better recognition performance. The choice of image size also affects the overall computational cost of the system, and this factor should be considered when selecting the image size for a given application.

Comparison of Feature Part Number
We explored the impact of the number of feature parts on overall Re-ID performance, using the Market-1501 dataset. We trained the model with different numbers of feature selection parts. If the number of parts is set to 1, the output feature is a global feature. According to the results presented in Figure 5, six parts give the best performance on the Market-1501 dataset. Re-ID performance begins to fall when further parts are added, indicating that too many parts burden the model training and therefore lower performance.

Comparison of Batch Size Change
Here, we examined the effects of modifying the batch size in the training stage, where the batch size is the number of images fed into the network at once. We conducted comparative experiments to examine the impact of various batch sizes on the performance of our network. The largest batch size that could be used was 64, because of GPU limitations. Figure 6 illustrates the results of the experiment. As seen, performance changes with the batch size. Accuracy on the Market-1501 dataset reaches its highest value at a batch size of 64, although the improvement over a batch size of 48 is slight. The trend suggests that accuracy may continue to improve with larger batches. We conclude that increasing the batch size helps the network process the features derived from the samples.

Impact of Numbers of Epoch
Figure 7 illustrates the empirical results for our network during the training stage, showing the effect of changing the number of epochs. Three different datasets were used in the experiment. Training ran for 100 epochs and was evaluated every 5 epochs. As illustrated in Figure 7, both rank-1 and mAP improve as the number of training epochs increases, but the gains become slight after epoch 35 on the Market-1501, DukeMTMC, and CUHK03-labeled datasets, whereas the CUHK03-detected dataset needs to reach epoch 55 before the change becomes small.

Conclusions
This research presents a multi-part feature network for person Re-ID, which combines the position attention module with efficient channel attention to improve the robustness and discrimination of the features. The proposed attention mechanism utilizes temporal, spatial, and channel context information to extract a broader variety of contextual information from local features, enhancing their representational capacity. Under the constraints of multiple losses, the proposed method can produce resilient feature representations. Extensive testing on three datasets showed that the proposed method outperformed state-of-the-art techniques in rank-1 accuracy and was highly generalizable. The results indicate that it enhances both quantitative and qualitative performance in re-identifying individuals. In the future, we intend to investigate and extend the proposed method to improve the accuracy and efficiency of person Re-ID.