1. Introduction
Power grid infrastructure underpins modern industry and daily life, with high-voltage substations serving as critical nodes that regulate electricity transmission and distribution. Within these environments, field workers perform routine inspection and maintenance tasks in close proximity to energized equipment, where safety violations—including unauthorized entry into restricted zones or improper grounding operations—can result in severe equipment damage, widespread power outages, or fatal injuries. Ensuring worker safety therefore represents not only an operational priority but a societal imperative.
Modern substations are increasingly equipped with multi-camera surveillance networks and body-worn sensing devices, generating substantial volumes of visual data continuously. However, the ability to extract actionable safety intelligence from this data remains limited. Manual inspection of large-scale video streams is impractical, while conventional rule-based video analytics systems suffer from high false alarm rates. In addition, simple object detection-based approaches are prone to both false positives and missed detections in complex substation environments. More critically, existing methods lack the capability to associate the same individual across multiple camera views. As a result, they fail to provide a coherent spatiotemporal understanding of worker behavior, particularly for safety events that are only partially observable from a single viewpoint. This gap between data availability and analytical capability motivates the development of more principled and robust automated approaches. From a deployment perspective, practical substation monitoring systems must also satisfy hardware constraints imposed by grid-edge infrastructure: low-power embedded processors and network-bandwidth-limited environments prohibit cloud-offloaded inference, requiring that perception models remain computationally lightweight. Furthermore, grid safety standards such as IEC 61850 [1] (substation communication and automation) and IEEE 1686 [2] (intelligent electronic device security) constrain the integration of third-party software into control environments, underscoring the importance of minimal-footprint, annotation-free approaches that can be validated and certified without reliance on proprietary labeled data pipelines.
Person re-identification (Re-ID) offers a compelling solution to this cross-camera association problem. Formally, Re-ID aims to retrieve all visual instances of a given individual across a distributed camera network. In particular, unsupervised person Re-ID, which requires no manual identity annotations, is especially attractive for deployment in real-world industrial settings where labeled training data is difficult or costly to obtain. Despite rapid progress in general-purpose surveillance scenarios [3], applying unsupervised Re-ID to substation environments poses unique and underappreciated challenges. First, dense equipment layouts and confined workspaces introduce persistent, structured occlusion that creates substantial intra-class appearance variation for each worker. Second, mandated safety uniforms including standardized helmets and protective clothing severely reduce inter-person discriminability, compounding fine-grained recognition difficulty [4]. Third, substations span heterogeneous lighting conditions across outdoor, indoor, and underground zones, amplifying appearance inconsistency for the same individual across views. Together, these factors make substation Re-ID a significantly harder problem than existing benchmarks suggest.
Existing unsupervised Re-ID methods fall into two major paradigms. The first, unsupervised domain adaptation (UDA) [5,6,7,8,9], transfers knowledge from a labeled source domain to an unlabeled target domain through a two-stage pre-training and fine-tuning procedure. The second, pure unsupervised learning (USL) [10,11,12,13,14,15], directly learns discriminative representations from entirely unlabeled data, typically by assigning pseudo-labels via clustering and using them for contrastive or classification-based training. This work focuses on the USL paradigm, where representation quality is central to performance. A common design in recent USL methods is to maintain a memory dictionary of instance-level features for contrastive learning. Depending on how this dictionary is updated, existing approaches fall into two broad categories: those that treat all pseudo-labeled samples within a cluster as equally reliable [8,13], and those that prioritize the hardest positive samples to enhance intra-class compactness [10,12]. However, both strategies exhibit inherent limitations. Equal-sample selection strategies are susceptible to noise and often converge slowly. In contrast, hard-sample selection strategies, while intuitively appealing, are particularly vulnerable in early training stages when pseudo-labels are unreliable. In such cases, the hardest samples are often the noisiest, and their inclusion can distort cluster representations, a problem that becomes more pronounced on large-scale and challenging datasets such as MSMT17 [16]. This trade-off between sample informativeness and reliability remains a central challenge in unsupervised Re-ID.
To address this trade-off, we draw on the principle of curriculum learning [17], which proposes organizing training samples from easy to hard in accordance with the model’s evolving capability. While existing curriculum and self-paced learning strategies [18] operate at the dataset level and apply a uniform difficulty schedule across all training instances, this coarse granularity is ill-suited to unsupervised Re-ID, where clusters exhibit markedly different intra-class densities and difficulty profiles. We argue that sample selection should be performed adaptively at the cluster level, guided by a continuous estimate of the model’s current discriminative capability for each cluster.
In this work, we propose AdaInCV, an unsupervised Re-ID framework built around adaptive intra-class variation estimation. For each cluster, we quantify the model’s learning progress by measuring the similarity gap between the hardest and easiest positive pairs—a large gap indicates underdeveloped representations, while a small gap signals sufficient learning. This signal drives two novel components: (1) Adaptive Sample Mining (AdaSaM), which continuously calibrates the difficulty of samples selected for memory dictionary updates at the cluster level; and (2) Adaptive Outlier Filtering (AdaOF), which dynamically incorporates informative outliers—instances typically discarded by existing methods—as hard negatives to strengthen contrastive learning. Rather than treating outliers as noise to be eliminated, AdaOF leverages them selectively when the model is capable of learning from their challenging characteristics, such as severe occlusion or extreme illumination.
We validate AdaInCV on two large-scale public Re-ID benchmarks and on a newly collected in-house Substation Worker Re-ID (SWRID) dataset that reflects the real-world challenges described above. Experimental results demonstrate that AdaInCV achieves competitive performance on standard benchmarks while showing strong generalization to the demanding substation setting, supporting its practical potential for intelligent safety supervision in power grid operations. The main contributions of this work are as follows:
(1) We identify adaptive, cluster-level sample difficulty estimation as a missing and critical capability in unsupervised Re-ID, and propose AdaInCV as a principled framework to address it.
(2) We introduce AdaSaM, which performs cluster-aware difficulty-calibrated sample mining for memory updates, and AdaOF, which dynamically integrates informative outliers as hard negatives based on the model’s learning state.
(3) Extensive experiments on MSMT17, Market-1501, and the SWRID dataset demonstrate the effectiveness and practical applicability of AdaInCV in both general and industrial-safety Re-ID scenarios.
2. Materials and Methods
2.1. Problem Formulation
Let $\mathcal{X} = \{x_1, \dots, x_N\}$ denote an unlabeled dataset of $N$ worker images captured from multiple fixed cameras and body-worn recorders across a substation site, without identity annotations. The objective is to learn a feature extractor $f_\theta$ that maps images of the same worker, despite variations in viewpoint, occlusion level, and illumination, into nearby embeddings, while ensuring separability between different identities. Such a formulation naturally supports safety monitoring applications by enabling cross-camera identity association, continuous activity tracking, and multi-view violation verification.
However, the highly heterogeneous visual conditions in substation environments lead to significant variations in intra-class feature distributions across pseudo-clusters. For instance, a cluster corresponding to a worker captured in well-lit outdoor areas may exhibit compact feature distributions, whereas clusters formed from mixed environments (e.g., outdoor substations and underground cable trenches) tend to be highly dispersed. Under such conditions, applying a uniform global training strategy is suboptimal, as it fails to account for cluster-specific learning dynamics. To address this issue, AdaInCV introduces a per-cluster adaptive mechanism that dynamically adjusts training difficulty according to the intra-class variation of each pseudo-cluster.
2.2. Pipeline Overview
As shown in Figure 1, the proposed framework incorporates an adaptive sample mining strategy inspired by curriculum learning, which enables the model to select samples with appropriate difficulty levels according to the learning status of each cluster. This mechanism improves the quality of cluster-specific feature updates.
Each training epoch consists of the following steps. (1) L2-normalized embeddings are computed for all images using the current student network. (2) DBSCAN clustering is performed to generate pseudo-labels. (3) The memory dictionary is initialized or updated using cluster centroid features. (4) For each mini-batch, reliable positive and negative samples are selected through the Adaptive Sample Mining and Adaptive Outlier Filter modules (as illustrated in Figure 1). The HybridNCE contrastive loss (Equation (2)) and the mean squared error (MSE) self-distillation loss (Equation (3)) are computed. (5) The student network is updated via back-propagation, while the teacher network is updated using an exponential moving average (EMA) strategy.
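The centroid of the $k$-th cluster is defined as:
$$c_k = \frac{1}{|\mathcal{H}_k|} \sum_{f_i \in \mathcal{H}_k} f_i \tag{1}$$
where $\mathcal{H}_k$ denotes the set of samples (represented by their embeddings $f_i$) belonging to cluster $k$.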
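The centroid computation can be illustrated with a minimal NumPy sketch. The function name `cluster_centroids`, the toy data, and the convention that un-clustered samples carry the label -1 (as DBSCAN implementations commonly do) are assumptions made for illustration; whether centroids are re-normalized afterwards is not specified in the text and is left out here.

```python
import numpy as np

def cluster_centroids(features, labels):
    """Return {cluster_id: mean embedding}; label -1 (DBSCAN noise) is skipped."""
    centroids = {}
    for k in np.unique(labels):
        if k == -1:
            continue
        centroids[int(k)] = features[labels == k].mean(axis=0)
    return centroids

# Toy data: two pseudo-clusters and one un-clustered sample.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
labels = np.array([0, 0, 1, -1])
centroids = cluster_centroids(features, labels)
```

In this toy case the un-clustered sample does not contribute to any centroid, which mirrors the separation between cluster representations and outliers used later in the loss.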
To enhance the diversity of negative samples, the proposed HybridNCE loss incorporates adaptively filtered outliers $o_j$ in addition to cluster representations:
$$\mathcal{L}_{\mathrm{HybridNCE}} = -\log \frac{\exp(\langle q, c_{+} \rangle / \tau)}{\sum_{k=1}^{N_c} \exp(\langle q, c_k \rangle / \tau) + \sum_{j=1}^{N_o} \exp(\langle q, o_j \rangle / \tau)} \tag{2}$$
where $c_k$ is the unique representation vector of the $k$-th cluster and $o_j$ are the outliers filtered based on the current model’s ability. For any query person image $q$, $c_{+}$ represents the positive cluster feature to which $q$ belongs. The temperature $\tau$ is empirically set to 0.05, and ⟨·, ·⟩ denotes the inner product between two feature vectors, used to measure their similarity. $N_c$ is the number of clusters and $N_o$ is the number of un-clustered instances.
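As a rough illustration of how such a loss behaves, the sketch below evaluates a contrastive loss of this form for a single query with NumPy. The function name `hybrid_nce`, the argument layout, and the toy vectors are hypothetical; only the temperature of 0.05 and the idea of adding filtered outliers to the set of negatives come from the text.

```python
import numpy as np

def hybrid_nce(q, centroids, outliers, pos_idx, tau=0.05):
    """Contrastive loss of query q against cluster centroids plus outliers."""
    keys = np.vstack([centroids, outliers]) if len(outliers) else centroids
    logits = keys @ q / tau
    logits = logits - logits.max()          # subtract max for numerical stability
    return float(np.log(np.exp(logits).sum()) - logits[pos_idx])

q = np.array([1.0, 0.0])
centroids = np.array([[0.8, 0.6], [0.6, 0.8]])   # row 0 is the positive cluster
outliers = np.array([[0.96, 0.28]])              # one hard un-clustered sample
loss_without = hybrid_nce(q, centroids, np.empty((0, 2)), pos_idx=0)
loss_with = hybrid_nce(q, centroids, outliers, pos_idx=0)
```

Adding the outlier enlarges the denominator, so `loss_with` exceeds `loss_without`; an outlier close to the query therefore acts as a hard negative that the encoder has to push away.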
Inspired by MoCo [19], we adopt a teacher–student framework composed of a student network $f_{\theta_s}$ and a teacher network $f_{\theta_t}$. The student is optimized via back-propagation, while the teacher is updated as an EMA of the student. The model is trained in a self-supervised manner by minimizing the mean squared error (MSE) between the predicted class probability distributions:
$$\mathcal{L}_{\mathrm{MSE}} = \left\| p_s(x) - p_t(x) \right\|_2^2 \tag{3}$$
where $p_s(x)$ and $p_t(x)$ denote the class probability distributions predicted by the student and teacher networks for image $x$, respectively.
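The overall training objective combines contrastive learning and self-distillation:
$$\mathcal{L} = \mathcal{L}_{\mathrm{HybridNCE}} + \lambda \mathcal{L}_{\mathrm{MSE}} \tag{4}$$
where $\lambda$ is a balancing hyperparameter controlling the relative weight of the self-distillation loss $\mathcal{L}_{\mathrm{MSE}}$ (Equation (3)). In our experiments, $\lambda$ is set to 1.0, as preliminary experiments showed that $\mathcal{L}_{\mathrm{HybridNCE}}$ and $\mathcal{L}_{\mathrm{MSE}}$ contribute at comparable magnitudes, making equal weighting a natural and stable choice.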
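The teacher–student update and the combined objective can be sketched as follows. This is a simplified illustration, not the training code: parameters are plain NumPy arrays in a dictionary, the helper names are hypothetical, the teacher momentum of 0.999 is a placeholder (the text only fixes the momentum used for memory updates), and the MSE is averaged over classes, which differs from a summed form only by a constant factor.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ema_update(teacher, student, momentum=0.999):
    """teacher <- m * teacher + (1 - m) * student, parameter by parameter."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

def mse_distillation(student_logits, teacher_logits):
    """Mean squared error between the two predicted class distributions."""
    diff = softmax(student_logits) - softmax(teacher_logits)
    return float(np.mean(diff ** 2))

def total_loss(l_nce, l_mse, lam=1.0):
    """Contrastive term plus lambda-weighted self-distillation term."""
    return l_nce + lam * l_mse

teacher = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 2.0])}
new_teacher = ema_update(teacher, student)

l_mse = mse_distillation(np.array([[2.0, 0.0]]), np.array([[2.0, 0.0]]))
loss = total_loss(1.2, l_mse)
```

With a momentum close to 1 the teacher moves only slightly per step, which is what makes its predictions a stable target for the student.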
2.3. Adaptive Model Capability Acquisition
For a mini-batch containing $P$ identities, each with $K$ instances, we denote the normalized feature of the $j$-th sample in class $i$ as $f_j^i$. Existing memory update strategies exhibit complementary limitations: average-based methods (e.g., SpCL, CC) fail to emphasize hard samples in later stages, while hardest-sample strategies (e.g., ICE, HDCRL) are prone to noise in early training, when the model’s discriminative capability is still insufficient. To address this trade-off, we introduce a curriculum-inspired adaptive similarity formulation that dynamically balances easy and hard positive pairs according to intra-class variation.
From a geometric perspective, the hardest positive pair determines the radius of the largest hypersphere enclosing all samples of class $i$, reflecting the worst-case intra-class compactness. Its similarity is approximated as:
$$S_h^i = \min_{j,k} \langle f_j^i, f_k^i \rangle$$
Conversely, each sample can form a local hypersphere with its farthest positive, and the aggregation of these spheres characterizes easier intra-class relations. The least-hard positive similarity, capturing the best-case intra-class compactness, is defined as:
$$S_l^i = \max_{j} \min_{k} \langle f_j^i, f_k^i \rangle$$
In practice, large intra-class variations (e.g., illumination changes, occlusion, or viewpoint shifts) often lead to unreliable hardest pairs, particularly in early training stages. In contrast, smaller variations indicate that the model is sufficiently discriminative and can benefit from harder samples. To adaptively balance these two regimes, we exploit the discrepancy between the hardest similarity $S_h^i$ and the least-hard similarity $S_l^i$ as a proxy for intra-class variation and define an adaptive weight via their harmonic mean. The harmonic mean is preferred over the arithmetic mean because it is more sensitive to asymmetry between the two terms. When intra-class variation is large, the similarity of the hardest pair decreases substantially, whereas the similarity of the least-hard pair may remain relatively high. In such cases, the arithmetic mean tends to overestimate cluster reliability. By contrast, the harmonic mean penalizes this imbalance by biasing the result toward the smaller value, thereby producing a lower weight, $w_i$. This conservative weighting strategy delays the emphasis on hard samples until the model has learned sufficiently discriminative representations:
$$w_i = \frac{2\, S_h^i\, S_l^i}{S_h^i + S_l^i}$$
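Finally, the weighted positive similarity is formulated as:
$$S^i = w_i\, S_h^i + (1 - w_i)\, S_l^i$$
where $w_i$ is the adaptive weight, and $S_h^i$ and $S_l^i$ are the hardest and least-hard positive similarities of class $i$.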
This formulation enables a smooth transition from easy-sample dominance in early training to hard-sample emphasis in later stages, thereby improving robustness against noisy positives while maintaining strong discriminative learning.
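The quantities in this subsection can be made concrete with a small NumPy sketch for one pseudo-class. It should be read as one plausible interpretation of the description rather than the exact formulation: it assumes that the hardest similarity is the minimum over all positive pairs, that the least-hard similarity is the maximum of the per-sample minima, that both similarities are positive so the harmonic mean is well defined, and that the weight multiplies the hardest term. The function name is hypothetical.

```python
import numpy as np

def adaptive_positive_similarity(feats):
    """feats: (K, d) L2-normalised features of one pseudo-class, K >= 2."""
    sim = feats @ feats.T
    np.fill_diagonal(sim, np.inf)          # ignore self-similarity
    per_sample_hardest = sim.min(axis=1)   # each sample vs. its farthest positive
    s_hard = per_sample_hardest.min()      # hardest positive pair overall
    s_least = per_sample_hardest.max()     # least-hard positive pair
    # Harmonic mean; assumes both similarities are positive.
    w = 2.0 * s_hard * s_least / (s_hard + s_least)
    s_weighted = w * s_hard + (1.0 - w) * s_least
    return s_hard, s_least, w, s_weighted

# Three unit vectors at 0, 20 and 60 degrees stand in for one worker's features.
angles = np.deg2rad([0.0, 20.0, 60.0])
feats = np.stack([np.cos(angles), np.sin(angles)], axis=1)
s_hard, s_least, w, s_weighted = adaptive_positive_similarity(feats)
```

In this example the harmonic mean falls below the arithmetic mean of the two similarities, which is the conservative behaviour the text argues for when the hardest pair is much less similar than the least-hard pair.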
2.4. Adaptive Sample Mining (AdaSaM)
Unlike instance-level memory dictionaries, our memory-based feature dictionary stores cluster-level representations, where each entry corresponds to a pseudo-label (i.e., a cluster). To adapt the memory update to intra-class variability, we measure the dispersion of samples within each cluster based on the weighted positive pair similarity defined in the previous section.
Specifically, we quantify the intra-class variation by normalizing the relative position of the current cluster similarity $S^i$ between the hardest and least-hard positive similarities, yielding a difficulty score:
This formulation provides an interpretable measure of intra-class compactness. When the difficulty score is high, the cluster exhibits tight and well-separated representations, indicating that the model is sufficiently discriminative and can benefit from emphasizing the hardest positive samples. In contrast, when the score is low, the cluster is highly scattered (typically due to occlusion, illumination variation, or viewpoint changes), suggesting that relying on hard samples may introduce noise into the memory.
To address this, we design a difficulty-aware sampling strategy that adaptively selects the update feature according to the current cluster state. Instead of always using the hardest or average sample, we interpolate between them by ranking samples according to their similarity to the cluster centroid. Concretely, we define a selection coefficient:
which determines the relative position of the selected sample in the ranked list.
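The memory entry $c_i$ is then updated via a momentum scheme:
$$c_i \leftarrow m\, c_i + (1 - m)\, f^i_{(r_i)}$$
where $m$ is the momentum coefficient, $r_i$ is the rank determined by the selection coefficient, and $f^i_{(r_i)}$ denotes the $r_i$-th sample after sorting all instances in cluster $i$ by their similarity to the cluster centroid in descending order.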
This mechanism enables a smooth transition from easy-sample-dominated updates in early training (low difficulty score) to hard-sample-focused updates as the model becomes more robust (high difficulty score). Consequently, it mitigates error accumulation caused by noisy hard samples while preserving the discriminative benefits of challenging examples in later stages.
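A minimal sketch of a difficulty-matched memory update is given below. The mapping from score to rank used here (a score of 0 selects the sample most similar to the centroid, a score of 1 the least similar, with linear interpolation in between) is an assumption chosen for illustration, as are the function name and the toy momentum of 0.5; re-normalizing the updated entry, which many memory-based implementations do, is omitted.

```python
import numpy as np

def adasam_update(memory_entry, cluster_feats, score, momentum=0.999):
    """Momentum update of one memory entry with a difficulty-matched sample.

    score in [0, 1]: 0 picks the sample closest to the centroid (easiest),
    1 picks the farthest one (hardest). This mapping is an assumption.
    """
    centroid = cluster_feats.mean(axis=0)
    order = np.argsort(-(cluster_feats @ centroid))   # descending similarity
    rank = int(round(float(np.clip(score, 0.0, 1.0)) * (len(order) - 1)))
    selected = cluster_feats[order[rank]]
    updated = momentum * memory_entry + (1.0 - momentum) * selected
    return updated, int(order[rank])

memory = np.array([1.0, 0.0])
cluster_feats = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
easy_entry, easy_idx = adasam_update(memory, cluster_feats, score=0.0, momentum=0.5)
hard_entry, hard_idx = adasam_update(memory, cluster_feats, score=1.0, momentum=0.5)
```

With a low score the entry is pulled toward a typical member of the cluster, while a high score pulls it toward the sample that deviates most from the centroid.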
2.5. Adaptive Outlier Filter (AdaOF)
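We further define a global difficulty indicator to characterize the overall learning status of the model. Specifically, the global difficulty score is computed as the average of all cluster-wise difficulty values:
$$D = \frac{1}{N_c} \sum_{i=1}^{N_c} d_i$$
where $d_i$ is the difficulty score of cluster $i$ and $N_c$ is the number of clusters.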
This metric reflects the model’s current discriminative capability across all pseudo-labels and serves as a key signal for adaptive outlier utilization. In practice, clustering algorithms (e.g., DBSCAN) inevitably produce outliers, which often correspond to samples with extreme intra-class variations, such as severe occlusion, illumination inconsistency, or viewpoint changes. Rather than treating these samples as pure noise, we consider them as informative hard negatives that can enhance contrastive learning when appropriately incorporated.
To this end, we propose an adaptive outlier filtering (AdaOF) strategy guided by the global difficulty score $D$. Specifically, we rank all samples (including outliers) according to their distance to cluster centroids in descending order. During early training stages, when $D$ is low and the model lacks robustness, we preferentially select samples that are farther from all clusters, as they are more likely to be reliable negative instances. As training progresses and $D$ increases, the selection criterion is gradually relaxed, allowing outliers to be incorporated from far to near, thereby increasing the diversity and hardness of negative samples.
To further stabilize training, we introduce a curriculum factor:
which modulates the influence of the global difficulty score. This design effectively slows down outlier incorporation in early epochs (where intra-class variance is typically high), thus preventing noisy samples from prematurely contaminating the memory dictionary. As the model becomes more robust, the influence of the global difficulty score increases, enabling a smoother and more reliable transition toward harder negative mining.
Notably, this curriculum factor is not limited to outlier handling; it is consistently integrated into the difficulty-aware learning process, including the cluster-level difficulty estimation, forming a unified mechanism for progressive sample selection. This design is especially beneficial in scenarios with significant intra-class variability, while introducing negligible overhead for relatively simpler distributions.
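The far-to-near incorporation of outliers can be sketched as follows. The concrete rule is an assumption made for illustration: each outlier is scored by its cosine distance to the nearest centroid, outliers are ranked from farthest to nearest, and the kept fraction is the product of a curriculum factor and the global score, clipped to [0, 1]. The function and argument names are hypothetical.

```python
import numpy as np

def select_outliers(outlier_feats, centroids, global_score, curriculum=1.0):
    """Return indices of outliers to use as negatives, farthest first."""
    sims = outlier_feats @ centroids.T            # (N_o, N_c) cosine similarities
    dist_to_nearest = 1.0 - sims.max(axis=1)      # distance to the closest cluster
    order = np.argsort(-dist_to_nearest)          # farthest outliers first
    frac = float(np.clip(curriculum * global_score, 0.0, 1.0))
    n_keep = int(np.ceil(frac * len(order)))
    return order[:n_keep]

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
outlier_feats = np.array([[-1.0, 0.0], [0.6, 0.8], [-0.6, -0.8]])
early = select_outliers(outlier_feats, centroids, global_score=0.3)
late = select_outliers(outlier_feats, centroids, global_score=1.0)
```

Early in training only the outlier farthest from every cluster is kept, while a high global score admits all of them, including the one lying close to an existing cluster.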
3. Results
3.1. Datasets
Market-1501 [20] consists of 32,668 annotated images of 1501 identities from 6 cameras, with 12,936 training images of 751 identities and 19,732 test images of 750 identities.
MSMT17 [16] is one of the largest publicly available Re-ID datasets, containing 126,441 images of 4101 identities from 15 cameras, with 32,621 training images and 93,820 testing images.
Substation Worker Re-ID (SWRID). To evaluate the proposed method in realistic industrial scenarios, we construct a dedicated dataset termed Substation Worker Re-ID (SWRID), collected at four substations. Each substation is equipped with 21 to 29 fixed high-definition surveillance cameras (1080p) deployed across outdoor transformers, indoor GIS rooms, and control buildings. In total, the dataset comprises 98 cameras across the four substations. Data collection spans eight months, covering spring, summer, and autumn operating conditions, including varying natural illumination (daytime and dusk) and artificial lighting (night-shift indoor environments). All pedestrian crops are generated automatically using an off-the-shelf detector without any manual bounding box annotation.
We adopt a subject-disjoint training and testing protocol to rigorously assess cross-camera generalization. Specifically, 55 identities (approximately 70%) from all four substations are used for training, yielding 8743 unlabeled images for unsupervised representation learning without accessing identity annotations. The remaining 24 identities are reserved for evaluation, from which a gallery set of 2304 images (96 per identity) and a query set of 480 images (20 per identity) are constructed. Query images are sampled from camera views disjoint from the gallery. Training and evaluation identities are strictly non-overlapping, preventing any potential label leakage during unsupervised training. Note that SWRID is not publicly available due to the data security policies of State Grid Corporation of China, which limits direct reproducibility on this dataset. Researchers interested in accessing the dataset may contact the corresponding author to discuss formal data-sharing arrangements.
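The split sizes quoted above are mutually consistent, which the following short check makes explicit. The total of 79 identities is derived here from the two reported counts and is not stated directly in the text.

```python
train_ids, test_ids = 55, 24
total_ids = train_ids + test_ids          # derived, not stated in the text
train_share = train_ids / total_ids       # roughly 70% of identities

gallery_images = test_ids * 96            # 96 gallery images per test identity
query_images = test_ids * 20              # 20 query images per test identity
```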
3.2. Implementation Details
Network and Training. We adopt ResNet-50 as the backbone encoder, initialized with parameters pre-trained on ImageNet. The network outputs 2048-dimensional L2-normalized feature representations via a global average pooling layer followed by batch normalization. The model is optimized using the Adam optimizer with a weight decay of and an initial learning rate of . A linear warm-up strategy is applied during the first 10 epochs, with no subsequent learning rate decay. The model is trained for 70, 30, and 60 epochs on Market-1501, MSMT17, and SWRID, respectively. Each mini-batch contains 256 images, sampled from 16 pseudo-identities with 16 instances per identity. During training, images are first resized to 256 × 128, padded by 10 pixels, randomly cropped back to 256 × 128, randomly flipped horizontally (p = 0.5), and randomly erased (p = 0.5). During inference, only resizing to 256 × 128 is applied. In both cases, images are normalized using ImageNet mean and standard deviation (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]).
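The mini-batch composition (16 pseudo-identities with 16 instances each, 256 images in total) corresponds to the identity-balanced sampling commonly used in Re-ID. The sketch below is a hypothetical NumPy version: it assumes that un-clustered samples are labeled -1 and excluded from batches, and that clusters with fewer than 16 images are sampled with replacement, neither of which is specified in the text.

```python
import numpy as np

def pk_batch(pseudo_labels, num_ids=16, num_instances=16, rng=None):
    """Sample num_ids pseudo-identities with num_instances images each."""
    rng = np.random.default_rng() if rng is None else rng
    pseudo_labels = np.asarray(pseudo_labels)
    valid_ids = np.unique(pseudo_labels[pseudo_labels >= 0])   # skip outliers (-1)
    chosen = rng.choice(valid_ids, size=num_ids, replace=False)
    batch = []
    for pid in chosen:
        idx = np.flatnonzero(pseudo_labels == pid)
        # Sample with replacement only if the cluster is too small.
        batch.extend(rng.choice(idx, size=num_instances,
                                replace=len(idx) < num_instances))
    return np.array(batch)

# Toy pseudo-labels: 20 clusters of 10 images each plus 15 outliers.
rng = np.random.default_rng(0)
pseudo_labels = np.concatenate([np.repeat(np.arange(20), 10), np.full(15, -1)])
batch = pk_batch(pseudo_labels, rng=rng)
batch_labels = pseudo_labels[batch]
```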
Clustering and Memory Update. Pseudo-labels are generated using DBSCAN [21], following commonly adopted settings in prior work. The neighborhood thresholds are set to 0.5, 0.7, and 0.65 for Market-1501, MSMT17, and SWRID, respectively. For SWRID, the parameter is selected following the same validation protocol as in existing benchmarks, reflecting its distinct cross-substation and cross-device data distribution. The exponential moving average (EMA) momentum for memory updates is fixed to m = 0.999, following standard practice in momentum-based representation learning.
Considering the varying degrees of intra-class variation and outlier ratios across datasets, we incorporate a curriculum factor into the difficulty estimation to adaptively regulate the influence of hard samples during training. This factor is applied to the difficulty score, enabling a progressive transition from conservative to more challenging sample selection.
Evaluation Metrics. Following standard person Re-ID protocols, we evaluate all methods using mean Average Precision (mAP) and Cumulative Matching Characteristic (CMC) accuracy at Rank-1, Rank-5, and Rank-10. mAP captures holistic retrieval quality, while CMC metrics reflect the probability of finding at least one correct match within the top-k results.
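For readers less familiar with these metrics, the following sketch computes mAP and CMC from a query-by-gallery distance matrix. It is a simplified illustration with a hypothetical function name: standard Re-ID protocols additionally discard gallery images that share both identity and camera with the query, which is omitted here for brevity.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, ranks=(1, 5, 10)):
    """dist: (num_query, num_gallery) distances; smaller means more similar."""
    aps, cmc, valid = [], np.zeros(max(ranks)), 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])
        matches = (g_ids[order] == q_ids[i]).astype(float)
        if matches.sum() == 0:
            continue  # a query without any correct gallery match is skipped
        valid += 1
        first_hit = int(np.argmax(matches))
        if first_hit < len(cmc):
            cmc[first_hit:] += 1          # counted as a hit at this rank and beyond
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append(float((precision * matches).sum() / matches.sum()))
    cmc = cmc / valid
    return float(np.mean(aps)), {r: float(cmc[r - 1]) for r in ranks}

q_ids = np.array([0, 1])
g_ids = np.array([0, 1, 0, 1])
dist = np.array([[0.1, 0.2, 0.9, 0.8],
                 [0.3, 0.5, 0.2, 0.4]])
mean_ap, cmc_scores = evaluate(dist, q_ids, g_ids)
```

In the toy example the first query is matched at rank 1 and the second only at rank 3, so Rank-1 accuracy is 0.5 while Rank-5 is 1.0, and mAP lies in between because it also accounts for where the remaining correct matches appear.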
Hardware. All experiments are conducted on a server equipped with 4 NVIDIA RTX 2080 Ti GPUs using the PyTorch framework (Version 1.11.0).
3.3. Comparison with State-of-the-Art Methods
We first evaluate the proposed method on widely used benchmarks, Market-1501 and MSMT17, with results summarized in Table 1. Our method achieves state-of-the-art performance among fully unsupervised learning (USL) methods, reaching 87.4% and 38.8% mAP on Market-1501 and MSMT17, respectively. It outperforms all existing USL methods on Market-1501 and all camera-agnostic USL methods on MSMT17. Compared with the baseline HDCRL, the proposed method improves mAP by 2.9 and 18.1 percentage points, respectively. In addition, it consistently surpasses existing contrastive learning-based USL approaches. For reference, we also include several UDA methods (marked with *) that leverage additional labeled source-domain data; while direct comparison is not strictly fair due to this extra supervision, our method nonetheless outperforms them, demonstrating that our adaptive contrastive learning strategy is competitive even against approaches with privileged access to labeled data. Compared with ISE, which requires generating auxiliary samples from cluster centroids, our method achieves superior performance without introducing extra synthetic samples. Unlike ICE [10] and CAP [11], our method does not exploit camera information; under this camera-agnostic setting, it achieves 38.8% mAP and 69.8% Rank-1 accuracy on MSMT17, significantly outperforming prior methods.
To further validate its effectiveness in more challenging real-world scenarios, we conduct experiments on the SWRID dataset, as shown in Table 2. AdaInCV achieves 68.9% mAP and 80.2% Rank-1 accuracy, outperforming all baseline methods by a significant margin. Notably, the Hardest update strategy (ICE-style) yields the lowest performance on SWRID (41.3% mAP), with a more pronounced degradation than that observed on MSMT17. This observation supports our hypothesis that hardest-sample mining can be detrimental during early training stages, particularly for datasets with extreme intra-class variations caused by occlusion and multi-zone illumination. In contrast, the Adaptive strategy (without outlier filtering) already surpasses all fixed-strategy baselines. Further incorporating AdaOF leads to an additional improvement of 5.8 percentage points in mAP, which can be attributed to the informative and discriminative cues contained in the high proportion of outlier samples in SWRID.
3.4. Ablation Study
The performance gains of Adaptive Intra-Class Variation Contrastive Learning (AdaInCV) mainly stem from the proposed Adaptive Sample Mining (AdaSaM) and Adaptive Outlier Filtering (AdaOF) strategies. To evaluate the contribution of each component, we conduct ablation studies on Market-1501 and MSMT17, as summarized in Table 3. For conciseness, Table 3 reports mAP and Rank-1, which are the primary indicators of overall retrieval quality and top-match accuracy respectively; these two metrics are most sensitive to changes in the memory update strategy and are therefore most informative for component-level analysis. The Rank-5 and Rank-10 trends are consistent with those observed in Table 1.
Among different memory update strategies, the proposed Adaptive method consistently outperforms CM, Hardest, and Linear. Specifically, CM updates the memory using all intra-class embedding features; Hardest selects the instance with the lowest cosine similarity to the query; Linear follows a curriculum learning paradigm with a fixed easy-to-hard progression; and Adaptive dynamically adjusts the sample selection strategy. As illustrated in Figure 2, these strategies differ in their sample selection mechanisms.
Even without outlier handling, AdaSaM achieves the best performance, improving mAP by 1.1% and 3.0% on Market-1501 and MSMT17, respectively. Compared with Linear curriculum learning, the superior performance indicates that fixed easy-to-hard schedules are suboptimal, while adaptive selection better aligns with the model’s learning dynamics. Although both Linear and Adaptive strategies are inspired by curriculum learning, experimental results demonstrate that the samples selected by the Adaptive strategy are more appropriate.
This can be attributed to the fact that, in the early stages of training, selecting only the hardest samples may lead to biased optimization trajectories, whereas focusing solely on easy samples is insufficient to continuously improve model performance. Only by dynamically selecting samples that match the model’s current capability can the rationality and stability of the optimization process be ensured.
Notably, the Hardest strategy performs worst on both datasets, suggesting that selecting only the most difficult samples can mislead the model in early training stages. In addition, directly incorporating all outliers degrades performance compared with AdaSaM alone, highlighting that premature inclusion of noisy samples is detrimental. By contrast, AdaOF effectively mitigates this issue through adaptive outlier integration, leading to further performance gains.
In terms of training efficiency, our method requires a comparable or smaller number of training epochs while maintaining superior performance. We note that epoch count is used here as a proxy for convergence speed; direct learning-curve plots were not included in this study. As shown in Table 4, on Market-1501, our method achieves the best mAP (87.4%) with a comparable number of training epochs to ClusterContrast (CC). Notably, when trained for the same number of epochs as ClusterContrast, our method already achieves a higher mAP (83.2%) than CC’s converged performance. On the more challenging MSMT17 dataset, our method converges in fewer epochs (30 epochs) while outperforming all compared unsupervised SOTA methods in final accuracy, demonstrating its effectiveness in learning discriminative representations with improved training efficiency.
3.5. Occlusion and Illumination Robustness
To systematically evaluate AdaInCV under varying levels of difficulty, we partition the SWRID test set into three subsets based on occlusion severity and illumination conditions: (a) low difficulty (≥60% body visibility under favorable lighting), (b) medium difficulty (20–60% body visibility or moderately constrained indoor environments), and (c) high difficulty (≤20% visibility and/or adverse lighting conditions).
Table 5 presents a comparison between AdaInCV and HDCRL across these subsets. Although ISE achieves higher overall performance on SWRID (Table 2), HDCRL is selected for subset-level comparison because it is a dynamic hybrid contrastive learning method specifically designed for hard-sample scenarios, making it the most directly comparable reference for difficulty-stratified analysis.
AdaInCV’s advantage increases consistently with difficulty, with Rank-1 gains of +2.8, +9.3, and +20.1 percentage points on the low-, medium-, and high-difficulty subsets, respectively. This trend indicates that the proposed adaptive curriculum is particularly effective under severe occlusion and challenging lighting, where fixed-strategy baselines tend to saturate and fail to improve.
4. Discussion
The results across Market-1501, MSMT17, and SWRID consistently indicate that a key challenge in unsupervised person Re-ID lies in the large intra-class feature variation within pseudo-clusters. While this issue is relatively moderate in standard benchmarks such as Market-1501, it becomes more pronounced in MSMT17 due to its multi-camera and multi-condition setting, and is further exacerbated in SWRID by structured occlusion, appearance homogeneity, and extreme illumination variation.
The consistent improvements achieved by AdaInCV across all datasets suggest that per-cluster adaptive curriculum learning provides a general and principled solution for controlling intra-class variance under progressively more challenging conditions.
Failure Analysis of Hard-Sample Mining. The Hardest strategy exhibits a severe performance drop on SWRID (41.3% mAP for Hardest vs. 68.9% mAP for AdaInCV), indicating a critical failure mode. In substation scenarios, the “hardest” samples are often caused by illumination mismatch or occlusion rather than true semantic variation. Using such samples to update cluster centroids in early training distorts feature representations and propagates clustering errors. AdaInCV mitigates this issue by adaptively down-weighting unreliable clusters, delaying the influence of hard samples until more robust representations are learned.
Implications for Multi-View Safety Verification. A robust Re-ID model enables cross-camera retrieval of the same worker, addressing the limitations of single-view safety monitoring. Given a query from one camera, AdaInCV can retrieve corresponding instances from other views, facilitating more reliable multi-view analysis without requiring labeled data.
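The cross-camera retrieval step described here can be sketched in a few lines of NumPy. The function name, the camera-ID arrays, and the toy features are hypothetical; the sketch assumes L2-normalized embeddings so that the inner product equals cosine similarity, and it simply excludes gallery images taken by the query's own camera.

```python
import numpy as np

def cross_camera_retrieve(query_feat, query_cam, gallery_feats, gallery_cams, top_k=5):
    """Rank gallery images from other cameras by cosine similarity to the query."""
    sims = gallery_feats @ query_feat            # features assumed L2-normalised
    candidates = np.flatnonzero(gallery_cams != query_cam)
    ranked = candidates[np.argsort(-sims[candidates])]
    return ranked[:top_k], sims[ranked[:top_k]]

gallery_feats = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
gallery_cams = np.array([0, 1, 1, 2])
top_idx, top_sims = cross_camera_retrieve(
    np.array([1.0, 0.0]), 0, gallery_feats, gallery_cams, top_k=2)
```

The returned indices point to views of the most similar person from other cameras, which a downstream verification step could then inspect jointly with the query view.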
Contribution of Adaptive Outlier Utilization. The adaptive inclusion of outliers in AdaOF not only improves contrastive learning, but also emphasizes inherently ambiguous samples. These typically correspond to safety-critical scenarios (e.g., heavy occlusion or confined spaces), suggesting that the learned curriculum implicitly prioritizes difficult yet operationally important cases. A qualitative analysis of the visual characteristics and identity coverage of AdaOF-identified outliers is provided in Appendix A.
Limitations and Future Work. SWRID is limited to substations within a single region, and the generalization of AdaInCV to more diverse environments remains to be validated. Future work includes incorporating multi-modal data such as thermal imagery for illumination-invariant representation, and integrating Re-ID with downstream action recognition to enable automated safety violation detection.
5. Conclusions
We present AdaInCV, a fully unsupervised person Re-ID framework designed for substation worker safety monitoring. The key idea is to quantify intra-class feature variation within each pseudo-cluster after DBSCAN clustering, which implicitly reflects the degree of occlusion and illumination complexity associated with each identity. Based on this signal, AdaInCV performs cluster-level adaptive curriculum learning to control sample difficulty during training.
This framework is realized through two complementary components: AdaSaM (Adaptive Sample Mining), which dynamically adjusts positive sample selection for memory updates, and AdaOF (Adaptive Outlier Filtering), which progressively incorporates informative outlier samples—primarily heavy-occlusion and extreme-illumination images—as hard negatives.
Extensive experiments on Market-1501, MSMT17, and the SWRID dataset demonstrate that AdaInCV consistently achieves state-of-the-art performance among fully unsupervised learning (USL) methods. In particular, it yields substantial improvements under challenging conditions, including a +20.1 percentage-point gain in Rank-1 accuracy under high-occlusion and extreme-illumination settings compared with HDCRL. These results indicate that explicitly modeling cluster-wise difficulty is critical for robust unsupervised Re-ID in complex real-world environments.
Overall, AdaInCV provides a practical solution for annotation-free worker re-identification in power grid scenarios, enabling reliable cross-camera identity association that can support downstream applications such as multi-view verification and safety monitoring.