Person Re-Identification Enhanced by Super-Resolution Technology

Liu, Yue; Li, Zewen; Leng, Lu; Kim, Cheonshik

doi:10.3390/electronics14234647

Open AccessArticle

Person Re-Identification Enhanced by Super-Resolution Technology

¹

Jiangxi Provincial Key Laboratory of Image Processing and Pattern Recognition School of Software, Nanchang Hangkong University, Nanchang 330063, China

²

School of Software, Nanchang Hangkong University, Nanchang 330063, China

³

Department of Computer Engineering, Sejong University, Seoul 05006, Republic of Korea

^*

Authors to whom correspondence should be addressed.

Electronics 2025, 14(23), 4647; https://doi.org/10.3390/electronics14234647

Submission received: 31 October 2025 / Revised: 21 November 2025 / Accepted: 24 November 2025 / Published: 26 November 2025

(This article belongs to the Special Issue Digital Security and Privacy Protection: Trends and Applications, 3nd Edition)

Download

Browse Figures

Versions Notes

Abstract

With rising demand for cross-camera person re-identification (ReID) in smart cities, low-resolution (LR) images severely hinder practical ReID performance due to detail loss and weakened identity features. This paper proposes two solutions to address this bottleneck: (1) super-resolution (SR) techniques, including hybrid attention transformer (HAT), pixel-level and semantic-level adjustable SR (PiSA-SR), and omni aggregation networks for lightweight image SR (Omni-SR), are used to enhance image visual quality, and the enhanced images are applied to three ReID methods, including semantically controllable self-supervised learning framework-REID (SOLIDER-REID), light-REID, and relation-aware global attention (RGA), for performance assessment. (2) An end-to-end framework integrating HAT and SOLIDER-REID is designed, in which HAT enhances LR images via multi-scale attention to restore discriminative details, while SOLIDER-REID’s semantic controller suppresses background noise to focus on the pedestrian regions. Extensive experiments on the Market-1501 dataset show that the first solution slightly improves ReID accuracy, e.g., PiSA-SR + SOLIDER-REID achieves 92.0% mAP, 0.4% higher than SOLIDER-REID alone, while slightly sacrificing speed. The second solution significantly boosts LR ReID performance at the cost of a certain increase in time. For LR images, even 32 × 32 images, HAT-SOLIDER achieves 59.8% mAP and 80.4% Rank-1, 18.5% higher in mAP and 19.2% higher in Rank-1 than SOLIDER-REID alone. This work provides effective solutions for LR-induced performance degradation in cross-camera ReID.

Keywords:

low-resolution person re-identification; super-resolution image enhancement; cross-camera person retrieval; hybrid attention transformer; semantic-controllable learning

1. Introduction

1.1. Background

The global smart surveillance market has experienced remarkable growth in recent years, driven by the expansion of smart city initiatives and rising public safety concerns. According to relevant materials [1], countries around the world have continuously increased the number of surveillance devices to ensure the security of public areas. For instance, London added over 500,000 surveillance cameras for public security during the 2012 Olympics. As of 2021, it is roughly estimated that the number of surveillance cameras deployed worldwide had reached approximately 1 billion, with annual shipments of surveillance cameras alone hitting 72 million units that year. This surge in deployment reflects the urgent need for reliable video analytics, among which cross-camera person re-identification (ReID) is a key capability.

ReID aims to retrieve the same individual across non-overlapping camera views, enabling continuous tracking of pedestrian trajectories in large-scale environments, e.g., airports, shopping malls, and university campuses [2]. Unlike facial recognition, which requires high-quality frontal face images, ReID leverages holistic pedestrian features [3], including clothing color, body shape, and accessory styles to match identities, making it more adaptable. This approach makes ReID more adaptable to unconstrained surveillance scenarios. However, traditional ReID systems face a fundamental challenge: low-resolution (LR) images.

1.2. Motivation and Contributions

In practical surveillance setups, LR images are unavoidable due to two primary factors:

Hardware Constraints: Budget limitations often force the use of low-cost cameras with limited sensor resolution, especially in large-scale deployments [4], e.g., street-level monitoring. To control upfront procurement costs, many projects prioritize low-cost camera models. These cameras usually come with small-sized image sensors and have limited native resolution, usually only up to 720p or lower. An image captured by a low-cost camera is shown in Figure 1.

Distance Limitations: Pedestrians are frequently captured from long distances, leading to blurred images. The need to balance monitoring coverage and imaging clarity often results in LR pedestrian images [5]. To cover wider areas, surveillance cameras are usually mounted at high positions or far from pedestrian activity zones. This long-distance capture means pedestrians appear as small targets in the frame, often occupying only a few dozen or even a few pixels, which directly leads to blur, blurring of edge details, and loss of discriminative features. An image taken from a long distance is shown in Figure 2.

The Impact of Low Resolution on ReID Performance. LR images severely degrade ReID performance due to the loss of discriminative details and weakened identity features [6]. For example, critical visual cues such as stripe patterns on clothing or the shape of a backpack—essential for distinguishing visually similar individuals—are often lost in LR inputs. Empirical evidence from the Market-1501 dataset [7] shows that SOLIDER-REID [8] suffers a dramatic 51.3% drop in mean Average Precision (mAP) when evaluated on 32 × 32 LR images compared to 64 × 128 high-resolution (HR) images. This performance gap highlights the critical need for effective strategies to mitigate LR-induced degradation in real-world ReID applications.

Super-Resolution (SR) technology, a core branch of computer vision, offers a promising solution by reconstructing HR details from LR inputs through learned non-linear mappings [9]. Although SR has been widely applied in image restoration, its integration with ReID remains underexplored. Key questions—such as which SR techniques best complement different ReID architectures, and how to design end-to-end frameworks that align SR and ReID objectives—have not been adequately addressed.

This paper attempts to address the LR problem in person ReID, and the contributions are summarized as follows.

Systematic Evaluation of SR Techniques for ReID: We compare three state-of-the-art (SOTA) SR methods, namely hybrid attention transformer (HAT) [10], pixel-level and semantic-level adjustable SR (PiSA-SR) [11], and omni aggregation networks for lightweight image SR (Omni-SR) [12], on three representative ReID models, specifically semantically controllable self-supervised learning framework-REID (SOLIDER-REID) [13], light-REID [14], and relation-aware global attention (RGA) [15], quantifying their impact on accuracy metrics such as mAP, Rank-n, and inference speed.

End-to-End HAT-SOLIDER Framework: We propose a novel hybrid framework that jointly optimizes SR based on HAT and ReID based on SOLIDER-REID. This integrated approach eliminates information loss from standalone SR preprocessing and enhances discriminative feature learning in LR scenarios through semantic-guided attention and joint loss optimization.

The rest of this paper is organized as follows: The related works are reviewed in Section 2. The methodology is specified in Section 3. The experimental results are provided in Section 4. In Section 5, we conclude this work.

2. Related Works

2.1. SR Enhancement

SR technology aims to reconstruct HR images from LR inputs, with the methods evolving from traditional interpolation to deep learning-driven approaches.

(1): Traditional Interpolation-Based SR

Traditional interpolation methods are lightweight and real-time, making them suitable for resource-constrained devices, e.g., edge cameras. Common techniques include:

Nearest Neighbor Interpolation maps each LR pixel to HR by copying values to neighboring pixels, but produces blocky artifacts.

Bilinear Interpolation typically uses a 2 × 2 neighborhood to compute pixel values via linear weighting, reducing blockiness but blurring edges.

Bicubic Interpolation generally extends to a 4 × 4 neighborhood with cubic polynomials, improves edge smoothness, but still fails to recover high-frequency details [16].

These methods are computationally efficient, with a computational complexity of O(W × H), where W and H represent the width and height of the image, respectively. However, they suffer from inherent limitations. They only upsample pixels through local neighborhood mapping but do not learn semantic or structural information, which leads to over-smoothing and poor visual quality for ReID [17].

(2): Deep Learning-Based SR

Deep learning has revolutionized SR by enabling end-to-end learning of LR-HR mappings. Dong et al. [18] proposed the first convolutional neural network (CNN)-based SR model, called SRCNN (2014). This model employed three convolutional layers to perform feature extraction, map LR images to HR images, and reconstruct the images. While it outperformed traditional SR methods, it had a limited capacity for recovering high-frequency details. Kim et al. [19] developed very deep SR (VDSR) (2016), a deep CNN consisting of 20 layers. They incorporated residual learning into the model to address the gradient vanishing problem. VDSR achieved SOTA Peak Signal-to-Noise Ratio (PSNR) by leveraging large receptive fields to capture global contextual information. Ledig et al. [20] integrated a generative adversarial network (GAN) into SR technology (2017) (SRGAN). This model used a generator to produce HR images and a discriminator to distinguish between real and generated HR images. Ledig et al. observed that SRGAN improved perceptual quality, evaluated via Mean Opinion Score (MOS), but was prone to mode collapse. Lim et al. [21] proposed enhanced deep SR (EDSR) (2017), in which they removed batch normalization layers and added residual scaling to enhance training stability. EDSR achieved higher PSNR than VDSR on benchmark datasets.

These models outperform traditional methods, but they still face some critical bottlenecks for ReID integration:

Computational Cost: Deep CNN-based SR models, e.g., VDSR, require large floating-point operations per second (FLOPs), which slow down ReID inference.

Data Dependency: They rely on large-scale LR-HR paired datasets, which are scarce for pedestrian-specific images.

(3): Semantic-Guided Adaptive SR Optimization

Recent SR methods have incorporated semantic information to prioritize the reconstruction of task-relevant regions, for example, the pedestrian bodies in ReID, and this addresses the limitations of generic SR.

Sun et al. [22] presented spatially adaptive feature modulation (SAFM) (2023). This approach dynamically adjusted feature weights according to semantic regions, for example, pedestrians and background. SAFM enhanced detail recovery in critical areas while suppressing noise. Li et al. [23] developed a Semantic-Aware Discriminator (SeD) (2024). This discriminator used pre-trained vision models, such as the vision transformer (ViT), to extract semantic labels. These labels guided the SR generator to learn texture details consistent with human perception. Li et al. validated the effectiveness of this semantic guidance in improving SR performance. Sun et al. [11] proposed PiSA-SR (2025), a dual low-rank adaptation (LoRA) approach. This method decoupled pixel-level objectives, such as denoising and PSNR optimization, and semantic-level objectives, such as texture generation and learned perceptual image patch similarity (LPIPS) optimization. PiSA-SR allows adjustable control over denoising intensity and detail freedom, making it flexible for person ReID tasks.

These methods are compared in Table 1. Among them, only semantic-guided SR methods are particularly suitable for ReID, as they focus on restoring pedestrian-specific semantic details rather than generic image quality.

2.2. Person ReID

ReID techniques have evolved from manual feature engineering to deep learning-driven feature learning [24], with a focus on robustness to view changes, occlusion, and LR conditions.

(1): Traditional ReID relies on handcrafted features and statistical learning.

Color Features use red, green, blue (RGB) and hue, saturation, value (HSV) histograms to capture clothing color distribution, but are sensitive to illumination changes [16].

Shape Features use Histogram of oriented gradients (HOG) models for body contour, but are less effective for LR pedestrians [16].

Texture Features typically use filters, such as Gabor filters, to extract local texture patterns such as stripes and checks, but they are prone to noise in LR images [16].

These methods are computationally cheap but suffer from limited feature expression, so their performance degrades sharply in complex scenarios such as cross-camera view changes or partial occlusion [7].

(2): Deep Learning-based ReID dominates current research, which can be categorized according to supervision type and modality.

Supervised ReID uses manually annotated identity labels to train models, such as residual networks (ResNet) and transformers, to map pedestrians to a feature metric space where the same identities are clustered.

Sun et al. [25] introduced a part-based convolutional baseline (PCB) (2018) for person ReID. This model divided pedestrians into horizontal parts and extracted local features to handle occlusion. Liu et al. [26] developed ReID with Swin Transformer (Swin-ReID) (2022), which adopted the shifted windows mechanism of Swin Transformer to capture both global and local features. This design enhances the robustness to cross-view variations, thereby improving cross-camera person ReID performance.

Unsupervised ReID is a paradigm that reduces annotation costs by using clustering. Zheng et al. [27] proposed a group-aware label transfer (GLT) (2021) method. This approach integrated clustering and feature learning to refine pseudo-labels. Zheng et al. discovered that GLT achieved SOTA performance in unsupervised domain adaptation for ReID. Chen et al. [13] presented SOLIDER-REID (2023), a self-supervised ReID framework. It leveraged human image prior knowledge, such as body part information and a semantic controller, to adapt to downstream ReID tasks. SOLIDER-REID outperformed supervised models on LR datasets by utilizing pixel-level semantic supervision.

Single-Modal ReID only uses visible light images, which is the most common scenario in surveillance [6].

Cross-Modal ReID bridges different modalities, e.g., visible-infrared, RGB-depth, to handle low-light or nighttime conditions. Ren and Zhang [28] proposed an implicit discriminative knowledge learning (IDKL) (2024) method. This model extracted modality-shared features to reduce cross-modal discrepancy.

These methods are compared in Table 2. Deep learning-based ReID significantly improves robustness, but LR images still pose a major challenge, especially for models relying on fine-grained details such as texture and accessories.

2.3. Integration of SR and ReID

Existing efforts to integrate SR and ReID can be categorized into two paradigms.

(1): SR as Preprocessing: Most prior works treat SR as a standalone preprocessing step: LR images are first enhanced via SR, then fed into a ReID model. Wu et al. [29] developed SR-DSFF (2022), a two-stage approach for cross-resolution person ReID. It first used SR to enhance LR images, then fused multi-scale features for ReID. SR-DSFF improved mAP by 5–8% on LR datasets but suffered from information loss during the feature transfer between the SR module and the ReID module. Wang et al. [30] proposed SR-ReID (2018), which combined SRGAN with a ResNet-based ReID model. However, SRGAN’s inherent mode collapse led to inconsistent detail generation, which limited the overall ReID performance.
(2): End-to-End Integration: Few works have explored the joint optimization of SR and ReID end-to-end. Dong et al. [18] introduced the SR-ReID Joint Framework (2014), which integrated a lightweight SR CNN with a ReID Transformer in an end-to-end manner. Nevertheless, this framework lacked semantic guidance for pedestrian-specific regions, which restricted its performance in LR person ReID scenarios.

3. Methodology

This paper designs two solutions, and the code is released at https://github.com/lllyyyxxxs/ReID.git (accessed on 29 October 2025). The first solution enhances dataset quality via the SR method, namely HAT, PiSA-SR, and Omni-SR, to boost image resolution, then uses these HR images for ReID methods, namely SOLIDER-REID, light-REID, and RGA. The second solution is an improvement framework, synergizing HAT and SOLIDER-REID to balance detection accuracy and efficiency. This implementation integrates the HAT SR network as a pre-processing or feature-extraction module inside the Swin Transformer architecture, forming a module-level (structural) model fusion. LR input images are processed by HAT’s multi-scale attention pyramid, and SOLIDER-REID’s semantic controller dynamically suppresses background noise to enhance discriminative features of pedestrian regions, improving ReID accuracy in LR scenarios. Both solutions optimize ReID model accuracy and reduce the impact of low image visual quality.

3.1. Model Paradigm Definitions

To clarify the experimental design, we explicitly define the two paradigms for integrating SR and ReID models used in this work:

Two-Stage Independent Model (denoted as A + B, e.g., HAT + SOLIDER): In this paradigm, the SR model (e.g., HAT) and the ReID model (e.g., SOLIDER) are independently trained and executed. The LR input image is first upscaled by the pre-trained SR model to generate an HR image. This generated image is then fed into the pre-trained ReID model for feature extraction and identity matching. There is no gradient backpropagation between the two models; the SR process is non-differentiable and opaque to the ReID task.

End-to-End Integrated Model (denoted as A-B, e.g., HAT-SOLIDER): This paradigm represents our core proposed improvement. As detailed in Section 3.4, it embeds the HAT module into the shallow layers of the SOLIDER-ReID backbone, forming a unified network trained end-to-end. Here, SR feature enhancement and ReID feature extraction occur within a single forward pass and are co-optimized via a joint loss function, allowing gradients from the ReID task to directly guide the SR module’s learning. The key distinctions between these paradigms are summarized in Table 3.

3.2. SR Enhancement

Surveillance images often suffer from indistinct pedestrian features due to LR, artifacts, and noise, reducing recognition accuracy. Traditional image enhancement methods lack detail recovery capability, so three SOTA deep learning-based SR methods are selected, which can learn the non-linear mappings between LR and SR images, recovering high-frequency details.

HAT combines channel attention and window self-attention, optimized via a cross-attention block (OCAB). OCAB reduces edge discontinuities through cross-window feature interaction, with channel attention guiding global enhancement and window self-attention focusing on local details to boost reconstruction precision.

PiSA-SR uses dual LoRA modules. Pixel-level LoRA addresses noise and blurriness via L2 loss to improve PSNR; semantic-level LoRA generates human-perceptible details such as hair texture, using LPIPS loss and classifier score distillation (CSD), with adjustable parameters for denoising intensity and detail freedom.

Omni-SR, with an omni-scale aggregation group (OSAG), replaces convolution with spatial-channel interaction to reduce parameters. OSAG extracts edges and captures textures to enhance encoding efficiency in complex scenes.

The standard HAT, PiSA-SR, and Omni-SR models are mainly designed for isotropic SR. Isotropic SR refers to upscaling an LR image by the same scaling factor in both the length and width directions. For example, a 32 × 32 image is upscaled to 64 × 64, and the magnification is isotropic.

3.3. Person ReID

Traditional ReID heavily relies on manually designed features and statistical models, which degrade in complex scenarios. In contrast, deep learning-based ReID enhances robustness via end-to-end feature learning. We compare three ReID methods.

SOLIDER-REID performs pixel-level feature clustering to generate pseudo-labels, uses occlusion augmentation (Masked Autoencoder) to boost robustness, and integrates a semantic controller in ResNet to adjust semantic weight, suitable for high-precision scenarios.

Light-REID achieves efficient inference via model compression and knowledge distillation, with TensorRT acceleration for low latency.

RGA uses multi-scale fusion and adversarial training to handle occlusion, with strong generalization via global attention.

3.4. Model Improvements

Although PiSA-SR performs slightly better, its independent dual-LoRA modules need preset parameters for pixel and semantic optimization, so they only suit standalone preprocessing rather than ReID model improvement. HAT has modular hybrid attention that combines channel attention and window attention. It can be embedded into ReID backbones, such as SOLIDER-REID’s Swin Transformer, aligning with ReID’s focus on pedestrian details and supporting end-to-end training, which makes it more suitable for the SR task. HAT’s structure is shown in Figure 3.

SOLIDER-REID excels in accuracy, so we select it for the ReID task. The SOLIDER-REID network structure is shown in Figure 4.

This paper uses HAT for image enhancement, its hybrid attention and pre-training to ensure superior detail recovery and scalability. For ReID, SOLIDER-REID is chosen for its semantic controller and self-supervised pre-training, enabling dynamic matching in complex scenes. An improved framework of HAT and SOLIDER-REID is proposed, in which HAT’s hybrid attention block activates more pixel information, providing finer semantic details, significantly improving LR ReID quality.

The HAT-SOLIDER framework proposed in this paper achieves structural-level deep fusion. The HAT-SOLIDER fusion model network structure is shown in Figure 5. Its core innovation lies in embedding HAT as a feature enhancement module into the backbone network of SOLIDER and performing end-to-end optimization through a joint loss function, rather than adopting a simple pipelined concatenation. The specific integration mechanism and advantages are detailed as follows:

Embedded Integration Instead of Concatenation: Unlike two-stage models (e.g., HAT + SOLIDER) that treat HAT as an independent preprocessor, this framework embeds HAT’s hybrid attention module into the shallow layers of the Swin Transformer. The specific data flow is:

(1): Feature Extraction and Enhancement: The LR input image first passes through a shallow convolutional layer to extract initial features $F_{0}$ . These features are then fed into the integrated HAT module.
(2): Attention-Driven Feature Refinement: Inside HAT, the hybrid attention mechanism (combining channel attention and window self-attention) processes $F_{0}$ . Channel attention guides global enhancement, while window self-attention focuses on local detail recovery. This generates detail-rich features $F_{D F}$ .
(3): Feature Fusion: The initial features $F_{0}$ and the enhanced features $F_{D F}$ are fused via a residual connection to form the high-quality feature map, as shown in Equation (1).
(4): Semantic-Guided Attention: The fused feature map is then processed by SOLIDER-REID’s semantic controller. This controller dynamically analyzes the feature map, amplifying the weights of features in pedestrian regions while suppressing background noise, thereby focusing the model on identity-discriminative semantics.
(5): Backbone Processing: The refined feature map is subsequently passed to the deeper layers of the Swin Transformer for further feature extraction and ReID matching.

Joint Optimization and Bidirectional Guidance: We design a joint loss function, as shown in Equation (3). The key of this function lies in two aspects. First, the ReID task guides the SR process: gradients from ReID losses (

L_{d i n o}

,

L_{s m}

) are directly backpropagated to the HAT module, guiding it to recover details most effective for identity discrimination and realizing “SR for ReID”. Second, the SR task assists the ReID process: the SR loss

L_{S R}

ensures that the recovered features conform to visual authenticity, providing high-quality inputs for ReID.

Mechanism advantages: Compared with concatenated models, the mechanistic advantages of this framework are threefold.

Avoiding information loss: Lossless enhancement is performed in the feature space, eliminating the information loss caused by image reconstruction and re-encoding in concatenated models.

Task-driven learning: The SR process is directly optimized by ReID objectives, generating features more favorable for recognition.

Global optimality: It supports the collaborative update of parameters in both the SR and ReID modules, enabling the search for a globally optimal solution.

The proposed “structural-level deep fusion” of HAT-SOLIDER is fundamentally different from a well-designed serial model with skip connections in the following qualitative ways:

(1): Optimization Objective: Task-Driven and Reconstruction-Driven

In a serial model (A + B), the SR module is optimized for pixel-level fidelity (e.g., PSNR/SSIM), independent of the ReID task.

In HAT-SOLIDER (A-B), the embedded HAT module is co-optimized end-to-end with the ReID objective. Gradients from the ReID loss directly guide HAT to recover identity-discriminative features rather than just visually pleasing details.

(2): Information Flow: Feature Enhancement and Pixel Reconstruction

A serial model operates in pixel space: it reconstructs an HR image, which is then re-encoded by the ReID backbone, inevitably losing information.

Our fusion operates in feature space: HAT enhances the shallow features

F_{0}

directly, bypassing explicit HR image generation. This feature-level fusion avoids the “reconstruction-and-re-encoding” bottleneck.

(3): Gradient Flow: Synergistic and Interrupted

In a serial model, gradient flow from ReID to SR is weak or non-existent, preventing the SR process from adapting to the needs of ReID.

Our joint training creates a synergistic feedback loop. The ReID loss shapes the SR feature enhancement, and the SR loss regularizes the feature learning, leading to a unified model where both components work towards a single, superior ReID performance, especially on extreme LR inputs.

SOLIDER-REID adopts the Swin Transformer as the ReID backbone network, utilizing its global attention mechanism to capture cross-resolution semantic correlations, and introduces multi-granularity feature constraints to force the SR images and HR images to align in the semantic space. The SR stage can be expressed as

I_{H Q} = H_{R e c} (F_{0} + F_{D F})

(1)

where

F_{0}

is shallow convolutional features and

F_{D F}

is mixed attention features.

Feature extraction for person ReID can be expressed as

F = B a c k b o n e (I_{H Q})

(2)

where Backbone is the Swin Transformer.

Semantic clustering and loss calculation can be expressed as

L_{t o t a l} = α L_{d i n o} + (1 - α) L_{s m} + λ L_{S R}

(3)

where

L_{t o t a l}

is the total loss

L_{d i n o}

is the distillation and no labels (dino) loss,

L_{s m}

is the self-supervised loss of SOLIDER,

L_{S R}

is the SR reconstruction loss,

α

is the weight coefficient and

λ

is 0.5.

4. Experiments and Results

4.1. Datasets

Dataset: Market-1501 [8] contains 12,936 training images and 19,732 testing images. The sizes of all images are 64 × 128. Some original images of Market-1501 are shown in Figure 6a.

The reasons for choosing the Market-1501 dataset are as follows: (1) this dataset contains a large number of pedestrian images captured across multiple cameras, meeting the cross-camera ReID requirements of real-world surveillance scenarios; (2) the original size of 64 × 128 can be downsampled to generate 32 × 32 extreme LR samples, effectively simulating real scenarios of ‘long-distance shooting’ or ‘LR cameras’; and (3) the dataset has a sufficient number of samples (12,936 for training and 19,732 for testing), ensuring the stability and generalization of model training.

LR Test: To test LR images, we downsampled the images of Market-1501 dataset to 32 × 32 pixels, ultimately forming an LR dataset. Some downsample images from Market-1501 are shown in Figure 6b.

4.2. Parameter Settings

The HAT is pretrained on ImageNet, with the SR module upsampling LR images by 2× to a size of 128 × 256. It is trained for 200 epochs with a batch size of 8 using the Adam optimizer with a learning rate of 1 × 10⁻⁵. The SOLIDER-REID is pretrained on DF2K, with an input size of 384 × 128 (as required by DF2K). It employs data augmentation and hybrid sampling techniques and is trained using stochastic gradient descent (SGD) with a learning rate of 8 × 10⁻⁴, which undergoes 20 epochs of cosine warm-up. The model is trained for 120 epochs with a batch size of 64, and evaluated using metrics of mAP and Rank-n. For the HAT-SOLIDER, the input size is 32 × 32 LR. It is trained with a learning rate of 5 × 10⁻⁴, L2 regularization of 1 × 10⁻⁴, and the bias learning rate is set to twice that of the weights. The model uses joint loss, extracts features before the classification layer, and is trained for 200 epochs with a batch size of 8.

4.3. Evaluation Indicators

4.3.1. Image SR Reconstruction

PSNR is used to measure image quality, with a larger value indicating higher image quality.

SSIM evaluates the similarity between two images from three aspects: brightness, contrast, and structure. The value range is [−1, 1], with a value closer to 1 indicating better SR results.

4.3.2. Person ReID

MAP is obtained by calculating the average precision (AP) of each category and then taking the average, comprehensively reflecting the model’s performance in multi-category scenarios.

Rank-n Hit Rate represents the probability that the correct match is included in the Top n results returned by the model. For example, Rank-1 refers to the proportion of correct matches in the first position, and Rank-5 refers to the probability that the correct result exists in the top five positions.

4.4. Experimental Results and Analysis

4.4.1. SR Enhancement

The PSNR and SSIM results of three SR models (HAT, PiSA-SR, and Omni-SR) on the Market-1501 dataset are shown in Table 4.

Experiments show that the quality of images enhanced by PiSA-SR is similar to that of HAT, and the images are smoother; the images processed by Omni-SR retain some noise and have artifacts. The original SR images of the Market-1501 dataset are shown in Figure 7.

4.4.2. Person ReID

The accuracies of SOLIDER under different SR technologies are shown in Table 5. On the Market-1501 dataset, the mAP of PiSA-SR + SOLIDER is the highest at 92.0%, which is 0.4% higher than that of SOLIDER without SR.

Under the same training settings with a batch size of 128, we recorded the training throughput (in samples per second). The original SOLIDER model achieved an average throughput of 17.9 samples/s. HAT + SOLIDER achieved an average throughput of 15.5 samples/s. PiSA-SR + SOLIDER achieved an average throughput of 14.1 samples/s. Omni-SR + SOLIDER achieved an average throughput of 15.0 samples/s.

The accuracies of Light-REID under different SR technologies are shown in Table 6. On the Market-1501 dataset, the mAP of PiSA-SR + light-REID is 89.3%, which is 0.3% higher than light-REID without SR.

The accuracies of RGA under different SR technologies are shown in Table 7. On the Market-1501 dataset, the mAP of PiSA-SR + RGA is 85.3%, which is 0.3% higher than RGA without SR.

The accuracy of SOLIDER under the HAT and end-to-end frameworks on the LR test is shown in Table 8. On the 32 × 32 LR Market-1501 dataset, with the SR expanded by 2 times to a size of 64 × 64, the improved model HAT-SOLIDER has a mAP of 59.8% and a Rank-1 of 80.4%. Compared with the SOLIDER (mAP 40.3%, Rank-1 61.2%), it has a significant improvement, with Rank-1 increasing by 19.2% and mAP increasing by 19.5%. The LR and SR images of the LR test are shown in Figure 8.

4.5. Discussion and Analysis

(1): Improvements with SR: The models relying on multi-scale features (e.g., SOLIDER, light-REID) show slight improvements on mAP and Rank-1 with SR (HAT, PiSA-SR), as the restored details strengthen fine-grained features.
(2): Accuracy Degradation with SR on Global Relational Models: The models using global relational modeling (e.g., RGA) may perform worse than expected, as SR can disrupt global relationships.
(3): Advantage of the End-to-End Framework: HAT-SOLIDER excels even on extreme LR (32 × 32) via end-to-end optimization, avoiding independent SR’s information loss, with 19.2% Rank-1 and 19.5% mAP gain. The superior performance of the proposed HAT-SOLIDER framework on extreme LR images (32 × 32) stems from its task-driven, end-to-end design, which enables synergistic collaboration between the SR module and the semantic controller. Unlike the two-stage HAT + SOLIDER model, which suffers from information loss during the separate image reconstruction and re-encoding steps, our integrated model performs feature-level enhancement directly within the ReID backbone. This allows the semantic controller to provide immediate feedback, guiding the HAT module via backpropagation to recover identity-discriminative details in pedestrian regions while suppressing irrelevant background noise. The joint optimization via the combined loss function ensures that super-resolution is explicitly tailored for the re-identification task, resulting in significantly more robust feature representations for low-resolution inputs.
(4): Why End-to-End Excels in Extreme LR: At 32 × 32 resolution, most high-frequency details are permanently lost. However, our HAT-SOLIDER framework does not “recover” lost pixels but learns to augment discriminative features in the latent space. The semantic controller guides the HAT module to emphasize pedestrian regions, while the joint loss ensures that enhanced features align with ReID objectives.

To validate this, we visualize the attention heatmaps of HAT-SOLIDER on the LR and SR images of the LR test, which are shown in Figure 9. The color contrast of SR is more pronounced, with warmer tones in the human body section compared to LR, and the color transitions are also clearer. This allows the heat distribution differences in the data to be displayed more intuitively, making it easier for observers to quickly identify key information. In contrast, the colors of LR are relatively softer, and the visual impact and differentiation of high-heat areas are not as strong as those of SR.

5. Conclusions

This paper addressed LR-induced performance degradation in cross-camera person ReID via two solutions. The first solution is SR preprocessing. HAT, PiSA-SR, and Omni-SR enhanced image quality, slightly improving ReID accuracy (0.3~0.4% mAP gain) for SOLIDER-REID and light-REID, while slightly sacrificing speed. PiSA-SR performed best due to its adjustable pixel/semantic-level optimization. The second solution is the end-to-end HAT-SOLIDER framework. Joint optimization of SR and ReID significantly boosted LR ReID performance (19.2% Rank-1 and 19.5% mAP gain even on 32 × 32 images), eliminating information loss from independent preprocessing. The findings confirm that SR technology is a viable solution for LR ReID, and end-to-end integration of SR and ReID is more effective than two-stage strategies, which provides a practical reference for smart city surveillance systems.

Limitations: This work has limitations that point to valuable future research. We lack detailed computational metrics (e.g., FLOPs, FPS), limiting precise assessment of the model’s practical cost-effectiveness. Future Work: Cross-Modal Extension: We plan to extend the semantic-guided enhancement mechanism to cross-modal tasks (e.g., visible-infrared ReID), leveraging SR to bridge the modality gap. And we will supplement the standard deviations of experiments with multiple randomized seeds and conduct statistical tests to enhance result credibility, while also adding more comparative experiments with end-to-end or lightweight baselines.

Author Contributions

Methodology, Z.L. and Y.L.; formal analysis, Y.L. and L.L.; data curation, Y.L.; writing—original draft preparation, Z.L., L.L. and C.K.; Conceptualization, Z.L., Y.L. and C.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Natural Science Foundation of China (No. 62466038), Jiangxi Provincial Key Laboratory of Image Processing and Pattern Recognition (No. 2024SSY03111), Jiangxi Provincial Natural Science Foundation (Key Program) (No. 20242BAB26015), Open Foundation of Jiangxi Provincial Key Laboratory of Image Processing and Pattern Recognition (ET202404437), High Performance Computing Service of Information Center, Nanchang University.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://pan.baidu.com/s/1ntIi2Op (accessed on 9 October 2025).

Acknowledgments

We thank the anonymous reviewers for their valuable suggestions that improved the quality of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ReID	Person Re-identification
SR	Super-Resolution
LR	Low-Resolution
HR	High-Resolution
HAT	Hybrid Attention Transformer
PiSA-SR	Pixel-level and Semantic-level Adjustable SR
Omni-SR	Omni Aggregation Networks for Lightweight Image SR
SOLIDER-REID	Semantically Controllable Self-Supervised Learning Framework-REID
RGA	Relation-Aware Global Attention
PSNR	Peak Signal-to-Noise Ratio
SSIM	Structural Similarity Index
mAP	Mean Average Precision
CNN	Convolutional Neural Network
GAN	Generative Adversarial Network
LoRA	Low-Rank Adaptation
LPIPS	Learned Perceptual Image Patch Similarity
CSD	Classifier Score Distillation
SGD	Stochastic Gradient Descent
SOTA	state-of-the-art

References

Yu, Z.; Cai, Y.; Xu, H.; Chen, L.; Yang, M.; Sun, H.; Zhao, X. An Attention-Enhanced Network for Person Re-Identification via Appearance–Gait Fusion. Electronics 2025, 14, 4142. [Google Scholar] [CrossRef]
Chen, Y.C.; Zhu, X.T.; Zheng, W.S.; Lai, J.-H. Person Re-identification by Camera Correlation Aware Feature Augmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 392–408. [Google Scholar] [CrossRef] [PubMed]
Liu, H.Z.; Liu, J.; Zhang, Y.H.; Chen, M.; Liu, J.; He, H.H. Progressive Feature Refining with the Deformable-Guidance Describer for Dense Video Captioning. Expert Syst. Appl. 2025, 294, 128778. [Google Scholar] [CrossRef]
Wang, Q.; Feng, G.; Li, Z. A Lightweight Person Detector for Surveillance Footage Based on YOLOv8n. Sensors 2025, 25, 436. [Google Scholar] [CrossRef]
Tang, Y.; Yang, X.; Jiang, X.; Wang, N.; Gao, X. Dually Distribution Pulling Network for Cross-Resolution Person Reidentification. IEEE Trans. Cybern. 2022, 52, 12016–12027. [Google Scholar] [CrossRef] [PubMed]
Yan, L.C.; Wang, F.; Leng, L.; Teoh, A.B.J. Toward Comprehensive and Effective Palmprint Reconstruction Attack. Pattern Recognit. 2024, 155, 110655. [Google Scholar] [CrossRef]
Zheng, L.; Shen, L.Y.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE Press: Piscataway, NJ, USA, 2015; pp. 1116–1124. [Google Scholar]
Wei, W.Y.; Yang, W.Z.; Zuo, E.G.; Qian, Y.; Wang, L. Person Re-identification Based on Deep Learning—An Overview. J. Vis. Commun. Image Represent. 2022, 82, 103418. [Google Scholar] [CrossRef]
Wang, Z.H.; Chen, J.; Hoi, S.C.H. Deep Learning for Image Super-Resolution: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 40, 3365–3387. [Google Scholar] [CrossRef] [PubMed]
Chen, X.Y.; Wang, X.T.; Zhou, J.T.; Qiao, Y.; Dong, C. Activating More Pixels in Image Super-Resolution Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2023; IEEE Press: Piscataway, NJ, USA, 2023; pp. 22367–22377. [Google Scholar]
Sun, L.C.; Wu, R.Y.; Ma, Z.Y.; Liu, S.; Yi, Q.; Zhang, L. Pixel-Level and Semantic-Level Adjustable Super-Resolution: A Dual-Lora Approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Angeles, CA, USA, 16–22 June 2025; IEEE Press: Piscataway, NJ, USA, 2025; p. 33357. [Google Scholar]
Wang, H.; Chen, X.H.; Ni, B.B.; Liu, Y.; Liu, J. Omni Aggregation Networks for Lightweight Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2023; IEEE Press: Piscataway, NJ, USA, 2023; pp. 22378–22387. [Google Scholar]
Chen, W.H.; Xu, X.Z.; Jia, J.; Luo, H.; Wang, Y.; Wang, F.; Jin, R.; Sun, X. Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2023; IEEE Press: Piscataway, NJ, USA, 2023; pp. 15050–15061. [Google Scholar]
Wang, G.A.; Huang, X.W.; Gong, S.G.; Zhang, J.; Gao, W. Faster Person Re-identification: One-Shot-Filter and Coarse-to-Fine Search. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3013–3030. [Google Scholar] [CrossRef] [PubMed]
Zhang, Z.Z.; Lan, C.L.; Zeng, W.J.; Jin, X.; Chen, Z. Relation-Aware Global Attention for Person Re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE Press: Piscataway, NJ, USA, 2020; pp. 3183–3192. [Google Scholar]
Xing, E.P.; Ng, A.Y.; Jordan, M.I.; Russell, S. Distance Metric Learning with Application to Clustering with Side-Information. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–14 December 2002; MIT Press: Cambridge, MA, USA, 2002; pp. 521–528. [Google Scholar]
Lin, W.J.; Chu, J.; Leng, L.; Miao, J.; Wang, L.F. Feature Disentanglement in One-Stage Object Detection. Pattern Recognit. 2024, 145, 109878. [Google Scholar] [CrossRef]
Dong, C.; Loy, C.C.; He, K.; Tang, X.O. Learning a Deep Convolutional Network for Image Super-Resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 184–199. [Google Scholar]
Kim, J.W.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE Press: Piscataway, NJ, USA, 2016; pp. 1646–1654. [Google Scholar]
Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE Press: Piscataway, NJ, USA, 2017; pp. 105–114. [Google Scholar]
Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE Press: Piscataway, NJ, USA, 2017; pp. 136–144. [Google Scholar]
Sun, L.; Dong, J.X.; Tang, J.H.; Pan, J. Spatially-Adaptive Feature Modulation for Efficient Image Super-Resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–8 October 2023; IEEE Press: Piscataway, NJ, USA, 2023; pp. 13144–13153. [Google Scholar]
Li, B.C.; Li, X.; Zhu, H.X.; Jin, Y.; Feng, R.; Zhang, Z.; Chen, Z. SeD: Semantic-Aware Discriminator for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–24 June 2024; IEEE Press: Piscataway, NJ, USA, 2024; pp. 25784–25795. [Google Scholar]
Zhang, X.L.; Liu, J.; Chen, C.C.; Gong, P.Z.; Wu, Z.D.; Guo, L. Modeling Temporal Continuity of Spatial Interactions for Vessel Trajectories Prediction in Maritime Transportation Systems. Eng. Appl. Artif. Intell. 2025, 158, 111378. [Google Scholar] [CrossRef]
Sun, Y.; Zheng, L.; Li, Y.; Yang, Y.; Tian, Q.; Wang, S. Learning Part-based Convolutional Features for Person Re-Identification. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 902–917. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE Press: Piscataway, NJ, USA, 2021; pp. 9992–10002. [Google Scholar]
Zheng, K.C.; Liu, W.; He, L.X.; Mei, T.; Luo, J.; Zha, Z.-J. Group-Aware Label Transfer for Domain Adaptive Person Re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE Press: Piscataway, NJ, USA, 2021; pp. 5306–5315. [Google Scholar]
Ren, K.J.; Zhang, L. Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–24 June 2024; IEEE Press: Piscataway, NJ, USA, 2024; pp. 393–402. [Google Scholar]
Wu, Z.; Yu, X.; Zhu, D.; Pang, Q.; Shen, S.; Ma, T.; Zheng, J. SR-DSFF and FENet-ReID: A Two-Stage Approach for Cross Resolution Person Re-identification. Comput. Intell. Neurosci. 2022, 2022, 4398727. [Google Scholar] [CrossRef] [PubMed]
Wang, Z.; Ye, M.; Yang, F.; Bai, X.; Satoh, S.I. Cascaded SR-GAN for scale-adaptive low resolution person re-identification. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; AAAI Press: Palo Alto, CA, USA, 2018; pp. 3891–3897. [Google Scholar]

Figure 1. An image captured by a low-cost camera.

Figure 2. An image taken from a long distance.

Figure 3. HAT network structure. Among them, solid arrows represent the transmission paths of data or features.

Figure 4. SOLIDER-REID network structure. Among them, solid arrows represent the transmission paths of data or features, dashed arrows represent the feedback path of clustering results, and the black areas denote regions used to occlude part of the image content for evaluating the model’s performance in semantic understanding or recognition tasks.

Figure 5. HAT-SOLIDER fusion model network structure. Among them, solid arrows represent the transmission paths of data or features, and dashed arrows represent the feedback path of clustering results.

Figure 6. Some images from the Market-1501 dataset. (a) Original images, (b) Downsampled images.

Figure 7. The original and SR images of the Market-1501 dataset.

Figure 8. The original and SR images of the LR test.

Figure 9. Heatmaps of original and SR images of the LR test. In the figures, deeper colors (e.g., red) indicate better performance.

Table 1. Summary of SR enhancement methods.

Category	Method Examples	Key Characteristics
Traditional Interpolation	Nearest Neighbor, Bilinear, Bicubic	Fast computation (O(W × H)) but causes blockiness or blurring; lacks semantic learning.
Deep Learning-based SR	SRCNN, VDSR, SRGAN, EDSR	Learns LR-to-HR mappings; improves PSNR/MOS but computationally costly and data-dependent.
Semantic-Guided Adaptive SR	SAFM, SeD, PiSA-SR	Uses semantic cues (e.g., pedestrian regions) to guide detail recovery; improves ReID relevance.

Table 2. Summary of ReID Methods.

Category	Supervision or Modality	Method Examples	Key Characteristics
Traditional	-	Color (RGB/HSV), Shape (HOG), Texture (Gabor)	Handcrafted features; computationally cheap but limited robustness in complex scenes.
Deep Learning	Supervised	PCB, Swin-ReID	Uses identity labels; employs part-based or transformer architectures for robustness.
Deep Learning	Unsupervised	GLT, SOLIDER-REID	Uses clustering for pseudo-labels; reduces annotation cost; adapts to LR via semantics.
Deep Learning	Single-Modal	-	Uses only visible light images.
Deep Learning	Cross-Modal	IDKL	Bridges modalities (e.g., visible infrared) for all-condition ReID.

Table 3. Summary of these paradigms.

Characteristic	Two-Stage Model (A + B)	End-to-End Model (A − B)
Structural Relation	Sequential, models independent	Parallel, module embedded
Gradient Flow	Interrupted, no backpropagation	Continuous, joint backpropagation
Optimization Target	Independent SR quality (PSNR/Structural Similarity Index (SSIM)) and ReID accuracy	Unified joint loss, SR for ReID
Information Flow	Image space (pixel reconstruction)	Feature space (feature enhancement)
Advantage	Simple implementation, modular	Avoids information loss, is task-driven, and superior performance

Table 4. PSNR and SSIM results of SR image enhancement.

Method	PSNR	SSIM
HAT	34.36	0.940
PiSA-SR	34.95	0.945
Omni-SR	34.12	0.900