Article

A Hybrid Multi-Scale Transformer-CNN UNet for Crowd Counting

Kai Zhao, Chunhao He, Shufan Peng and Tianliang Lu *
1 School of Information Network Security, People’s Public Security University of China, Beijing 100038, China
2 School of Cyber Security and Smart Policing, Zhengzhou Police University, Zhengzhou 450000, China
3 Research Center for Cybersecurity and Artificial Intelligence, People’s Public Security University of China, Beijing 100038, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(1), 333; https://doi.org/10.3390/s26010333
Submission received: 1 December 2025 / Revised: 29 December 2025 / Accepted: 1 January 2026 / Published: 4 January 2026
(This article belongs to the Section Sensing and Imaging)

Abstract

Crowd counting is a critical computer vision task with significant applications in public security and smart city systems. While deep learning has markedly improved accuracy, persistent challenges include extreme scale variations, severe occlusion, and complex background clutter. To address these issues, we propose a novel Hybrid Multi-Scale Transformer-CNN U-shaped Network (HMSTUNet). Our key contributions are: a hybrid architecture integrating a Multi-Scale Vision Transformer (MSViT) for capturing long-range dependencies and a Dynamic Convolutional Attention Block (DCAB) for modeling local density patterns; and a U-shaped encoder–decoder with skip connections for effective multi-level feature fusion. Extensive evaluations on five public benchmarks show that HMSTUNet achieves the best Mean Absolute Error (MAE) on all five datasets and the best Mean Squared Error (MSE) on three. It sets new state-of-the-art records, attaining MAE/MSE of 49.1/77.8 on SHA, 6.2/10.3 on SHB, 142.1/192.7 on UCF_CC_50, 77.9/132.5 on UCF-QNRF, and 43.2/119.6 on NWPU-Crowd. These results demonstrate the model’s strong robustness and generalization capability.

1. Introduction

Crowd counting has emerged as a prominent research focus in computer vision, aiming to accurately estimate the number of individuals in densely populated scenes. This capability plays a pivotal role in domains such as public safety [1] and smart city management [2]. For instance, in mass-transit hubs like airports and railway terminals, real-time crowd density estimation can deliver early warnings to prevent stampedes caused by overcrowding. From a broader urban perspective, these models generate high-resolution population-distribution maps, guiding resource allocation and informing smarter city planning and governance. However, crowd counting presents significant challenges, including substantial scale and density variations, severe occlusion, and complex backgrounds. To tackle these issues, diverse deep learning architectures have been developed, employing distinct strategies for multi-scale feature extraction and representation.
Deep learning models have substantially boosted crowd-counting accuracy; among them, Convolutional Neural Network (CNN) [3] and Transformer [4] architectures constitute the two dominant streams. Zhang et al. [5] proposed the Multi-column CNN (MCNN), which adapts to drastic scale variations caused by perspective or image resolution by equipping each column with filters of different receptive-field sizes; this design accepts images of arbitrary size, and the features learned by each column are inherently scale-adaptive. Li et al. [6] proposed a tandem extractor composed of a front-end standard CNN for 2D feature extraction and a back-end dilated CNN to enlarge the receptive field. Liu et al. [7] presented the Deep Structured Scale Integration (DSSI) Network that further alleviates scale variation through structured feature representation learning and hierarchically structured loss optimization. Nevertheless, CNNs rely on inductive biases such as locality and spatial invariance; although this yields fast inference, their limited receptive fields fail to capture long-range dependencies within the image.
Inspired by sequence modeling in natural language processing, researchers have introduced the Transformer architecture, renowned for its long-range dependency modeling, into computer vision, achieving remarkable performance across diverse visual tasks. Vision Transformers (ViTs) leverage global self-attention to capture dependencies between arbitrary image patches, thereby overcoming the limited receptive fields of CNNs. However, the global attention mechanism incurs quadratic computational complexity, while modeling interactions between all patches often introduces redundancy in mainstream vision tasks. To mitigate computational overhead and attention redundancy, recent studies incorporate CNN-inspired inductive biases and adopt local self-attention within restricted receptive fields, as exemplified by EdgeViT [8] and RepViT [9]. Nevertheless, such locality-driven integrations inevitably compromise global modeling capacity, diminishing the ability to encode long-range dependencies. Consequently, designing hybrid architectures that effectively balance the computational efficiency of CNNs with the global representational power of Transformers remains a critical research frontier.
To address persistent challenges in crowd counting, we propose a Hybrid Multi-Scale Transformer-CNN U-shaped Net (HMSTUNet). Our model synergistically integrates the complementary strengths of ViTs and CNNs to achieve enhanced accuracy in crowd counting. Our principal contributions are threefold:
1. We propose a novel U-shaped encoder–decoder architecture, HMSTUNet. The encoder leverages the ConvNeXt backbone to extract hierarchical multi-level features from dense crowd images. The decoder incorporates a hybrid multi-scale network that effectively combines ViT and CNN components to capture both local and global features across different scales and levels. This U-shaped design ensures effective preservation and propagation of semantically critical features.
2. We design a new dual-branch fusion network that integrates transformer and CNN pathways to significantly boost feature representation capacity and counting accuracy. The dual-branch structure comprises a Multi-Scale Vision Transformer (MSViT) block, which employs a multi-scale multi-head attention mechanism, and a Dynamic Convolutional Attention Block (DCAB) that integrates a multi-dimensional attention mechanism with dynamic convolution operations. Furthermore, a multi-scale pyramid decoder (DecoderFPN) is developed to efficiently aggregate and refine multi-level features.
3. We demonstrate the effectiveness of HMSTUNet through extensive experiments on five widely used benchmarks. Our model achieves the best Mean Absolute Error (MAE) on all five datasets and the best Mean Square Error (MSE) on three of them, demonstrating a clear and comprehensive superiority. These results across diverse scenes underscore the robustness and strong generalization capability of our approach.

2. Related Work

2.1. Model Design for Crowd Counting

In computer vision, the predominant approach to crowd counting is density map regression. This method employs deep neural networks to predict a continuous density map from an input image, with the total count obtained by summing up the pixel values over the entire map. However, this paradigm faces three core challenges: (1) significant scale and density variations, where individuals in the foreground are larger and more sparse, while those in the background are smaller and densely packed; (2) pervasive occlusion and overlap, which often obscure the full-body visual cues of most individuals; and (3) considerable background clutter and variability due to varying illumination conditions and complex environments.
To mitigate these challenges, researchers have explored incorporating multi-scale feature fusion, attention mechanisms, and Transformer architectures to enhance model robustness and counting accuracy. Multi-scale feature fusion methods effectively mitigate scale variation by integrating features from different hierarchical levels. Liu et al. [10] proposed an end-to-end architecture that leverages a spatial pyramid to fuse multi-scale features, thereby adaptively encoding contextual information at appropriate scales. Han et al. [11] developed a selective inheritance learning method, which identifies optimal-scale features and employs a progressive strategy to inherit critical characteristics, demonstrating remarkable scale generalization. Attention mechanisms address imbalanced person density by enabling the model to focus on salient spatial regions or channels. Jiang et al. [12] designed a dual-branch network consisting of a density-aware attention network and an attention scaling network; the fusion of their outputs generates a refined density map, effectively mitigating uneven density distribution. Lin et al. [13] proposed a multifaceted attention network that integrates global and local attention, achieving superior performance. The self-attention mechanism inherent in Transformers excels at dynamically capturing global dependencies, providing distinct advantages for modeling complex scenes with scale variations, severe occlusion, and cluttered backgrounds. Tian et al. [14] employed a pyramid ViT as a backbone to capture global features, complemented by multi-scale dilated convolutions for density map prediction. Qian et al. [15] proposed a multi-scale U-Transformer network to capture multi-level semantic and fine-grained spatial features, leading to improved counting accuracy. Based on the aforementioned research, we have developed a U-shaped network architecture that incorporates a DCAB to extract multi-scale attention features and effectively integrates multi-level representations, thereby enhancing the overall model performance.

2.2. Vision Transformer

ViTs have demonstrated remarkable success across a range of visual tasks. The underlying mechanism involves applying self-attention to image patches and leveraging large-scale datasets together with data augmentation strategies [16], achieving outstanding performance in tasks such as image classification, object detection, and semantic segmentation. To handle large-scale variations among visual entities and high-resolution input images, the hierarchical Swin Transformer architecture with shifted windows was proposed [17]. It limits self-attention computation to local windows to reduce computational cost and employs a cross-window shifting mechanism to model relationships between patches across different windows. Within the inherently “columnar” topology of ViTs, incorporating multi-scale information has been shown to effectively improve visual task accuracy. As a result, several studies have focused on integrating multi-scale features. For instance, Conformer [18] integrates parallel self-attention and convolutional branches to capture multi-scale representations. MPViT [19] employs multi-scale patch embeddings and cascaded transformer blocks for multi-scale feature extraction, while CrossFormer [20] adopts cross-scale embedding layers combined with long-short distance attention mechanisms to enhance cross-granular dependency modeling. Inspired by these developments, we propose a multi-scale ViT block, termed MSViT, which significantly improves multi-scale feature extraction in dense crowd images and consequently boosts the overall network performance.

2.3. Loss Function

In response to the significant challenges posed by highly non-uniform spatial distributions and severe scale variations in crowd scenes, the development of specialized loss functions has become crucial for achieving accurate counting performance. Lempitsky et al. [21] established the density-map paradigm by reformulating counting as a per-pixel density estimation task. Building on this, Wan et al. [22] developed an adaptive density-map generator that dynamically integrates multi-scale blurred density maps, resulting in substantial improvements in map fidelity while maintaining computational efficiency in end-to-end training. Wang et al. [23] subsequently proposed DM-Count, a distribution matching framework that minimizes the Wasserstein distance between predicted and ground-truth density distributions to enforce spatial consistency. Building on the DM-Count approach, we formulated its corresponding loss function and conducted experimental evaluations, which demonstrated superior performance.

3. Methods

3.1. Overall Architecture

Figure 1 illustrates the architectural design of HMSTUNet, which adopts an encoder–decoder U-Net structure tailored for crowd counting. The network is specifically designed to capture both high-level semantic information and low-level spatial details through enhanced skip connections. For the encoder, we employ ConvNeXt-Small, a state-of-the-art convolutional neural network recognized for its computational efficiency and superior performance in computer vision tasks. To handle the significant scale variations inherent in crowd scenes, we extract features from three distinct stages of ConvNeXt-Small. This multi-scale feature extraction strategy enables comprehensive representation learning across multiple hierarchical levels and resolutions, thereby enhancing the network’s capability for crowd counting. The decoder incorporates a sophisticated architectural innovation, consisting of two parallel blocks: a Multi-scale Vision Transformer (MSViT) block and a multi-dimensional feature extraction DCAB. This innovative network structure serves as the cornerstone for a multi-level feature fusion mechanism. Finally, the fused multi-level feature representations are transformed into a single-channel feature map, culminating in the generation of a density map that encapsulates both crowd count and spatial localization information. The proposed architecture represents a synergistic approach to crowd counting, integrating advanced neural network designs to address traditional challenges in complex, scale-variant crowd scenes.
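To make the data flow above concrete, the following PyTorch-style sketch traces the U-shaped arrangement with lightweight stand-in modules; the class names, channel widths, and the simple convolutional placeholders are illustrative assumptions, not the authors' implementation (in the paper the encoder is ConvNeXt-Small and the decoder blocks are the MSViT and DCAB modules of Sections 3.2 and 3.3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlaceholderBlock(nn.Module):
    """Stand-in for the decoder's hybrid MSViT + DCAB stages (Sections 3.2 and 3.3)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class HMSTUNetSketch(nn.Module):
    """Illustrative U-shaped data flow: three encoder stages (ConvNeXt-Small in the
    paper, simple strided convs here) feed the decoder through skip connections,
    and the fused features are reduced to a single-channel density map."""
    def __init__(self, chs=(96, 192, 384)):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, chs[0], 4, stride=4), nn.GELU())       # stride 4
        self.enc2 = nn.Sequential(nn.Conv2d(chs[0], chs[1], 2, stride=2), nn.GELU())  # stride 8
        self.enc3 = nn.Sequential(nn.Conv2d(chs[1], chs[2], 2, stride=2), nn.GELU())  # stride 16
        self.dec3 = PlaceholderBlock(chs[2], chs[1])
        self.dec2 = PlaceholderBlock(chs[1] * 2, chs[0])   # concatenated skip from enc2
        self.dec1 = PlaceholderBlock(chs[0] * 2, chs[0])   # concatenated skip from enc1
        self.head = nn.Conv2d(chs[0], 1, 1)                # single-channel density map

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        d3 = F.interpolate(self.dec3(f3), size=f2.shape[-2:], mode="bilinear", align_corners=False)
        d2 = F.interpolate(self.dec2(torch.cat([d3, f2], 1)), size=f1.shape[-2:],
                           mode="bilinear", align_corners=False)
        d1 = self.dec1(torch.cat([d2, f1], 1))
        return torch.relu(self.head(d1))                   # non-negative densities

density = HMSTUNetSketch()(torch.rand(1, 3, 256, 256))
estimated_count = density.sum().item()                     # count = sum over the map
```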

3.2. Multi-Scale Vision Transformer Block

To expand the receptive field while preserving spatial resolution, we propose the Multi-Scale Vision Transformer (MSViT) block. As illustrated in Figure 2, the block captures multi-scale contextual features by incorporating multiple dilation rates. The proposed architecture consists of $N$ sequentially connected MSViT blocks, each containing $s$ distinct dilation rates. By assigning different dilation rates to the self-attention heads, multi-scale features can be integrated within a single self-attention operation. This design not only facilitates effective interaction across scales but also significantly reduces computational redundancy in self-attention mechanisms, thereby lowering overall computational cost. Specifically, the input feature map is first projected from $C$ channels to $3C$ channels via a 1 × 1 convolutional layer, generating Query ($Q$), Key ($K$), and Value ($V$) matrices. Subsequently, Sliding Window Multi-Head Self-Attention (SWMSA) is performed under $s$ different dilation rates to enable multi-scale feature interaction. The resulting multi-scale features are then concatenated and fused. Finally, the channel-wise features at each spatial location are processed through a linear layer for transformation and fusion, followed by a 1 × 1 convolution that reduces the channel dimension by half to produce the output feature map. The calculation formulas for the MSViT block are as follows:
$Q_i, K_i, V_i = \mathrm{Conv}_{1\times 1}(X_{in}), \quad 1 \le i \le N_{head}$ (1)
$X_i = \mathrm{SWMSA}(Q_i, K_i, V_i, r_s)$ (2)
$X_{out} = \sigma_{ReLU}\big(\mathrm{BN}\big(\mathrm{Conv}_{1\times 1}(\mathrm{Linear}(\mathrm{Concat}[X_1, X_2, \dots, X_{N_{head}}]))\big)\big)$ (3)
In Equations (1)–(3), $X_{in} \in \mathbb{R}^{H \times W \times C}$ denotes the input feature map; $X_{out} \in \mathbb{R}^{H \times W \times \frac{C}{2}}$ denotes the output feature map; $\mathrm{SWMSA}(\cdot)$ denotes the Sliding Window Multi-Head Self-Attention operation; $Q_i, K_i, V_i \in \mathbb{R}^{H \times W \times C}$ denote the query, key, and value feature matrices, respectively; $X_i \in \mathbb{R}^{H \times W \times C}$ denotes the intermediate feature map; and $r_s$ denotes the set of dilation rates. In our implementation, the number of attention heads $N_{head}$ is set to 12, the dilation rates $r_s$ are [1, 2, 3], and the number of block repetitions $N$ is 2. This combination of parameters was experimentally verified to deliver the best model performance.
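A minimal sketch of this block is given below; it follows Equations (1)–(3), but replaces the exact SWMSA windowing with a simple unfold-based k × k dilated neighborhood attention, so the kernel size and the grouping of heads by dilation rate are assumptions for illustration rather than the paper's implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSViTBlockSketch(nn.Module):
    """Sketch of Eqs. (1)-(3): a 1x1 conv produces Q/K/V, the attention heads are
    split into groups that attend over a k x k neighborhood with different dilation
    rates, and the concatenated result is fused by a Linear layer plus a 1x1 conv
    that halves the channel dimension."""
    def __init__(self, channels, num_heads=12, dilations=(1, 2, 3), k=3):
        super().__init__()
        assert channels % num_heads == 0 and num_heads % len(dilations) == 0
        self.head_dim, self.k, self.dilations = channels // num_heads, k, dilations
        self.qkv = nn.Conv2d(channels, 3 * channels, 1)
        self.linear = nn.Linear(channels, channels)
        self.fuse = nn.Sequential(nn.Conv2d(channels, channels // 2, 1),
                                  nn.BatchNorm2d(channels // 2), nn.ReLU(inplace=True))

    def _local_attn(self, q, k, v, dilation):
        # q, k, v: (B, Cg, H, W); each pixel attends to its k x k dilated neighborhood
        B, Cg, H, W = q.shape
        heads, hd, ks = Cg // self.head_dim, self.head_dim, self.k
        pad = dilation * (ks // 2)
        unfold = lambda t: (F.unfold(t, ks, dilation=dilation, padding=pad)
                            .view(B, heads, hd, ks * ks, H * W))
        k_n, v_n = unfold(k), unfold(v)
        q = q.view(B, heads, hd, 1, H * W)
        attn = ((q * k_n).sum(dim=2) / math.sqrt(hd)).softmax(dim=2)  # over the k*k window
        out = (attn.unsqueeze(2) * v_n).sum(dim=3)                    # (B, heads, hd, H*W)
        return out.reshape(B, Cg, H, W)

    def forward(self, x):
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)                         # Eq. (1)
        g = C // len(self.dilations)                                  # channels per dilation group
        outs = [self._local_attn(q[:, i*g:(i+1)*g], k[:, i*g:(i+1)*g],
                                 v[:, i*g:(i+1)*g], r)                # Eq. (2), one rate per group
                for i, r in enumerate(self.dilations)]
        y = torch.cat(outs, dim=1)
        y = self.linear(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # per-pixel channel mixing
        return self.fuse(y)                                           # Eq. (3): C -> C/2

out = MSViTBlockSketch(384)(torch.rand(1, 384, 32, 32))               # -> (1, 192, 32, 32)
```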

3.3. Dynamic Convolutional Attention Block

Inspired by the compelling effectiveness of dynamic convolution, we propose a novel Dynamic Convolutional Attention Block (DCAB) designed to operate in parallel with the MSViT block, thereby synergistically enhancing feature representation. The core innovation of our DCAB lies in its integrated multi-dimensional attention mechanism and parallel processing strategy, which dynamically learns complementary attention weights across four key dimensions of the convolution kernel: spatial size, number of kernels, input channels, and output channels. This design enables effective multi-scale feature extraction and significantly strengthens feature representation capabilities.
As depicted in Figure 3, the input feature map $X_{in}$ first undergoes channel-wise attention enhancement. This step allows each channel to adaptively integrate global contextual information while aggregating features across all spatial positions. The resulting features are then processed by the Multi-dimensional Dynamic Convolution Attention (MDCA) block, which dynamically generates convolutional kernel weights conditioned on both the input data and the network’s learning state. This mechanism comprehensively captures feature interactions across the four kernel dimensions. By stacking multiple MDCA layers, the model can more effectively capture both dynamic and static feature variations in the scene, further improving its representational capacity. To optimize efficiency, a channel bottleneck mechanism is incorporated, where the number of channels is first reduced to one-quarter and subsequently restored to half of the original dimensionality, significantly lowering parameter count and computational cost. Residual connections are also employed to enhance generalization. Finally, a spatial attention block guides the model to focus on semantically salient regions in crowded images while suppressing irrelevant background noise, thereby enhancing robustness to occlusions and complex backgrounds. The computational flow of the proposed DCAB is defined by the following formulations:
$X_1 = X_{in} \otimes \sigma_{sigmoid}\big(\sigma_{ReLU}\big(\mathrm{Conv}_{1\times 1}(W \cdot \mathrm{AvgPool}(X_{in}))\big)\big)$ (4)
$X_3 = \mathrm{MDCA}(\mathrm{MDCA}(X_1) + X_1)$ (5)
$X_4 = \mathrm{MDCA}(\mathrm{MDCA}(X_3) + X_3)$ (6)
$X_{out} = X_4 \otimes \sigma_{sigmoid}\big(\mathrm{Conv}_{7\times 7}(\mathrm{Mean}_C(X_4))\big)$ (7)
The core MDCA operation is implemented as:
$X_2 = \mathrm{MDCA}(X_1) = X_1 * \sum_{k=1}^{n}\big(\alpha_{sk} \odot \alpha_{nk} \odot \alpha_{ik} \odot \alpha_{ok} \odot W_k\big)$ (8)
Here, $X_{in} \in \mathbb{R}^{H \times W \times C}$ and $X_{out} \in \mathbb{R}^{H \times W \times \frac{C}{2}}$ denote the input and output feature maps, respectively; $\mathrm{AvgPool}$ denotes average pooling; $W$ denotes the channel attention weight matrix; $\mathrm{Mean}_C$ represents the mean operation along the channel dimension; $X_1, X_2 \in \mathbb{R}^{H \times W \times C}$, $X_3 \in \mathbb{R}^{H \times W \times \frac{C}{4}}$, and $X_4 \in \mathbb{R}^{H \times W \times \frac{C}{2}}$ are intermediate feature maps; $\mathrm{MDCA}(\cdot)$ refers to the Multi-Dimensional Convolutional Attention block; $\alpha_{sk}$, $\alpha_{nk}$, $\alpha_{ik}$, and $\alpha_{ok}$ denote the attention weights along the spatial size, kernel number, input channel, and output channel dimensions, respectively; and $W_k$ represents the weights of the k-th convolutional kernel.
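The MDCA operation of Equation (8) can be sketched in an ODConv-style manner, where a pooled context vector yields the four attention factors that re-weight n candidate kernels before a per-sample convolution; the kernel count, reduction ratio, and sigmoid/softmax choices below are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDCASketch(nn.Module):
    """Sketch of the MDCA operation in Eq. (8): a pooled context vector yields four
    attention factors (spatial, kernel-number, input-channel, output-channel) that
    re-weight n candidate kernels, and the aggregated kernel is applied per sample."""
    def __init__(self, in_ch, out_ch, k=3, n_kernels=4, reduction=4):
        super().__init__()
        self.in_ch, self.out_ch, self.k, self.n = in_ch, out_ch, k, n_kernels
        self.weight = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, k, k) * 0.02)  # W_k
        hidden = max(in_ch // reduction, 8)
        self.ctx = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(in_ch, hidden, 1), nn.ReLU(inplace=True))
        self.attn_spatial = nn.Conv2d(hidden, k * k, 1)       # alpha_s
        self.attn_kernel = nn.Conv2d(hidden, n_kernels, 1)    # alpha_n
        self.attn_in = nn.Conv2d(hidden, in_ch, 1)            # alpha_i
        self.attn_out = nn.Conv2d(hidden, out_ch, 1)          # alpha_o

    def forward(self, x):
        B, C, H, W = x.shape
        c = self.ctx(x)
        a_s = torch.sigmoid(self.attn_spatial(c)).view(B, 1, 1, 1, self.k, self.k)
        a_n = torch.softmax(self.attn_kernel(c).view(B, self.n), 1).view(B, self.n, 1, 1, 1, 1)
        a_i = torch.sigmoid(self.attn_in(c)).view(B, 1, 1, self.in_ch, 1, 1)
        a_o = torch.sigmoid(self.attn_out(c)).view(B, 1, self.out_ch, 1, 1, 1)
        # aggregate the candidate kernels into one per-sample kernel (the sum in Eq. 8)
        w = (a_n * a_o * a_i * a_s * self.weight.unsqueeze(0)).sum(1)   # (B, out, in, k, k)
        # per-sample convolution via the grouped-convolution trick
        out = F.conv2d(x.reshape(1, B * C, H, W),
                       w.reshape(B * self.out_ch, self.in_ch, self.k, self.k),
                       padding=self.k // 2, groups=B)
        return out.reshape(B, self.out_ch, H, W)

y = MDCASketch(96, 48)(torch.rand(2, 96, 32, 32))   # e.g. a channel-reducing stage, as in the DCAB bottleneck
```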

3.4. Loss Function

Drawing inspiration from DM-Count, we propose a composite loss function that strategically combines three complementary components: a counting loss, an Optimal Transport (OT) loss, and a Variation (V) loss, as mathematically formulated in Equation (9):
$L = L_C + \lambda_1 L_{OT} + \lambda_2 L_V$ (9)
where $L$ represents the overall composite loss, $L_C$ denotes the counting loss, $L_{OT}$ corresponds to the optimal transport loss, and $L_V$ indicates the total variation loss. The hyperparameters $\lambda_1$ and $\lambda_2$, which control the relative contributions of the loss components, were both set to 0.1 in our experiments, a configuration that yielded the optimal performance.
The counting loss is designed to measure the absolute deviation between the predicted total count and the ground-truth total count, thereby enforcing global consistency in the overall cardinality estimation. It is formally defined as:
$L_C = \left| \, \|z\|_1 - \|\hat{z}\|_1 \, \right|$ (10)
where $z \in \mathbb{R}^n$ signifies the vectorized ground-truth density map, $\hat{z} \in \mathbb{R}^n$ represents the vectorized predicted density map, and $\|\cdot\|_1$ denotes the L1 norm of a vector, i.e., the summation of all pixel values in the density map.
The OT loss captures the discrepancy between the predicted density distribution and the ground-truth density distribution in the probability measure space. Specifically, we employ the Sinkhorn algorithm [24] to compute the optimal transport plan from the normalized predicted distribution to the normalized ground-truth distribution. The resulting optimal transport cost is then utilized as the loss value:
$L_{OT} = \left\langle \dfrac{\beta}{\|\hat{z}\|_1} - \dfrac{\langle \beta, \hat{z} \rangle}{\|\hat{z}\|_1^{2}}, \; \hat{z} \right\rangle$ (11)
where $\beta$ denotes the optimal dual variable derived from the Sinkhorn iterations, and $\langle \cdot, \cdot \rangle$ represents the vector inner product.
The Variation loss focuses on enhancing local structural similarity by comparing the intensity variations between adjacent pixels in the predicted and ground-truth density maps, thereby promoting spatial coherence. It is formulated as:
$L_V = \|z\|_1 \left\| \dfrac{z}{\|z\|_1} - \dfrac{\hat{z}}{\|\hat{z}\|_1} \right\|_1$ (12)
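The composite objective can be sketched as follows; the Sinkhorn routine is a generic entropic optimal-transport solver rather than the exact DM-Count implementation, and the flattened-density-map convention and helper names are assumptions made for illustration.

```python
import torch

def sinkhorn_ot_cost(a, b, cost, eps=0.05, iters=100):
    """Entropic-regularized OT cost between histograms a and b (each summing to 1)
    under a normalized cost matrix; a generic Sinkhorn sketch, not DM-Count's code."""
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):                           # alternating scaling updates
        u = a / (K @ v + 1e-8)
        v = b / (K.t() @ u + 1e-8)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)       # transport plan
    return (plan * cost).sum()

def hmstunet_loss(pred, gt, coords, lam1=0.1, lam2=0.1):
    """Composite loss of Eq. (9) on flattened density maps; `hmstunet_loss` and the
    flattening convention are hypothetical helpers built around the formulas above."""
    l_c = (pred.sum() - gt.sum()).abs()              # counting loss, Eq. (10)
    p = pred / (pred.sum() + 1e-8)                   # normalized predicted distribution
    g = gt / (gt.sum() + 1e-8)                       # normalized ground-truth distribution
    cost = torch.cdist(coords, coords) ** 2
    cost = cost / (cost.max() + 1e-8)                # normalize cost for numerical stability
    l_ot = sinkhorn_ot_cost(g, p, cost)              # OT loss as a transport cost, cf. Eq. (11)
    l_v = gt.sum() * (g - p).abs().sum()             # variation loss, Eq. (12)
    return l_c + lam1 * l_ot + lam2 * l_v

# toy usage on an 8 x 8 density map with one annotated head
H = W = 8
pred = torch.rand(H * W)
gt = torch.zeros(H * W); gt[10] = 1.0
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()
loss = hmstunet_loss(pred, gt, coords)
```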

4. Experiments

4.1. Datasets and Implementation Details

To comprehensively evaluate the performance of the proposed HMSTUNet, we conducted extensive experiments on five publicly available benchmark datasets and compared the results with state-of-the-art methods. The selected datasets—ShanghaiTech Part A and Part B [5], UCF_CC_50 [25], UCF-QNRF [26], and NWPU-Crowd [27]—vary considerably in scene complexity, crowd density distribution, and image resolution, as summarized in Table 1.
ShanghaiTech comprises 1198 annotated images with a total of 330,165 annotated instances, divided into two subsets: Part A (SHA) and Part B (SHB). SHA includes 482 images (300 for training and 182 for testing) collected from the Internet, while SHB consists of 716 images (400 for training and 316 for testing) obtained from real surveillance footage. Given its diverse scenes and highly imbalanced density distributions, this dataset represents a challenging benchmark in crowd counting.
UCF_CC_50 is a small yet highly variable dataset containing only 50 images with 63,075 annotated instances, where the number of people per image ranges from 94 to 4543. Encompassing various complex scenes and perspectives, this dataset presents significant challenges. Due to the limited data size, we followed the 5-fold cross-validation strategy recommended by the creators.
UCF-QNRF contains 1535 annotated images with a total of 1,251,642 annotated instances, split into 1201 training images and 334 test images. This dataset is characterized by dramatic variations in both crowd density and image resolution across diverse scenes, posing a rigorous test of model generalization capability.
NWPU-Crowd is currently the largest dataset, comprising 5109 images with 2,133,375 annotated instances. It is partitioned into training (3109 images), validation (500 images), and test (1500 images) sets. This dataset covers a wide spectrum of illumination conditions, viewing angles, and scene categories, and includes 351 negative samples (scenes without crowds). As the ground-truth annotations for the test set are not publicly released, we utilize the validation set for performance evaluation.
The encoder of HMSTUNet was initialized with the official ConvNeXt-Small model pretrained on ImageNet-1k. For data augmentation, we employed only random cropping and horizontal flipping. The crop sizes were set to 256 for SHA, 512 for SHB, UCF_CC_50, and UCF-QNRF, and 384 for NWPU-Crowd. All models were trained using the AdamW optimizer with an initial learning rate of 1 × 10−5. A batch size of 8 was used for the ShanghaiTech (SHA and SHB), UCF_CC_50, and NWPU-Crowd datasets, whereas a batch size of 32 was applied to the UCF-QNRF dataset. L2 regularization was incorporated with a coefficient of 0.005 for the ShanghaiTech datasets and 0.0001 for the UCF_CC_50, UCF-QNRF, and NWPU-Crowd datasets. All evaluations were performed on a single NVIDIA A100 GPU.
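As an illustration, the SHA configuration described above maps onto roughly the following setup; the model here is a trivial stand-in, and AdamW's decoupled weight decay is used to realize the stated L2 coefficient, which is an approximation of plain L2 regularization.

```python
import torch
from torch import nn
from torch.optim import AdamW

# Stand-in network; in practice this would be the full HMSTUNet model.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=5e-3)  # SHA: lr 1e-5, L2 0.005

crop_size, batch_size = 256, 8                       # SHA crop and batch settings
images = torch.rand(batch_size, 3, crop_size, crop_size)
if torch.rand(1).item() < 0.5:                       # random horizontal flip
    images = torch.flip(images, dims=[-1])

pred_density = model(images)
target = torch.rand_like(pred_density)               # placeholder ground-truth density maps
loss = nn.functional.mse_loss(pred_density, target)  # stand-in for the Eq. (9) objective
loss.backward()
optimizer.step()
optimizer.zero_grad()
```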

4.2. Evaluation Metrics

In crowd counting tasks, Mean Absolute Error (MAE) and Mean Square Error (MSE) are employed as standard evaluation metrics to quantify model performance. MAE measures the average absolute deviation between predicted counts and ground-truth annotations, providing an intuitive indicator of prediction accuracy. MSE, defined as the square root of the average squared errors, imposes a higher penalty on large discrepancies and thus effectively assesses model robustness against outliers and noisy samples. The mathematical formulations are expressed as:
$MAE = \dfrac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$ (13)
$MSE = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$ (14)
where $N$ denotes the total number of test images, $y_i$ represents the ground-truth person count, and $\hat{y}_i$ signifies the predicted count for the $i$-th image.
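Given per-image predicted and ground-truth counts, both metrics reduce to a few lines, as in the short helper below; note that the MSE used in this literature (and in Equation (14)) is the root of the mean squared error.

```python
import math

def mae_and_mse(pred_counts, gt_counts):
    """MAE and MSE over per-image counts, following Eqs. (13) and (14)."""
    n = len(pred_counts)
    mae = sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / n
    mse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / n)
    return mae, mse

print(mae_and_mse([105.0, 48.0], [100.0, 50.0]))    # -> (3.5, ~3.81)
```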

4.3. Comparisons with State-of-the-Art Methods

In this section, we evaluate the performance of HMSTUNet by comparing it with 17 mainstream methods on five public crowd-counting datasets: SHA, SHB, UCF_CC_50, UCF-QNRF, and NWPU-Crowd. The comprehensive results are summarized in Table 2.
On the SHA dataset, our HMSTUNet achieved the best performance with an MAE of 49.1 and an MSE of 77.8, reducing the MAE by 0.2 and the MSE by 1.0 compared to the second-best method, PET [28]. For the SHB dataset, our model attained an MAE of 6.2 and an MSE of 10.3, matching the optimal MAE performance of PET and surpassing FGENet [29] and P2PNet [30] by 0.1 in MAE. In terms of MSE on SHB, HMSTUNet remained highly competitive, trailing the best result (PET) by only 0.5. On the UCF_CC_50 dataset, HMSTUNet achieved an MAE of 142.1 and an MSE of 192.7, reducing the MAE and MSE of the runner-up FGENet by 0.5 and 23.2, respectively, thus setting a new state-of-the-art. Similarly, on UCF-QNRF, our method achieved leading results with an MAE of 77.9 and an MSE of 132.5, surpassing the second-best method FGENet by 1.6 in MAE and 11.8 in MSE. On the NWPU-Crowd dataset, HMSTUNet attained an MAE of 43.2 and an MSE of 119.6, achieving the best MAE which is 8.2 lower than that of FIDTM [31], while its MSE ranked second, trailing FIDTM by 12.0.
Overall, HMSTUNet achieved the best MAE on all five datasets, demonstrating superior prediction accuracy. Regarding MSE, it attained the best performance on three datasets and remained highly competitive on the remaining two, evidencing robustness against noise and outliers.
Table 2. Performance comparison of different methods on SHA, SHB, UCF_CC_50, UCF-QNRF and NWPU-Crowd datasets. The best performance is in boldface, and the second best is underlined.
Method | Venue | Params (M) | SHA (MAE/MSE) | SHB (MAE/MSE) | UCF_CC_50 (MAE/MSE) | UCF-QNRF (MAE/MSE) | NWPU-Crowd (MAE/MSE)
MCNN [5] | CVPR'16 | 0.13 | 110.2/173.2 | 26.4/41.3 | 377.6/509.1 | 277.0/426.0 | 218.5/700.6
CSRNet [6] | CVPR'18 | 16.3 | 68.2/115.0 | 10.6/16.0 | 266.1/397.5 | 120.3/208.0 | 104.9/433.5
CANNet [10] | CVPR'19 | 18.1 | 62.3/100.0 | 7.8/12.2 | 212.2/243.7 | 107.0/183.0 | 93.6/489.9
SFCN+ [32] | CVPR'19 | 38.6 | 64.8/107.5 | 7.6/13.0 | 214.2/318.2 | 102.0/171.4 | 95.5/608.3
BL [33] | CVPR'19 | 21.5 | 62.8/101.8 | 7.7/12.7 | 229.3/308.2 | 88.7/154.8 | 93.6/470.4
AMRNet [34] | ECCV'20 | 59.3 | 61.6/98.4 | 7.0/11.0 | 184.0/265.8 | 86.6/152.2 | -
DM-Count [23] | NeurIPS'20 | 21.5 | 59.7/95.7 | 7.4/11.8 | 211.0/291.5 | 85.6/148.3 | 88.4/388.6
LibraNet [35] | ECCV'20 | 17.9 | 55.9/97.1 | 7.3/11.3 | 181.2/262.2 | 88.1/143.7 | -
Semi [36] | ICCV'21 | 16.7 | 66.9/125.6 | 12.3/17.9 | - | 130.3/226.3 | 105.8/445.3
P2PNet [30] | ICCV'21 | 19.2 | 52.7/85.1 | 6.3/9.9 | 172.7/256.2 | 85.3/154.5 | 77.4/362.0
CLTR [37] | ECCV'22 | 41.0 | 56.9/95.2 | 6.5/10.6 | - | 85.8/141.3 | 61.9/246.3
TransCrowd [38] | SCIS'22 | 90.4 | 66.1/105.1 | 9.3/16.1 | - | 97.2/168.5 | 88.4/400.5
FIDTM [31] | TMM'22 | 66.6 | 57.0/103.4 | 6.9/11.8 | 171.4/233.1 | 89.0/153.6 | 51.4/107.6
DMCNet [39] | WACV'23 | - | 58.5/84.6 | 8.6/13.7 | - | - | 96.5/164.0
PET [28] | ICCV'23 | 51.7 | 49.3/78.8 | 6.2/9.7 | - | 79.5/144.3 | 74.4/328.5
FGENet [29] | MMM'24 | - | 51.6/85.0 | 6.3/10.5 | 142.6/215.9 | 85.2/158.8 | -
VMambaCC [40] | arXiv'24 | - | 51.9/81.3 | 7.5/12.5 | - | - | 88.4/144.7
MobileCount [41] | Neurocomputing'20 | 3.4 | 89.4/146.0 | 9.0/15.4 | 284.8/392.8 | 131.1/222.6 | -
HMSTUNet (ours) | - | 61.9 | 49.1/77.8 | 6.2/10.3 | 142.1/192.7 | 77.9/132.5 | 43.2/119.6

4.4. Ablation Study

4.4.1. Component-Wise Ablation Study

As illustrated in Figure 1, the proposed HMSTUNet model is primarily composed of an encoder module and a decoder module. The encoder employs a pre-trained ConvNeXt-Small network as its backbone, while the decoder incorporates a hybrid feature extraction network combining MSViT and DCAB, which serves as the key component of our architectural design.
To validate the effectiveness of individual components, we conducted comprehensive ablation studies using a baseline model without the hybrid network on both the SHA and NWPU-Crowd datasets. The experimental results, summarized in Table 3, clearly demonstrate the contributions of each component:
1. The gradual incorporation of the combined MSViT and DCAB into successive decoder stages (S1 to S3) yields consistent and substantial performance gains across both datasets. For instance, on the SHA dataset, MAE decreases from 72.2 (Baseline) to 54.6 (Baseline + S3), confirming the effectiveness of local integration.
2. Isolating each component reveals their distinct strengths. The MSViT block excels at modeling global context and long-range dependencies, which is particularly beneficial for the large-scale and diverse NWPU-Crowd dataset, reducing MAE from 75.3 to 46.9. The DCAB focuses on adaptive local feature refinement, showing strong performance on both datasets.
3. The full integration of both MSViT and DCAB achieves the best results, outperforming all other configurations. This demonstrates their complementary nature: MSViT establishes a coherent global understanding, while DCAB performs precise local adjustment, a combination crucial for handling scale variation and congestion.

4.4.2. Loss Function Ablation Study

As shown in Equation (9), the loss function comprises three terms: the counting loss ($L_C$), the optimal transport loss ($L_{OT}$), and the variation loss ($L_V$). By tuning the weight coefficients $\lambda_1$ and $\lambda_2$, we systematically evaluated the contribution of each loss component to model performance, with results summarized in Table 4. The experiments demonstrate that the model achieves optimal performance when both $\lambda_1$ and $\lambda_2$ are set to 0.1. This ablation analysis quantitatively validates the central role of the counting loss in crowd counting tasks, while revealing that the optimal transport loss enhances macro-level counting accuracy and distribution matching, and the variation loss improves the fine-grained local fidelity and spatial smoothness of density maps.

4.5. Visualization

Figure 4 and Figure 5 depict the visualization results of the proposed HMSTUNet model on the SHA and NWPU-Crowd datasets, respectively. These results encompass a variety of complex scenarios, including diverse image scales, crowd densities, and illumination conditions, validating the model’s robustness across different challenging environments. The findings demonstrate that the proposed model is capable of generating high-quality density maps and maintaining accurate crowd counting performance in various complex scenes.
Figure 6 presents the visualization results of HMSTUNet under challenging conditions, such as low illumination and severe occlusion. The model exhibits degraded prediction performance under these conditions. Specifically, for the first sample set, the discrepancy between the predicted and ground-truth counts is 64.09, which remains within an acceptable range. In contrast, the prediction errors rise to 186.38 and 93 for the second and third sample sets, respectively, indicating that the model’s performance in such extreme cases still requires substantial improvement.

5. Conclusions

To address the challenges of severe occlusion, large scale variations, and non-uniform distribution in dense crowd counting, this paper proposes HMSTUNet, a hybrid multi-scale Transformer-CNN U-shaped encoder–decoder network. The model adopts a U-shaped architecture to extract multi-level image features, where the encoder is built upon a pre-trained ConvNeXt backbone and the decoder incorporates a hybrid multi-scale design combining Transformer and CNN components to effectively capture both local and global contextual information. Specifically, to address severe occlusion and overlapping in crowd images, we design an MSViT block for effectively modeling long-range dependencies. To handle significant scale changes, we propose a DCAB that enhances the model’s capacity for capturing complex density patterns. Furthermore, a multi-scale pyramid DecoderFPN block is developed to aggregate feature information with larger receptive fields, while skip connections are incorporated to fuse deep semantic information with shallow spatial details.
Extensive experiments on five public crowd counting benchmarks demonstrate that HMSTUNet achieves state-of-the-art performance. It obtains the best MAE on all five datasets and the best MSE on three of them, indicating strong robustness and generalization ability. Although HMSTUNet provides an effective solution for dense crowd counting, its performance under challenging conditions such as poor illumination and extreme occlusion remains to be further improved. Future work will focus on two key directions to advance this line of research. First, we will conduct a deeper investigation into the multi-scale Vision Transformer and dynamic convolution frameworks to improve the model’s robustness and generalization in complex scenarios. Second, we will delve into the interpretability of the proposed architecture and extend the design principles of the Multi-Scale Transformer-CNN UNet to related tasks, such as person re-identification, to validate its transferability and broader applicability.

Author Contributions

K.Z.: manuscript writing, methodology, and algorithm development; C.H.: validation and data analysis; S.P.: data curation; T.L.: research design. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Beijing Social Science Fund (24FXC017), and partially by the Key Research Projects of Higher Education Institutions in Henan Province (26B520057).

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, lutianliang@ppsuc.edu.cn, upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Saleh, S.A.M.; Suandi, S.A.; Ibrahim, H. Recent survey on crowd density estimation and counting for visual surveillance. Eng. Appl. Artif. Intell. 2015, 41, 103–114.
2. Luo, A.; Yang, F.; Li, X.; Nie, D.; Jiao, Z.; Zhou, S.; Cheng, H. Hybrid graph neural networks for crowd counting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11693–11700.
3. Zhang, C.; Li, H.; Wang, X.; Yang, X. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 833–841.
4. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
5. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597.
6. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1091–1100.
7. Liu, L.; Qiu, Z.; Li, G.; Liu, S.; Ouyang, W.; Lin, L. Crowd counting with deep structured scale integration network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1774–1783.
8. Chen, Z.; Zhong, F.; Luo, Q.; Zhang, X.; Zheng, Y. EdgeViT: Efficient visual modeling for edge computing. In Proceedings of the International Conference on Wireless Algorithms, Systems, and Applications, Dalian, China, 24–26 November 2022; pp. 393–405.
9. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting mobile CNN from ViT perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15909–15920.
10. Liu, W.; Salzmann, M.; Fua, P. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5099–5108.
11. Han, T.; Bai, L.; Liu, L.; Ouyang, W. Steerer: Resolving scale variations for counting and localization via selective inheritance learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 21848–21859.
12. Jiang, X.; Zhang, L.; Xu, M.; Zhang, T.; Lv, P.; Zhou, B.; Yang, X.; Pang, Y. Attention scaling for crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4706–4715.
13. Lin, H.; Ma, Z.; Ji, R.; Wang, Y.; Hong, X. Boosting crowd counting via multifaceted attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19628–19637.
14. Tian, Y.; Chu, X.; Wang, H. CCTrans: Simplifying and improving crowd counting with transformer. arXiv 2021, arXiv:2109.14483.
15. Qian, Y.; Zhang, L.; Hong, X.; Donovan, C.; Arandjelovic, O.; Fife, U.; Harbin, P. Segmentation Assisted U-shaped Multi-scale Transformer for Crowd Counting. In Proceedings of the British Machine Vision Conference, London, UK, 21–24 November 2022.
16. Zhang, Z.; Zhang, H.; Zhao, L.; Chen, T.; Arik, S.Ö.; Pfister, T. Nested hierarchical transformer: Towards accurate, data-efficient and interpretable visual understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 3417–3425.
17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
18. Peng, Z.; Huang, W.; Gu, S.; Xie, L.; Wang, Y.; Jiao, J.; Ye, Q. Conformer: Local features coupling global representations for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 367–376.
19. Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. MPViT: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7277–7286.
20. Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3123–3136.
21. Lempitsky, V.; Zisserman, A. Learning to count objects in images. Adv. Neur. Inf. 2010, 23, 1324–1332.
22. Wan, J.; Chan, A. Adaptive density map generation for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1130–1139.
23. Wang, B.; Liu, H.; Samaras, D.; Nguyen, M.H. Distribution matching for crowd counting. Adv. Neur. Inf. 2020, 33, 1595–1607.
24. Peyré, G.; Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends Mach. Learn. 2019, 11, 355–607.
25. Idrees, H.; Saleemi, I.; Seibert, C.; Shah, M. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2547–2554.
26. Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 532–546.
27. Wang, Q.; Gao, J.; Lin, W.; Li, X. NWPU-crowd: A large-scale benchmark for crowd counting and localization. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2141–2149.
28. Liu, C.; Lu, H.; Cao, Z.; Liu, T. Point-query quadtree for crowd counting, localization, and more. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1676–1685.
29. Ma, H.Y.; Zhang, L.; Wei, X.Y. FGENet: Fine-grained extraction network for congested crowd counting. In Proceedings of the International Conference on Multimedia Modeling, Amsterdam, The Netherlands, 29 January–1 February 2024; pp. 43–56.
30. Song, Q.; Wang, C.; Jiang, Z.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Wu, Y. Rethinking counting and localization in crowds: A purely point-based framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3345–3354.
31. Liang, D.; Xu, W.; Zhu, Y.; Zhou, Y. Focal inverse distance transform maps for crowd localization. IEEE Trans. Multimed. 2023, 25, 6040–6052.
32. Wang, Q.; Gao, J.; Lin, W.; Yuan, Y. Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8198–8207.
33. Ma, Z.; Wei, X.; Hong, X.; Gong, Y. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6141–6150.
34. Liu, X.; Yang, J.; Ding, W.; Wang, T.; Wang, Z.; Xiong, J. Adaptive mixture regression network with local counting map for crowd counting. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 241–257.
35. Liu, L.; Lu, H.; Zou, H.; Xiong, H.; Cao, Z.; Shen, C. Weighing counts: Sequential crowd counting by reinforcement learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 164–181.
36. Meng, Y.; Zhang, H.; Zhao, Y.; Yang, X.; Qian, X.; Huang, X.; Zheng, Y. Spatial uncertainty-aware semi-supervised crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15529–15539.
37. Liang, D.; Xu, W.; Bai, X. An end-to-end transformer model for crowd localization. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 38–54.
38. Liang, D.; Chen, X.; Xu, W.; Zhou, Y.; Bai, X. TransCrowd: Weakly-supervised crowd counting with transformers. Sci. China Inf. Sci. 2022, 65, 160104.
39. Wang, M.; Cai, H.; Dai, Y.; Gong, M. Dynamic mixture of counter network for location-agnostic crowd counting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 167–177.
40. Ma, H.Y.; Zhang, L.; Shi, S. VMambaCC: A visual state space model for crowd counting. arXiv 2024, arXiv:2405.03978.
41. Wang, P.; Gao, C.; Wang, Y.; Li, H.; Gao, Y. MobileCount: An efficient encoder-decoder framework for real-time crowd counting. Neurocomputing 2020, 407, 292–299.
Figure 1. The overall architecture of the HMSTUNet.
Figure 2. Illustration of MSViT Block.
Figure 3. Illustration of DCAB.
Figure 4. Visualization results on the SHA dataset.
Figure 5. Visualization results on the NWPU-Crowd dataset.
Figure 6. The visualization results under challenging conditions.
Table 1. Statistics of crowd counting datasets.
Dataset | Number of Images | Training/Validation/Test | Total Count | Max Count | Min Count | Avg. Resolution
SHA | 482 | 300/-/182 | 241,677 | 3139 | 33 | 589 × 868
SHB | 716 | 400/-/316 | 88,488 | 578 | 9 | 768 × 1024
UCF_CC_50 | 50 | - | 63,974 | 4543 | 94 | 2101 × 2888
UCF-QNRF | 1535 | 1201/-/334 | 1,251,642 | 12,865 | 49 | 2013 × 2902
NWPU-Crowd | 5109 | 3109/500/1500 | 2,133,375 | 20,033 | 0 | 2191 × 3209
Table 3. Component-wise ablation study on HMSTUNet. The best performance is in boldface.
Method | SHA (MAE/MSE) | NWPU-Crowd (MAE/MSE)
Baseline | 72.2/129.3 | 80.3/245.2
Baseline + S1 (MSViT + DCAB) | 58.4/90.7 | 57.9/182.4
Baseline + S2 (MSViT + DCAB) | 56.2/87.9 | 55.6/180.7
Baseline + S3 (MSViT + DCAB) | 54.6/86.8 | 53.4/178.3
Baseline + MSViT | 52.1/83.6 | 46.9/138.4
Baseline + DCAB | 51.8/83.4 | 45.1/137.5
Baseline + MSViT + DCAB | 49.1/77.8 | 43.2/119.6
Table 4. Loss function ablation study. The best performance is in boldface.
Loss Function | MAE (SHA) | MSE (SHA)
$L_C$ | 56.2 | 95.9
$L_C + L_{OT} + L_V$ | 53.8 | 89.6
$L_C + \lambda_1 L_{OT} + \lambda_2 L_V$ ($\lambda_1 = 0.1$, $\lambda_2 = 0.01$) | 52.7 | 87.1
$L_C + \lambda_1 L_{OT} + \lambda_2 L_V$ ($\lambda_1 = 0.1$, $\lambda_2 = 0.1$) | 49.1 | 77.8

