Article

Enhanced UAV-Dot for UAV Crowd Localization: Adaptive Gaussian Heat Map and Attention Mechanism to Address Scale/Low-Light Challenges

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(12), 833; https://doi.org/10.3390/drones9120833
Submission received: 16 October 2025 / Revised: 24 November 2025 / Accepted: 28 November 2025 / Published: 1 December 2025

Highlights

What are the main findings?
  • The proposed scale-adaptive Gaussian kernel dynamically adjusts heatmap supervision, effectively resolving target merging and fragmentation caused by UAV altitude variations, contributing to a 1.62% gain in L-mAP.
  • The integrated CBAM attention module enhances feature extraction under low-light conditions via a “channel–spatial” focusing mechanism, improving feature discriminability and yielding a 0.92% L-mAP increase.
What is the implication of the main finding?
  • The enhanced UAV-Dot achieves a state-of-the-art 53.38% L-mAP on DroneCrowd with minimal overhead—parameters increase by only 0.36% (training) and 0.29% (testing), reconciling high accuracy with model efficiency for UAV deployment.
  • The synergistic combination of adaptive heatmaps and attention mechanisms establishes a new architectural paradigm for addressing the coupled challenges of scale variation and low-light degradation in aerial crowd localization.

Abstract

In public safety scenarios, such as large-scale event security and urban crowd management, unmanned aerial vehicles (UAVs) serve as a vital tool for crowd localization, offering high mobility and broad coverage. However, UAV-based overhead localization faces challenges, including significant target scale variations due to altitude changes and poor feature visibility in low-light conditions. To overcome these issues, this study enhances the UAV-Dot framework by introducing a scale prediction branch for adaptive Gaussian heatmap adjustment, embedding a CBAM attention module in the U-Net encoder to strengthen feature extraction in dim environments and optimizing post-processing via dynamic thresholding and DBSCAN clustering. Experiments on the DroneCrowd dataset show that the improved model increases parameters by only 0.36% during training and 0.29% during testing yet achieves 53.38% L-mAP—outperforming the original UAV-Dot by 2.38% and STNNet by 12.93%. The model also delivers consistent gains of approximately 2% in L-AP@10, L-AP@15, and L-AP@20.

1. Introduction

With the rapid development of global tourism and the increasing frequency of large-scale gatherings, the risk of crowd stampede accidents has increased significantly [1,2,3]. In this context, effective public safety management has evolved beyond mere crowd counting to require precise crowd localization, which is critical for proactive risk assessment and real-time incident response. While traditional fixed surveillance systems are often hampered by blind spots and static viewpoints, unmanned aerial vehicles (UAVs) offer a dynamic and comprehensive solution with their high mobility, broad coverage, and unique aerial perspective [4,5,6].
However, the transition from fixed surveillance to UAV-based crowd localization introduces a set of distinct and formidable challenges. The top-down perspective of drones fundamentally differs from fixed cameras [7]. In UAV imagery, human targets (e.g., heads) are characterized by extremely low pixel ratios and highly abstract features. This intrinsic difficulty is exacerbated by two primary factors:
  • Drastic image scale variations: changes in UAV flight altitude cause the size of human targets to vary significantly within and across images.
  • Severe feature degradation in low light: under poor illumination conditions (e.g., at night or in heavily overcast scenes), the already weak features of small targets are easily submerged by complex background noise, drastically reducing localization accuracy.
Several frameworks have been proposed to address UAV-based crowd analysis. The space–time neighbor-aware network (STNNet) [8] pioneered an integrated system for localization, tracking, and counting. Multi-frame attention with feature-level warping (MFA) [9] leveraged temporal motion cues from video sequences to enhance target distinction. More recently, the UAV-based Dot localization network (UAV-Dot) [10] directly tackled the issue of small-target feature loss by introducing a pixel distillation module to preserve high-resolution information, establishing a strong baseline for the task.
Despite the aforementioned advancements, our analysis identifies the following issues in UAV-based crowd localization: existing methods universally employ a fixed-parameter Gaussian kernel for heatmap generation, which cannot adapt to the dynamic scale variations inherent to UAV operations. Furthermore, there is a lack of specialized architectural components designed for enhancing low-light characteristics, resulting in significant performance barriers for all-weather deployment. To bridge these gaps, we introduce an enhanced UAV-Dot framework that systematically alleviates the challenges of scale adaptation and illumination.
We selected UAV-Dot as our baseline due to its pixel distillation module, which effectively preserves fine-grained spatial information and consequently enables robust handling of small-sized targets. This establishes a superior foundation for implementing our scale and illumination adaptations. Building upon this framework, our work makes the following key contributions to the network’s heatmap supervision, feature extraction, and post-processing:
  • We propose a scale-aware Gaussian heatmap generation mechanism that dynamically adjusts kernel sizes based on a predicted scale factor, effectively resolving the target merging and fragmentation issues caused by altitude variations.
  • We embed a convolutional block attention module (CBAM) [11] into the U-Net encoder, creating a “channel–spatial” focusing mechanism that enhances feature discriminability in low-light conditions.
  • We adopt an optimized post-processing strategy incorporating dynamic thresholding and DBSCAN clustering to correct residual misdetections, particularly for large-scale targets with complex poses.
To evaluate the effectiveness of our proposed method, we conducted comprehensive experiments on the DroneCrowd dataset. The remainder of this paper is organized as follows: Section 2 provides a review of related work. Section 3 presents a detailed description of our enhanced UAV-Dot framework, focusing on its architectural design and critical innovations. Comprehensive experimental results and analyses are reported in Section 4, followed by an in-depth discussion in Section 5 and conclusions along with future research directions in Section 6.

2. Related Works

2.1. From Count to Point Regression

Research in crowd analysis has evolved from density-based counting towards precise point localization. Early studies such as dilated convolutional neural networks (CSRNet) [12] primarily employed dilated convolutional networks to estimate crowd density. However, with the growing demand for fine-grained crowd management, accurate individual localization has become crucial. This shift has driven the development of numerous new methods that go beyond mere counting. Pseudo-bounding box methods, such as a point-supervised deep detection network (PSDDN) [13] and locate, size, and count convolutional neural network (LSC-CNN) [14], emerged to bridge the gap between counting and localization. These approaches generate pseudo-bounding boxes from point annotations based on heuristic rules (e.g., fixed size or density-aware scaling), enabling the use of standard object detectors. As a more direct alternative, point regression methods, such as a purely point-based framework (P2PNet) [15], the end-to-end transformer model (CLTR) [16], point-query quadtree (PET) [17], and auxiliary point guidance crowd counting (APGCC) [18], entirely bypass bounding boxes. These can be categorized into point-matching-based strategies (P2PNet, APGCC) and transformer-based point query methods (CLTR, PET).
Among point-matching methods, the initial approach employed direct point-to-point matching (P2PNet). Subsequently, to address the instability of direct point regression and improve localization accuracy, auxiliary positive and negative points were introduced for matching (APGCC). In contrast, transformer-based point query methods initially utilized a fixed set of learnable queries (CLTR). This design, however, suffered from a fundamental mismatch between the fixed number of queries and the dynamically varying number of targets, from sparse features in the UAV perspective that are difficult to query, and from unstable matching under the coupled challenges of scale variation and feature abstraction. The subsequently proposed PET model effectively resolved the fixed-query problem by dynamically generating query points and significantly mitigated matching instability; nevertheless, the challenge of sparse feature degradation remains to be fully addressed. In contrast to these query-based matching paradigms, our work strengthens the heatmap regression foundation by enhancing feature representation and adapting the supervision signal, which proves more robust to the sparse, abstract targets and dynamic scales in UAV views.
Although these methods perform well in fixed-view scenarios where targets have clear appearances and sufficient pixel coverage, they face significant challenges when applied to UAV imagery. The heuristic rules relied upon by pseudo-bounding box methods struggle to adapt to the inherent severe scale variations and irregular distribution characteristics in UAV images [19]. If the bounding boxes are set too small, target features may be incompletely captured (e.g., alternating between the head and torso), compromising training stability, whereas excessively large boxes introduce excessive background noise, diluting the effectiveness of key features. Similarly, in UAV perspectives, point regression methods become highly unstable due to targets (e.g., human heads) being represented as abstract features with extremely low pixel ratios. When further compounded by low-light conditions, the features of annotated points are easily overwhelmed by background noise, leading to frequent localization deviations or even complete detection failure. These factors collectively limit the applicability of pseudo-bounding box and point regression methods in UAV-based crowd localization tasks.

2.2. Heatmap Regression Paradigm

In contrast to the aforementioned paradigms, heatmap-based methods have become the dominant approach for crowd localization, particularly in challenging scenarios like UAV imagery. This paradigm converts discrete point annotations into continuous probability distributions (e.g., using 2D Gaussian kernels [20,21]), which provides richer gradient signals during training and demonstrates inherent robustness to feature sparsity, making it particularly suitable for small, dense targets.
Recognizing these advantages, significant research has focused on improving the quality and discriminativity of heatmaps. Several advanced schemes have been proposed to better distinguish overlapping targets in dense regions. These include focal inverse distance transform maps (FIDTM) [19], which sharpen heatmap peaks; distance label maps and independent instance maps (IIM) [22,23], which enhance instance discrimination; and connectivity-based binary masks [24], which separate adjacent targets. Meanwhile, a fusion mechanism employing “selective feature inheritance” (STEERER) [25] is adopted to generate a high-quality, scale-aware Gaussian heatmap. Although these methods require post-processing (e.g., peak finding) to extract final coordinates, their superior performance has cemented heatmap regression as the foundational technique for UAV-based crowd analysis.

2.3. UAV-Specific Frameworks

Building upon the heatmap regression paradigm, several frameworks have been specifically designed to address the unique challenges of the UAV perspective.
STNNet [8] was a pioneering integrated framework that combined crowd localization, tracking, and counting for UAV applications. It employed a VGG backbone for multi-scale feature fusion to generate density maps. However, its requirement for phased training of the density map generation and localization networks introduces complexity and potential error accumulation. MFA [9] addressed the issue of weak target features under the UAV perspective by leveraging temporal information. It introduced motion and position maps (MPMs) constructed from multi-frame sequences to encapsulate target positions and movement directions, thereby enhancing the ability of heatmaps to distinguish targets. While effective, this approach relies on consecutive video frames and is not applicable to single-image analysis. UAV-Dot [10] directly tackled the core pain point of “difficulty in capturing small-target features” under the UAV perspective. It critiqued the loss of global context when processing high-resolution images via sliding windows and introduced a pixel distillation (PD) module to retain spatial information of high-resolution feature maps, significantly improving the representational ability of heatmaps for low-pixel small targets. The reviewed UAV-specific frameworks—STNNet, MFA, and UAV-Dot—demonstrate progressive innovations in temporal modeling and feature preservation. However, they share a critical architectural flaw: their reliance on a fixed-parameter Gaussian kernel for heatmap generation. This design inherently conflicts with the reality of “dynamic scale fluctuations” caused by varying UAV altitudes. Consequently, the generated heatmaps become suboptimal—either blurring small, dense targets into undetectable masses or fragmenting large targets into multiple false peaks. Furthermore, none incorporate explicit mechanisms to counteract feature degradation in low-light conditions, leaving target features vulnerable to background noise interference.
Although the aforementioned frameworks are specifically designed for the UAV-based crowd localization challenge, it is also important to recognize the significant progress in general tiny object detection. For instance, Gaussian receptive field-based label assignment (RFLA) [26] addresses the fundamental issue of label assignment for tiny objects by introducing Gaussian receptive field priors, while the Swin-Deformable DEtection Transformer (SD-DETR) [27] demonstrates remarkable performance in an end-to-end fashion based on a transformer architecture. Nevertheless, these detection-based methods exhibit inherent limitations when dealing with point annotation tasks.
Based on the preceding analysis, we will build upon the UAV-Dot framework and investigate solutions to address its limitations in dynamically adjusting the size of Gaussian heatmaps and its insufficient robustness under low-light conditions. In the next section, we will introduce in detail the enhanced UAV-Dot framework designed to resolve these issues.

3. Methods

To collaboratively address the two coupled challenges of “dynamic scale variation” and “low-light feature submergence” in UAV-based crowd localization, this study proposes an enhanced framework. Building upon the UAV-Dot baseline, the framework introduces systematic optimizations at three levels: feature extraction, heatmap generation, and post-processing. The core concept lies in leveraging attention mechanisms to purify input features, employing a scale prediction branch to adaptively modulate the heatmap distribution, and finally refining the output through a scene-aware post-processing strategy. The overall architecture is illustrated in Figure 1.

3.1. Baseline Framework: UAV-Dot

The original UAV-Dot model integrates a pixel distillation (PD) module into a U-Net architecture, which consists of an encoder (MiT-B2 [28]) and a decoder (transposed convolution). The PD module first performs a PixelUnshuffle operation [29]—this reduces spatial resolution while increasing the number of channels to preserve high-resolution features. The tensor is then split along the channel axis, processed by a dataset-specific feature module, and finally concatenated along the channel axis. Subsequently, through U-Net’s skip connections, shallow (low-level, high-resolution) and deep (high-level, semantic) features from the encoder are fused, which helps retain spatial details.
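As a concrete illustration of this pipeline, a minimal PyTorch-style sketch of the PixelUnshuffle–split–process–concatenate flow is given below; the module name, channel counts, group count, and the per-group convolution are illustrative assumptions rather than the original UAV-Dot implementation.

import torch
import torch.nn as nn

class PixelDistillationSketch(nn.Module):
    """Illustrative stand-in for the PD module: trade resolution for channels, then
    process channel groups separately and re-concatenate (hypothetical layer choices)."""
    def __init__(self, in_channels=3, downscale=4, groups=4):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(downscale)            # H x W -> H/4 x W/4, C -> C*16
        unshuffled_c = in_channels * downscale * downscale
        self.group_c = unshuffled_c // groups
        self.branches = nn.ModuleList(
            nn.Conv2d(self.group_c, self.group_c, kernel_size=3, padding=1)
            for _ in range(groups)                               # assumed dataset-specific feature modules
        )

    def forward(self, x):
        x = self.unshuffle(x)                                    # preserve high-resolution detail as channels
        chunks = torch.split(x, self.group_c, dim=1)             # split along the channel axis
        feats = [branch(c) for branch, c in zip(self.branches, chunks)]
        return torch.cat(feats, dim=1)                           # concatenate along the channel axis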

3.2. Scale-Adaptive Gaussian Kernel

As emphasized in the Introduction, dynamic changes in UAV flight altitude cause fluctuations in target scale (Figure 2a,b). This leads to inherent flaws in fixed Gaussian heatmaps: either blurring the localization of small targets resulting in target merging (Figure 2c) or narrowing the range of positive samples for large targets, resulting in false multiple detections (Figure 2d). Thus, Gaussian kernels of varying sizes are required. Figure 2e,f show the heatmaps generated for the target in Figure 2b using Gaussian kernels of different radii.
To address this, a scale prediction branch is added to the network [30]—its output scale is used to adjust the heatmap generated by the fixed Gaussian kernel, thus modifying the ground truth. The formula for the fixed heatmap is shown in Equation (1), where $(x_p, y_p)$ denotes the coordinates of the p-th person, $h^p$ is the ground-truth heatmap for this person, and the coverage range of the heatmap is defined in Equation (2):

$h^{p}_{i,j} = e^{-\left[(i - x_p)^2 + (j - y_p)^2\right]/2\sigma^2}$

$\left|i - x_p\right| - 1 \le 3\sigma, \quad \left|j - y_p\right| - 1 \le 3\sigma$

Here, $\sigma$ represents the standard deviation, and $(i, j)$ denotes the pixel position in $h^p$. For positions where $\left|i - x_p\right| - 1 > 3\sigma$ or $\left|j - y_p\right| - 1 > 3\sigma$, $h^{p}_{i,j} = 0$.
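For clarity, the fixed-kernel supervision of Equations (1) and (2) can be rendered as in the following sketch; the explicit truncation radius and integer point coordinates are assumptions made for illustration.

import numpy as np

def fixed_gaussian_heatmap(points, height, width, sigma=2.0):
    """Render one ground-truth heatmap from head-point annotations (x_p, y_p)."""
    heatmap = np.zeros((height, width), dtype=np.float32)
    radius = int(np.ceil(3 * sigma))                       # coverage range from Equation (2)
    for x_p, y_p in points:
        x0, x1 = max(0, x_p - radius), min(width, x_p + radius + 1)
        y0, y1 = max(0, y_p - radius), min(height, y_p + radius + 1)
        ys, xs = np.mgrid[y0:y1, x0:x1]
        g = np.exp(-((xs - x_p) ** 2 + (ys - y_p) ** 2) / (2 * sigma ** 2))  # Equation (1)
        heatmap[y0:y1, x0:x1] = np.maximum(heatmap[y0:y1, x0:x1], g)  # keep the stronger response where kernels overlap
    return heatmap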
A scale factor s is introduced via the scale branch to modify the standard deviation of the Gaussian function (and thus its coverage range). The adjusted heatmap is shown in Equation (3):
$h^{p}_{i,j} = e^{-\left[(i - x_p)^2 + (j - y_p)^2\right]/2\left(\sigma_0 s_{x_p, y_p}\right)^2}$

Within the range $\left|i - x_p\right| - 1 \le 3\sigma$ and $\left|j - y_p\right| - 1 \le 3\sigma$, $s_{x_p, y_p} \approx s_{i,j}$. Thus, Equation (3) can be rewritten as an element-wise operation:

$h^{p}_{i,j} = e^{-\left[(i - x_p)^2 + (j - y_p)^2\right]/2\left(\sigma_0 s_{i,j}\right)^2}$
However, directly applying s to the Gaussian standard deviation per pixel leads to initialization difficulties, a lack of reliable anchors for training, severe gradient fluctuations, and error amplification between the heatmap and scale branches, making training difficult to converge. To resolve this, the adjusted heatmap is denoted as $H^{\sigma_0 s}$, where s is used to globally adjust the initial fixed heatmap $H^{\sigma_0}$. The calculation is shown in Equation (5):

$H^{\sigma_0 s}_{i,j} = \begin{cases} \left(H^{\sigma_0}_{i,j}\right)^{1/s_{i,j}}, & H^{\sigma_0}_{i,j} > 0 \\ H^{\sigma_0}_{i,j}, & H^{\sigma_0}_{i,j} = 0 \end{cases}$
When $s_{i,j} > 1$, the coverage range of the Gaussian kernel expands; when $s_{i,j} < 1$, the coverage range shrinks. Considering that the primary function of the scale prediction branch is to address the lack of target size information (only point annotations are available), it establishes a dynamic mechanism that adaptively adjusts the Gaussian kernel size in the ground-truth heatmaps based on the target’s scale. This enhances the supervisory signal rather than directly contributing to the network’s forward propagation. After training, the main network learns to generate heatmaps suitable for different scales without explicit scale guidance. Therefore, the scale branch is removed during the testing phase. This prediction branch (added after the decoder) consists of a global average pooling layer and a fully connected layer, as illustrated in Figure 3. The input to this branch is the output from the final decoder layer.
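A minimal sketch of such a scale prediction branch is given below, assuming a PyTorch setting; the input channel count is a placeholder, and only the structure (global average pooling followed by a fully connected layer) follows the description above.

import torch.nn as nn

class ScaleBranch(nn.Module):
    """Predicts a raw scale value from the final decoder feature map (illustrative)."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling
        self.fc = nn.Linear(in_channels, 1)            # fully connected layer -> raw scale

    def forward(self, decoder_feat):
        pooled = self.pool(decoder_feat).flatten(1)    # (B, C)
        return self.fc(pooled)                         # softplus is applied downstream (Section 3.2)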
Equation (5) uses $1/s_{i,j}$ as an exponent, which can cause numerical instability when s is small or the fixed kernel size is tiny. To stabilize training and avoid gradient explosion/vanishing, let $\alpha_{i,j} = 1/s_{i,j} - 1$ and expand Equation (5) via a Maclaurin series to obtain Equation (6):

$H^{\sigma_0 s}_{i,j} = \begin{cases} \dfrac{1}{2} H^{\sigma_0}_{i,j}\left[1 + \left(1 + \alpha_{i,j} \ln H^{\sigma_0}_{i,j}\right)^2\right], & \text{if } H^{\sigma_0}_{i,j} > 0 \\ 0, & \text{if } H^{\sigma_0}_{i,j} = 0 \end{cases}$
To endow the model with greater flexibility in learning the scale factor without pre-defining a hard range, we employ the softplus activation function on the output of the scale branch. The softplus function, defined as softplus(x) = log (1 + exp(x)), ensures the scale factor s is always positive and allows it to adaptively converge to a suitable value range based on the training data. This eliminates the need for manual range calibration and mitigates the risk of suboptimal performance due to an inappropriate preset range.
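Putting Equations (5) and (6) and the softplus activation together, a possible implementation of the ground-truth rescaling is sketched below; the tensor shapes and the clamping constant are assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def rescale_heatmap(h_fixed, scale_logits):
    """h_fixed: fixed-sigma GT heatmap (B, 1, H, W); scale_logits: raw scale-branch output,
    assumed broadcastable to h_fixed (e.g., (B, 1, 1, 1) per image or (B, 1, H, W) per pixel)."""
    s = F.softplus(scale_logits)                        # keep the scale factor s positive
    alpha = 1.0 / s - 1.0                               # alpha = 1/s - 1, as defined above
    positive = h_fixed > 0
    log_h = torch.log(h_fixed.clamp(min=1e-6))          # safe log; only used where h_fixed > 0
    # second-order Maclaurin form of H^(1/s): 0.5 * H * (1 + (1 + alpha * ln H)^2), Equation (6)
    adjusted = 0.5 * h_fixed * (1.0 + (1.0 + alpha * log_h) ** 2)
    return torch.where(positive, adjusted, h_fixed)     # zero pixels stay zero (Equation (5))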
To stabilize the training of the scale module, a regularization loss is introduced (Equation (7)). This loss ensures s does not become excessively large or small and is computed only for valid heatmap pixels ($H^{\sigma_0}_{i,j} > 0$):
$L_{\mathrm{regularizer}} = \left\| (1 - s) \cdot \mathbb{1}\!\left[H^{\sigma_0}_{i,j} > 0\right] \right\|_2^2$
To quantify the approximation error between the second-order Maclaurin expansion in Equation (6) and the non-expanded form in Equation (5), we performed sampling within the practical heatmap value range of H ∈ [0.1, 1] and under the constraint of the scale factor s ∈ [0.8, 1.2] from Equation (7). The analysis yielded a maximum relative error of 4.93% and an average error of merely 0.07%. This level of approximation error is negligible compared to the necessity of ensuring numerical stability during training.
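The following small numerical check mirrors the approximation-error analysis above; the grid resolution is an assumption, while the value ranges follow the text.

import numpy as np

H = np.linspace(0.1, 1.0, 200)[:, None]                       # practical heatmap values
s = np.linspace(0.8, 1.2, 200)[None, :]                       # scale factors consistent with Equation (7)
alpha = 1.0 / s - 1.0
exact = H ** (1.0 / s)                                        # non-expanded form, Equation (5)
approx = 0.5 * H * (1.0 + (1.0 + alpha * np.log(H)) ** 2)     # Maclaurin form, Equation (6)
rel_err = np.abs(approx - exact) / exact
print(f"max relative error: {rel_err.max():.2%}, mean: {rel_err.mean():.2%}")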

3.3. Attention Enhancement for Low-Light Scenarios and Post-Processing Strategy

After converting the test set images in the DroneCrowd dataset to grayscale, 900 images had an average brightness below 0.2, encompassing three typical low-light drone-captured scenarios (see Figure 4). Analysis reveals that in such environments, insufficient illumination, the low pixel ratios of crowd targets (human heads), and background noise can submerge effective features, increasing the difficulty of feature extraction and limiting the accuracy of subsequent Gaussian heatmap generation. To better extract features, we therefore introduce the well-established CBAM attention mechanism. In the U-Net architecture, the encoder features contain rich semantic information and global context, making them well suited to two-stage filtering through channel and spatial attention, whereas the decoder features primarily handle spatial detail restoration, where introducing an attention mechanism may interfere with precise localization and increase computational overhead. Thus, the attention module is embedded after the effective output layers of the U-Net encoder, forming a two-stage “channel masking and spatial focusing” feature enhancement mechanism that progressively refines the features. In Section 4, we compare the effects of embedding the attention mechanism at different positions. The U-Net structure with the CBAM module is shown in Figure 5 below.
The overall workflow of the CBAM module is shown in Figure 6. Its input features are the effective outputs of the MiT-B2 encoder. This dual-attention mechanism enables the network to adaptively capture key information in features and optimize feature responses.
The channel attention module is defined in Equation (8):
$M_c(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)$
The shared MLP is designed to first compress the channel dimensionality to 1/16 of the original dimension through a fully connected layer, incorporating nonlinearity via ReLU activation, and then restore the channel dimensionality to match the original input F. The structure of the channel attention module is illustrated in Figure 7.
The spatial attention module is defined in Equation (9):
$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)\right]\right)\right)$
Its input is the feature map optimized by channel attention. $f^{7 \times 7}$ denotes a 7 × 7 convolution layer, and $\left[\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)\right]$ represents the concatenation of the average-pooled and max-pooled features of F. The structure of the spatial attention module is shown in Figure 8.
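A compact sketch of the CBAM block following Equations (8) and (9) is shown below; the reduction ratio of 16 and the 7 × 7 spatial convolution follow the text, while the remaining implementation details are assumptions.

import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                          # shared MLP for channel attention
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                 # MLP(AvgPool(F))
        mx = self.mlp(x.amax(dim=(2, 3)))                  # MLP(MaxPool(F))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)   # Equation (8): channel mask
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)  # [AvgPool(F); MaxPool(F)]
        return x * torch.sigmoid(self.spatial(pooled))     # Equation (9): spatial focus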
After obtaining the raw predicted heatmap, the following post-processing steps are performed by Algorithm 1:
Algorithm 1. The proposed post-processing algorithm based on a local maxima detection strategy.
    1:    Input: Predicted Gaussian heatmap
    2:    Output: The coordinates of the persons
    3:    function Extract_Position(input)
    4:          normalized_input = Normalize(input)
    5:          pos_ind = maxpooling(normalized_input, size = (3, 3))
    6:          pos_ind = (pos_ind == normalized_input)
    7:          heatmap_matrix = pos_ind × normalized_input
    8:          if max(heatmap_matrix) < Tf then
    9:                  count = 0
  10:                  point = None
  11:          else
  12:                  Ta = 75/255 × max(heatmap_matrix)
  13:                  heatmap_matrix[heatmap_matrix < Ta] = 0
  14:                  heatmap_matrix[heatmap_matrix > 0] = 1
  15:                  point = position(heatmap_matrix == 1)
  16:          end if
  17:          if mean(normalized_input) > Tb then
  18:                  point = DBSCAN(point)
  19:          end if
  20:          return point
  21:    end function
Due to significant variations in heatmap response intensity across different images, fixed thresholds are inherently flawed. This issue is addressed by modifying the threshold to 75/255 × max(M), where max(M) represents the maximum response value of the current heatmap. This adjustment ensures the threshold is proportional to the overall response level: for images with strong responses (e.g., dense crowds), the threshold increases to avoid false detections caused by background noise; for images with weak responses (e.g., sparse crowds), the threshold decreases to prevent missing genuine targets. Experimental validation confirms that a ratio of 75/255 achieves optimal performance (see Section 4).
Meanwhile, experiments reveal that even with the scale-adaptive Gaussian kernel, misdetections may still occur for large targets with extreme poses (e.g., extended arms or splits). It is observed that heatmaps of large, dense crowd targets typically exhibit higher mean values. Based on this observation, the post-processing workflow is further optimized: first, regions with high heatmap mean values (indicating potential large-scale dense crowds) are filtered. Then, the DBSCAN clustering algorithm is applied to these regions to merge over-detected points, thereby correcting localization errors for large targets.
In addition, Tf in Algorithm 1 is set to 0.1: if the maximum value in the heatmap after normalization and max pooling is less than 0.1, no target is reported. When implementing the DBSCAN step, we identify potential large-scale dense crowds by filtering heatmap regions with higher average values; the screening threshold Tb is set to 0.015. If Tb is too small (<0.01), other small targets may be aggregated, leading to missed detections; conversely, if Tb is too large (>0.017), screening of large targets may be incomplete. The choice of Tb is based on the average values of the normalized heatmaps in the test set, as shown in Figure 9 below, where (a) represents small targets and (b) represents large targets. In the figure, “max” is the maximum value of the normalized heatmap and “mean” is its average value.
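An illustrative re-implementation of Algorithm 1 with the thresholds discussed above (Tf = 0.1, Ta = 75/255 × max, Tb = 0.015) is given below; the DBSCAN parameters (eps, min_samples) and the cluster-merging step are assumptions rather than the exact settings used in the paper.

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import DBSCAN

def extract_positions(heatmap, tf=0.1, ratio=75 / 255, tb=0.015, eps=4.0):
    """heatmap: predicted Gaussian heatmap as a 2-D tensor; returns (row, col) coordinates."""
    h = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-6)   # normalize
    pooled = F.max_pool2d(h[None, None], kernel_size=3, stride=1, padding=1)[0, 0]
    peaks = h * (pooled == h)                            # keep only local maxima
    if peaks.max() < tf:                                 # no confident response -> no target
        return np.empty((0, 2))
    ta = ratio * peaks.max()                             # dynamic threshold Ta
    points = torch.nonzero(peaks >= ta).cpu().numpy()    # coordinates of retained peaks
    if h.mean() > tb and len(points) > 0:                # likely large, dense targets
        labels = DBSCAN(eps=eps, min_samples=1).fit_predict(points)
        points = np.stack([points[labels == k].mean(axis=0)   # merge over-detected points per cluster
                           for k in np.unique(labels)])
    return points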
Figure 10 validates the post-processing strategy by comparing different configurations: (a) original UAV-Dot with fixed threshold completely fails to detect any targets in this region (only ground truth points are visible); (b) adaptive threshold with fixed kernel causes multiple false detections on large targets; (c) scale-adaptive kernel reduces errors but retains residual inaccuracies; (d) the combination of the scale-adaptive heatmap and DBSCAN clustering yields the best result among all configurations. It significantly reduces the fragmentation observed in (c), leading to a much closer alignment between predictions and ground truth. Although some localization errors persist in the upper half, these fall within the acceptable matching distance of our evaluation metrics, while still indicating room for further improvement.

4. Experiments and Results

4.1. Dataset and Metrics

4.1.1. Dataset

The experiment was conducted based on the DroneCrowd dataset [8], which covers 70 distinct scenarios and consists of 112 video clips, generating 33,600 high-definition frames (resolution: 1920 × 1080 pixels). Training and testing videos were captured from different geographic locations, with the training set containing 24,600 images and the testing set 9000 images. A total of 4.8 million head center points were annotated, and three key attributes were labeled to characterize scene complexity: illumination (sunny, cloudy, night), target scale (human targets occupy 3 × 3 to 20 × 20 pixels), and crowd density (25 to 455 crowd targets per frame). Based on crowd density, the training set is divided as follows: low density (<50 people) accounts for 12.5%, medium density (50–215 people) accounts for 70%, and high density (>215 people) accounts for 17.5%. In the test set, low density accounts for 3.4%, medium density for 82.0%, and high density for 14.6%. Based on illumination, low-light images (with an average brightness below 0.2 after grayscale conversion) account for 3.66% of the training set, while the rest are normal illumination (brightness between 0.2 and 0.7). In the test set, low-light images account for 10%, with the rest being normal illumination.
It can be observed that the dataset has a relatively large proportion of medium-to-high density cases, and there is an imbalance in illumination distribution.

4.1.2. Metrics

Consistent with the UAV crowd localization benchmark [12], the model is evaluated using localization average precision (L-AP) and its variants, which measure the alignment between predicted points and ground-truth (GT) points. These metrics are determined by a distance threshold and a greedy matching procedure: a prediction point is considered a true positive (TP) if a matching ground-truth point exists within the predetermined pixel range and the prediction score is sufficiently high. If no matching ground truth is found within the specified pixel range, or if the score is too low despite being within range, the point is classified as a false positive (FP). A case is regarded as a false negative (FN) when no prediction point is detected within the predetermined pixel range of an actual ground-truth point. The reported metrics are as follows:
  • L-AP@k: the maximum allowable distance between a predicted point and its matched GT is k pixels (k = 10, 15, 20).
  • L-mAP: the average L-AP across k = 1–25 pixels, providing a comprehensive performance overview.
  • F1 Score@k: the harmonic mean of precision and recall at a specific distance k, balancing accuracy and completeness.
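As a concrete illustration of this greedy matching at a single distance threshold k, consider the sketch below; the score-descending ordering and one-to-one matching follow the description above, while tie-breaking and array layouts are assumptions.

import numpy as np

def match_at_threshold(preds, gts, k):
    """preds: (N, 3) array of (x, y, score) sorted by descending score; gts: (M, 2) array."""
    matched_gt, tp, fp = set(), 0, 0
    for x, y, _ in preds:                                 # greedily match high-scoring predictions first
        if len(gts) > 0:
            d = np.hypot(gts[:, 0] - x, gts[:, 1] - y)
            d[list(matched_gt)] = np.inf                  # each GT point may be matched only once
            j = int(np.argmin(d))
            if d[j] <= k:
                matched_gt.add(j)
                tp += 1
                continue
        fp += 1                                           # no GT within k pixels -> false positive
    fn = len(gts) - len(matched_gt)                       # unmatched GT points -> false negatives
    return tp, fp, fn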

4.2. Implementation Details

For the purpose of comparative experiments, the data augmentation strategy consistent with the original UAV-Dot was applied during training. The loss function builds upon UAV-Dot by incorporating a regularization loss to constrain the scale branch, as shown in Equation (10), where λ is set to 1 in practice. We evaluated the regularization weight λ across a range of values [0.8, 0.9, 1.0, 1.1]. The results indicate that λ = 1.0 achieves the optimal balance, with smaller values (0.8, 0.9) providing insufficient regularization and larger values (1.1) causing over-regularization that hinders scale adaptation. Here, $L_{neg}$, $L_{obj}$, and $L_{reg}$ denote the heatmap loss, the binary cross-entropy loss for predicted points, and the regression loss between predicted points and ground truth points, respectively:
$L_{total} = 0.25\,L_{neg} + L_{obj} + 2\,L_{reg} + \lambda L_{regularizer}$
Moreover, in UAV-Dot, the total loss for the last three layers of the decoder’s outputs is calculated according to the ratio in Equation (11). Here, Loss1 denotes the output of the last layer (i.e., the output with the highest resolution), Loss2 denotes the output of the second-to-last layer, and Loss3 denotes the output of the third-to-last layer:
$\mathrm{total\ loss} = \mathrm{loss}_1 + 0.7\,\mathrm{loss}_2 + 0.3\,\mathrm{loss}_3$
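A brief sketch of how these weights combine is shown below; the dictionary keys and the assumption that the per-term losses are already computed for the last three decoder outputs are illustrative.

def total_loss(losses_per_layer, lam=1.0):
    """losses_per_layer: list of dicts for the last three decoder outputs,
    each with keys 'neg', 'obj', 'reg', 'regularizer' (names are illustrative)."""
    layer_weights = [1.0, 0.7, 0.3]                       # Equation (11): loss1, loss2, loss3
    total = 0.0
    for w, losses in zip(layer_weights, losses_per_layer):
        layer_loss = (0.25 * losses['neg'] + losses['obj']
                      + 2.0 * losses['reg'] + lam * losses['regularizer'])  # Equation (10)
        total = total + w * layer_loss
    return total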
Experiments are conducted on hardware comprising an RTX 4070 GPU (12 GB), an Intel Core i5-13490F CPU, and 64 GB of RAM, with software based on Python 3.10.0, PyTorch 2.7.1, and CUDA 12.8. The maximum number of training epochs is set to 100, and the early stopping patience is 30 epochs. The initial learning rate is set to $4 \times 10^{-4}$, using a cosine annealing schedule with a warm restart strategy. The AdamW optimizer from PyTorch is adopted with weight decay, where the decay coefficient is $1 \times 10^{-4}$.
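For reference, a minimal optimizer/scheduler setup matching these hyperparameters might look as follows; `model` and the restart period parameters are assumptions not specified in the text.

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2          # cosine annealing with warm restarts (T_0, T_mult assumed)
)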

4.3. Experiments

4.3.1. Comparative Experiment

The proposed method is evaluated on the DroneCrowd UAV dataset using the localization average precision (L-AP) metrics for the crowd localization task. Table 1 below presents the detection results of several different networks in terms of L-AP@10, L-AP@15, L-AP@20, and L-mAP.
As shown in Table 1, the proposed enhanced UAV-Dot achieves optimal performance across all evaluation metrics, with a 53.38% L-mAP—outperforming other methods. Its L-AP@20 reaches 64.96%, demonstrating strong localization capability under relaxed thresholds. Compared to the original UAV-Dot, the enhanced version gains approximately 2.0 percentage points across metrics, confirming the effectiveness of the proposed strategies.
Networks designed for fixed-view counting or detection, such as CSRNet and P2PNet, perform poorly in UAV scenarios. CSRNet lacks localization-specific modules required for precise position regression, while P2PNet struggles with the sparse appearance features typical in UAV views despite its good performance on fixed-perspective datasets. STEERER performs well in fixed perspectives by utilizing density maps, but its scale selection mechanism and feature inheritance strategy cannot adapt to the extreme scale variations and tiny target detection in drone perspectives, resulting in poor performance in drone views. Meanwhile, detection networks designed for remote sensing images, such as RFLA and SD-DETR, rely on bounding box annotations and may not be well suited for the point annotations in the DroneCrowd dataset, requiring self-generated pseudo-bounding boxes for regressing center points during training, which ultimately leads to performance degradation. However, it can be observed that SD-DETR still achieves reasonably good performance.

4.3.2. Ablation Experiments, Parameter Analysis, and FPS Measurement

To verify the performance of each module in the enhanced UAV-Dot method, ablation experiments were conducted, and the detailed results are presented in Table 2. Meanwhile, Table 3 presents tests and analysis of the final model’s parameter count and FPS.
Table 2 evaluates the contributions of three core modules—CBAM attention, adaptive Gaussian kernel, and optimized post-processing—through ablation studies.
Embedding CBAM in the U-Net encoder improved L-mAP by 0.92%, with a notable 1.16% gain in L-AP@10, demonstrating its effectiveness in enhancing feature clarity under low-light conditions via channel–spatial attention.
The adaptive Gaussian kernel alone boosted L-mAP by 1.62% by dynamically adjusting heatmap coverage, effectively resolving overlap and fragmentation issues caused by scale variation. Its greater impact on L-AP@15/20 confirms its role in eliminating fragmented heatmaps for large targets.
Combining CBAM with the adaptive kernel achieved synergistic gains, increasing L-mAP by 0.26% over individual modules. This highlights their complementarity: CBAM enhances feature quality while the adaptive kernel ensures proper heatmap sizing. The modest performance improvement is likely attributed to the relative functional independence between modules, the limited overlapping scenarios in the dataset requiring both low-light enhancement and scale adaptation (where target sizes remain largely unchanged under low-light conditions), and the cumulative errors in the multi-step process from feature purification to scale estimation for heatmap optimization.
Adding optimized post-processing (dynamic thresholding and DBSCAN) further raised L-mAP by 0.50%, primarily correcting residual false detections in relaxed localization scenarios.
Parameter analysis (Table 3) confirms that these improvements incur only minimal parameter increases, preserving the model’s lightweight nature. While the FPS reduction is relatively small, the overall FPS remains relatively low.
It can be seen from Table 3 that the enhanced UAV-Dot increases parameters by only 0.1 M (0.36%) during training and 0.326 M (0.29%) during testing. This minimal growth primarily comes from the CBAM module (~0.1 M), while the scale prediction branch adds negligible parameters, and the post-processing requires none.
The frame rate measurements in the table represent averaged results from 20 rounds × 50 iterations of batch testing conducted on an RTX 4070, covering both pure model inference and the complete pipeline (with post-processing). The enhanced model loses approximately 4 fps during the inference phase, and post-processing causes only a small additional frame rate drop for both the baseline and enhanced models. For comparison, testing MFA under the same conditions showed that its pure inference FPS reached a maximum of 47.74, indicating that the baseline inference speed of UAV-Dot still needs improvement.
Overall, the model achieves a 2.38% improvement in L-mAP and consistent gains across the L-AP@k metrics with a minimal parameter increase and only a slight reduction in FPS. It effectively addresses challenges related to scale variation and low-light conditions, making it suitable for lightweight deployment on drones; however, its overall inference speed remains relatively low and requires further improvement.

4.3.3. Robustness Testing Under Different Lighting Conditions and CBAM Position Analysis

To quantitatively evaluate the model’s robustness under different lighting conditions, we divided the DroneCrowd test set into low-light (900 images with mean grayscale value < 0.2) and normal-light (0.2 < mean grayscale value < 0.7) subsets based on illumination levels. We compared the baseline model, the model with CBAM, and our final improved model under different lighting conditions as shown in Table 4. Furthermore, to explore the optimal placement of the CBAM module, we compared its integration at the encoder’s effective output layer (Encoder-CBAM) versus the decoder’s effective output layer (Decoder-CBAM). The obtained results are presented in Table 5 below.
As can be seen from Table 4, which shows performance comparisons under different lighting conditions, the integration of the CBAM module yields a 1.52% performance gain in low-light conditions and a 0.85% gain in normal-light conditions. This indicates that the attention module brings improvements in both lighting scenarios, with the performance gain in low-light conditions being greater than that in normal-light conditions, effectively validating the module’s specific efficacy in addressing degradation issues in low-light environments. The enhanced UAV-Dot model further improves localization capability, achieving a 2.03% gain in low-light conditions and a 2.42% gain in normal-light conditions.
According to Table 5, placing CBAM after the encoder output layer achieves higher L-mAP (51.92% vs. 51.23%) and FPS (15.86 vs. 15.15) compared to placement after the decoder. As described in Section 3.3, encoder features contain rich semantic context suitable for attention mechanisms, whereas decoder features focus on detail restoration, where introducing CBAM may interfere with localization and increase computational overhead.
The table also reveals that although Decoder-CBAM has slightly fewer parameters, its FPS unexpectedly decreases. This can be attributed to the parallel processing in Encoder-CBAM: the encoder simultaneously outputs multiple feature maps at different scales, while the decoder generates features sequentially with each decoding block depending on previous outputs. Additionally, encoder feature maps are smaller due to downsampling, while the decoder feature maps enlarge due to upsampling, which requires increased memory access, thereby causing latency.

4.3.4. Stability Analysis of Equations (5) and (6) and Analysis of the λ Parameter in Equation (10)

The convergence behavior of the F1 score during the training process is shown in Figure 11. The F1 score was calculated according to the criteria described in Section 4.1.2, using a 5-pixel threshold to determine TP, FP, and FN and then derived through its standard formula. The results of Equations (5) and (6) are analyzed, represented by the orange curve and blue curve, respectively.
As observed from the above Figure 11, compared to the exponential form of Equation (5), the Maclaurin expansion in Equation (6) achieves better convergence with a smoother convergence curve. The corresponding L-mAP metrics are 51.46% (Equation (5)) and 52.62% (Equation (6)), respectively, demonstrating the necessity of employing the Maclaurin expansion (Equation (6)).
For the regularization term loss in Equation (10), tests were conducted with λ values of [0.8, 0.9, 1, 1.1], and the results are presented in Table 6 below.
Based on the data in Table 6, the following conclusions can be drawn: when λ is set to 1.0, the model achieves optimal performance across all metrics. The suboptimal performance at λ = 0.9 and λ = 0.8 indicates that a slightly weaker regularization strength (λ = 0.9) leads to a minor decline in all metrics. A further reduction to λ = 0.8 results in a more noticeable performance drop. This trend suggests that insufficient regularization (an overly low λ value) hinders the model’s generalization capability, fails to adequately constrain scale variations, and leads to suboptimal localization accuracy. When the regularization strength is increased to λ = 1.1, performance deteriorates significantly. This demonstrates that excessive regularization (an overly high λ value) over-constrains the model, negatively impacting its learning capacity and final performance. These findings indicate that performance is highly sensitive to the value of λ.

4.3.5. Post-Processing Threshold Optimization Analysis

Table 7 explores the impact of different dynamic thresholds (defined as threshold = r × max heatmap response, where r ranges from 70/255 to 85/255) on model performance, with a focus on the trade-off between localization accuracy (L-AP/L-mAP) and practical utility (F1 score, balancing precision and recall). The baseline here is the “no post-processing” setting.
As shown in Table 7, both L-mAP and L-AP increase monotonically as the threshold r decreases from 85/255 to 70/255. This trend occurs because lower thresholds retain more valid low-response targets (e.g., small heads in low-light scenes), reducing missed detections. The F1 score first increases and then decreases with the threshold parameter r. This indicates that a higher threshold removes some essential points, reducing detection capability, while a lower threshold, although improving recall, introduces more background noise, leading to a significant drop in precision and ultimately a decrease in the F1 score.
Based on the above results, r = 75/255 is selected as the optimal threshold. It balances localization accuracy and noise suppression: L-mAP increases by 0.50% (to 53.38%) compared to the no-post-processing baseline, while F1@10, F1@15, and F1@20 only marginally decrease by 0.33%, 0.28%, and 0.11%, respectively—an acceptable trade-off in practice. In contrast, r = 70/255 achieves the highest AP (L-mAP = 53.87%) but causes a significant F1 drop, with F1@10 and F1@15 decreasing by 0.74% and 0.71% simultaneously; observing the predicted point images reveals a large number of false detections. Meanwhile, r = 85/255 yields an excessively low AP (L-mAP = 50.81%).

4.4. Visual Heatmap Comparison

In Figure 12, (a) and (c) represent heatmaps generated using fixed Gaussian kernels, whereas (b) and (d) represent heatmaps generated using adaptive Gaussian kernels. Here, (a) and (b) correspond to larger-scale targets, while (c) and (d) correspond to smaller-scale targets.
It can be observed that in Figure 12a, when a fixed Gaussian kernel is used, the heatmap response area may be overly concentrated and narrow, leading to multiple peaks within the heatmap of a single target. In contrast, in Figure 12b, after applying the adaptive Gaussian kernel, a single large-scale target produces a more continuous, complete, and single-peak heatmap region. At the same time, in Figure 12c, the heatmap generated with the fixed Gaussian kernel exhibits overlapping and merging of responses. However, in Figure 12d, the use of the adaptive Gaussian kernel enables the network to generate smaller, more concentrated heatmaps for the targets.
Figure 13 illustrates the detection performance under low-light conditions from the DroneCrowd dataset. The first column shows the input images to be detected, with green dots representing the ground-truth annotations. The second column displays the heatmaps predicted by the original UAV-Dot model. It can be observed that its heatmap responses are relatively diffuse and significantly affected by background noise, leading to weak responses at some true target locations or confusion with noise. The third column presents the heatmaps predicted by the model after integrating the CBAM module. A comparative analysis reveals that with the introduction of CBAM, the heatmaps generated by the model exhibit sharper and more concentrated peak responses in crowd target regions, while background noise is effectively suppressed. This indicates that the “channel–spatial” attention mechanism of CBAM effectively enhances the model’s ability to extract key target features under low-light conditions, improving feature discriminability. Consequently, the heatmap generation process becomes more focused on the crowds themselves rather than irrelevant environmental information.

5. Discussion

Experimental results demonstrate that the proposed enhanced UAV-Dot achieves state-of-the-art performance on the DroneCrowd dataset. This success can be directly attributed to our targeted architectural innovations, which systematically address the core limitations of existing methods in UAV-based crowd localization.
The results confirm that the scale-adaptive Gaussian kernel represents a decisive improvement over the fixed-kernel approach employed in prior works such as UAV-Dot and STNNet. By dynamically adjusting the heatmap supervision to match the target size, it effectively resolves the issues of target merging and fragmentation caused by altitude-induced scale variations. This module alone contributed a substantial gain of 1.62% in L-mAP. Simultaneously, the embedded CBAM attention mechanism effectively counteracts feature degradation in low-light conditions, a challenge overlooked by many methods. Its “channel–spatial” focusing mechanism enhances feature discriminability, which is particularly crucial for the strict localization of low-signal targets, as confirmed by the corresponding improvement in L-mAP. The synergy observed between these modules reveals that high-quality features and well-sized supervisory signals are mutually reinforcing. The final optimized post-processing strategy further introduces necessary scene-aware reasoning, correcting residual errors and achieving an optimal balance between high recall (L-mAP) and practical utility (F1 score).
In summary, enhanced UAV-Dot reconciles the contradictory demands of high accuracy and minimal parameter growth. It establishes that addressing the dual challenges of “dynamic scale variations” and “low-light feature submergence” through dedicated components is pivotal for advancing UAV crowd localization. However, this study has several limitations worthy of future exploration. First, regarding enhancement for extreme low-light environments, although the CBAM attention mechanism partially addresses low-light challenges, the visualization results remain suboptimal, and feature extraction for ultra-low illumination requires further optimization. Meanwhile, although the model achieves promising performance on DroneCrowd with minimal parameter increase, its inference FPS remains relatively low, requiring further optimization to better suit practical deployment on drones. Furthermore, while this study demonstrates the effectiveness of our method on the DroneCrowd dataset, its generalization capability across other UAV-based crowd datasets (e.g., the UP-COUNT dataset) remains to be thoroughly validated. We have explicitly designated this cross-dataset evaluation as a critical and immediate goal of our future research.

6. Conclusions

This study introduces an enhanced UAV-Dot to address scale variations caused by UAV altitude changes and feature degradation under low-light conditions for UAV-based crowd localization. First, a scale-aware Gaussian heatmap mechanism with a lightweight prediction branch is adopted. It utilizes the predicted scale factor (s, obtained via softplus regression) to adaptively and dynamically adjust the Gaussian kernel size in the ground truth and is constrained by a regularization loss during training, resulting in a 1.62% increase in L-mAP. Additionally, a CBAM attention module embedded in the effective feature layers (Layers 3–6) of the MiT-B2 encoder enhances feature extraction in low-light conditions through channel–spatial attention, contributing to a 0.92% gain in L-mAP. Notably, the synergistic interplay between the scale-adaptive mechanism and the attention module—where CBAM purifies features for more reliable scale estimation, which in turn guides the generation of more precise heatmaps—results in a combined L-mAP improvement of 1.88%. Finally, an optimized post-processing strategy employing a dynamic threshold (75/255 of the maximum response) and DBSCAN clustering further improves L-mAP by 0.50% while balancing the F1 score. Evaluated on the DroneCrowd dataset, our method achieves an L-mAP of 53.38%—a 2.38% improvement over UAV-Dot—with only a 0.36% increase in parameters during training and 0.29% during testing. Although the FPS reduction caused by the number of parameters is relatively small compared to the baseline, the overall computational speed still needs improvement to better suit deployment on drones.
The performance improvements achieved by enhanced UAV-Dot unlock significant potential for various real-world applications in public safety and urban management. Its capability for accurate localization of individuals in dense, dynamic crowds makes it suitable for real-time monitoring at large-scale public events, providing crucial data for crowd density control and stampede prevention. Furthermore, its enhanced performance in low-light conditions ensures operational capability during nighttime missions. For smart city management, the localization capability can be utilized to analyze pedestrian flow patterns at transportation hubs or urban centers, facilitating better urban planning and resource allocation.
Based on the discussion in Section 5, future research will focus on the following aspects: (1) enhancing network performance under extreme conditions such as low illumination, (2) improving the model’s inference speed and training and testing it on the UP-COUNT dataset, (3) integrating localization with subsequent tracking tasks, and (4) making the post-processing threshold adaptive to different scenarios, given its impact on the results, and deploying the system on UAV platforms for practical application.

Author Contributions

M.Z. was the main author of the work in this paper. M.Z. conceived the methodology of this paper, designed the experiments, collected and analyzed the data, and wrote the paper. F.Z. was responsible for directing the writing of the paper and providing financial support. Y.Z. was responsible for supervising the writing of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in the DroneCrowd dataset at https://github.com/VisDrone/DroneCrowd (accessed on 16 October 2025), reference number [8].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Barr, D.; Drury, J.; Butler, T.; Choudhury, S.; Neville, F. Beyond ‘stampedes’: Towards a new psychology of crowd crush disasters. Br. J. Soc. Psychol. 2024, 63, 52–69. [Google Scholar] [PubMed]
  2. Wang, S.J. Survey of Crowd Crush Disasters and Countermeasures. Prehosp. Disaster Med. 2023, 38, s78. [Google Scholar]
  3. Sharma, A.; McCloskey, B.; Hui, D.S.; Rambia, A.; Zumla, A.; Traore, T.; Shafi, S.; El-Kafrawy, S.A.; Azhar, E.I.; Zumla, A.; et al. Global mass gathering events and deaths due to crowd surge, stampedes, crush and physical injuries-Lessons from the Seoul Halloween and other disasters. Travel Med. Infect. Dis. 2023, 52, 102524. [Google Scholar] [CrossRef] [PubMed]
  4. Lei, Y.; Zhu, H.; Yuan, J.; Xiang, G.; Zhong, X.; He, S. DenseTrack: Drone-Based Crowd Tracking via Density-Aware Motion-Appearance Synergy. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024. [Google Scholar]
  5. Dronova, O.; Parinov, D.; Soloviev, B.; Kasumova, D.; Kochetkov, E.; Medvedeva, O.; Sergeeva, I. Unmanned aerial vehicles as element of road traffic safety monitoring. Transp. Res. Procedia 2022, 63, 2308–2314. [Google Scholar] [CrossRef]
  6. Butila, E.V.; Boboc, R.G. Urban traffic monitoring and analysis using unmanned aerial vehicles (UAVs): A systematic literature review. Remote Sens. 2022, 14, 620. [Google Scholar] [CrossRef]
  7. Zhao, L.; Bao, Z.; Xie, Z.; Huang, G.; Rehman, Z.U. A point and density map hybrid network for crowd counting and localization based on unmanned aerial vehicles. Connect. Sci. 2022, 34, 2481–2499. [Google Scholar] [CrossRef]
  8. Wen, L.; Du, D.; Zhu, P.; Hu, Q.; Wang, Q.; Bo, L.; Lyu, S. Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7808–7817. [Google Scholar]
  9. Asanomi, T.; Nishimura, K.; Bise, R. Multi-Frame Attention with Feature-Level Warping for Drone Crowd Tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 1664–1673. [Google Scholar]
  10. Ptak, B.; Kraft, M. Enhancing people localisation in drone imagery for better crowd management by utilising every pixel in high-resolution images. arXiv 2025, arXiv:2502.04014. [Google Scholar] [CrossRef]
  11. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  12. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1091–1100. [Google Scholar]
  13. Liu, Y.; Shi, M.; Zhao, Q.; Wang, X. Point in, Box Out: Beyond Counting Persons in Crowds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6462–6471. [Google Scholar]
  14. Sam, D.B.; Peri, S.V.; Sundararaman, M.N.; Kamath, A.; Babu, R.V. Locate, Size, and Count: Accurately Resolving People in Dense Crowds via Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2739–2751. [Google Scholar] [PubMed]
  15. Song, Q.; Wang, C.; Jiang, Z.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Wu, Y. Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3345–3354. [Google Scholar]
  16. Liang, D.; Xu, W.; Bai, X. An End-to-End Transformer Model for Crowd Localization. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 235–251. [Google Scholar]
  17. Liu, C.; Lu, H.; Cao, Z.; Liu, T. Point-Query Quadtree for Crowd Counting, Localization, and More. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1676–1685. [Google Scholar]
  18. Chen, I.H.; Chen, W.T.; Liu, Y.W.; Yang, M.H.; Kuo, S.Y. Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  19. Liang, D.; Xu, W.; Zhu, Y.; Zhou, Y. Focal Inverse Distance Transform Maps for Crowd Localization. IEEE Trans. Multimed. 2023, 25, 6040–6052. [Google Scholar] [CrossRef]
  20. Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 532–546. [Google Scholar]
  21. Gao, J.; Han, T.; Yuan, Y.; Wang, Q. Domain-Adaptive Crowd Counting via High-Quality Image Translation and Density Reconstruction. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 4803–4815. [Google Scholar] [CrossRef] [PubMed]
  22. Xu, C.; Liang, D.; Xu, Y.; Bai, S.; Zhan, W.; Bai, X.; Tomizuka, M. AutoScale: Learning to Scale for Crowd Counting. Int. J. Comput. Vis. 2022, 130, 405–434. [Google Scholar] [CrossRef]
  23. Gao, J.; Han, T.; Wang, Q.; Yuan, Y.; Li, X. Learning Independent Instance Maps for Crowd Localization. arXiv 2020, arXiv:2012.04176. [Google Scholar]
  24. Abousamra, S.; Hoai, M.; Samaras, D.; Chen, C. Localization in the Crowd with Topological Constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 1–9. [Google Scholar]
  25. Han, T.; Bai, L.; Liu, L.; Ouyang, W. STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21848–21859. [Google Scholar]
  26. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian Receptive Field Based Label Assignment for Tiny Object Detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 450–467. [Google Scholar]
  27. Liao, Y.K.; Lin, G.S.; Yeh, M.C. A Transformer-Based Framework for Tiny Object Detection. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference, Beijing, China, 12–15 December 2023; pp. 373–377. [Google Scholar]
  28. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  29. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  30. Luo, Z.; Wang, Z.; Huang, Y.; Wang, L.; Tan, T.; Zhou, E. Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13259–13268. [Google Scholar]
Figure 1. Overall architecture of the proposed enhanced UAV-Dot framework. The input image is processed by the pixel distillation (PD) [10] module and a CBAM-enhanced U-Net. The decoder outputs dual branches for heatmap regression and scale prediction, where the latter dynamically constrains the ground truth generation for scale adaptation during training. Final locations are extracted via post-processing.
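To make the dual-branch design in Figure 1 easier to follow, the sketch below shows one plausible way to attach a heatmap head and a scale-prediction head to a shared U-Net decoder feature map. It is a minimal illustration only: the layer widths, activations, and the name `DualHead` are assumptions and do not reproduce the authors' exact implementation.

```python
# Minimal sketch of a dual-branch output head (heatmap + scale), assuming a
# shared U-Net decoder feature map of shape (B, C, H, W). Layer sizes and
# activation choices are illustrative guesses, not the paper's exact design.
import torch
import torch.nn as nn

class DualHead(nn.Module):
    def __init__(self, feat_channels: int = 64):
        super().__init__()
        # Heatmap branch: per-pixel probability of a person center.
        self.heatmap_head = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, 1, 1),
            nn.Sigmoid(),
        )
        # Scale branch: per-pixel target size used to adapt the Gaussian kernel.
        self.scale_head = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, 1, 1),
            nn.ReLU(inplace=True),  # predicted scales are non-negative
        )

    def forward(self, feat: torch.Tensor):
        return self.heatmap_head(feat), self.scale_head(feat)

if __name__ == "__main__":
    head = DualHead(64)
    heatmap, scale = head(torch.randn(1, 64, 128, 128))
    print(heatmap.shape, scale.shape)  # both (1, 1, 128, 128)
```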
Figure 2. (a,b) Examples of targets with different sizes under UAV perspective. (c) Overly large kernel causes merging of small targets. (d) Overly small kernel causes fragmentation of large targets. (e,f) Heatmaps generated with small and large kernels for the target in (b).
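The merging and fragmentation effects in Figure 2 stem from rendering every target with one fixed Gaussian kernel. The sketch below illustrates the scale-adaptive alternative: each annotated point is rendered with its own kernel whose sigma grows with an estimated target size. The size-to-sigma mapping (`sigma_scale`) and the clipping bounds are illustrative assumptions, not the values used in the paper.

```python
# Sketch of scale-adaptive Gaussian ground-truth generation, assuming each
# point annotation comes with an estimated target size in pixels.
import numpy as np

def adaptive_gaussian_heatmap(points, sizes, height, width,
                              sigma_scale=0.25, sigma_min=1.0, sigma_max=8.0):
    """points: list of (x, y); sizes: per-point target size in pixels."""
    heatmap = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for (x, y), size in zip(points, sizes):
        # Larger targets get wider kernels; small targets stay sharp so that
        # nearby heads do not merge into a single blob.
        sigma = np.clip(sigma_scale * size, sigma_min, sigma_max)
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep the per-pixel max over targets
    return heatmap

if __name__ == "__main__":
    hm = adaptive_gaussian_heatmap([(20, 20), (26, 22)], [4.0, 4.0], 64, 64)
    print(hm.shape, hm.max())
```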
Figure 3. Structural diagram of the scale prediction branch, where different colors represent different channels.
Figure 4. Three low-light scenarios (average brightness < 0.2) in the DroneCrowd test set, with enlarged image regions shown so that the people are visible.
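The low-light grouping used in Figure 4 (and later in Table 4) rests on an average-brightness threshold of 0.2. A minimal sketch of such a split is given below, assuming brightness is taken as the mean of a normalized grayscale conversion; the paper does not specify the exact conversion, so the BT.601 weights here are an assumption.

```python
# Sketch of splitting test scenes into low-light and normal-light groups by
# average brightness; the 0.2 threshold follows the paper's grouping criterion,
# the grayscale conversion is an illustrative assumption.
import numpy as np

def average_brightness(image_rgb: np.ndarray) -> float:
    """image_rgb: uint8 array of shape (H, W, 3). Returns brightness in [0, 1]."""
    gray = (0.299 * image_rgb[..., 0] +
            0.587 * image_rgb[..., 1] +
            0.114 * image_rgb[..., 2]) / 255.0
    return float(gray.mean())

def split_by_lighting(images, threshold: float = 0.2):
    low, normal = [], []
    for idx, img in enumerate(images):
        (low if average_brightness(img) < threshold else normal).append(idx)
    return low, normal

if __name__ == "__main__":
    dark = np.full((32, 32, 3), 20, dtype=np.uint8)
    bright = np.full((32, 32, 3), 180, dtype=np.uint8)
    print(split_by_lighting([dark, bright]))  # ([0], [1])
```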
Figure 5. Schematic diagram of embedding CBAM in the encoder of the U-Net architecture.
Figure 6. The architecture of traditional CBAM.
Figure 7. Structure of the channel attention module.
Figure 8. Structure of the spatial attention module.
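As a concrete reference for Figures 6–8, the sketch below reimplements the standard CBAM block of Woo et al. [11]: channel attention from pooled descriptors passed through a shared MLP, followed by spatial attention from a 7 × 7 convolution over channel-wise average and max maps. The wiring into each U-Net encoder stage (Figure 5) is not reproduced here, and the code is a reference sketch rather than the authors' implementation.

```python
# Sketch of a standard CBAM block (channel attention followed by spatial
# attention), following Woo et al. [11].
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max-pooled descriptor
        weight = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weight

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx, _ = x.max(dim=1, keepdim=True)   # channel-wise max map
        weight = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weight

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        return self.spatial_att(self.channel_att(x))

if __name__ == "__main__":
    block = CBAM(64)
    print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```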
Figure 9. Normalized heatmaps used to obtain Tb. (a) Small-target case. (b) Large-target case. In the figure, “max” denotes the maximum value of the normalized heatmap and “mean” denotes its average value.
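Figure 9 shows that the max and mean of the normalized heatmap behave differently for small and large targets, which motivates an adaptive binarization threshold Tb. The sketch below is only one plausible formulation, assuming Tb is a convex combination of the normalized max and mean with a fixed floor; the rule actually used in the paper is defined in the main text and may differ.

```python
# Sketch of an adaptive binarization threshold Tb derived from the normalized
# heatmap's max and mean responses (Figure 9). The convex-combination rule,
# the alpha value, and the fixed floor below are illustrative assumptions.
import numpy as np

def adaptive_threshold(heatmap: np.ndarray, alpha: float = 0.5,
                       floor: float = 75.0 / 255.0) -> float:
    hm = heatmap / (heatmap.max() + 1e-8)   # normalize responses to [0, 1]
    tb = alpha * hm.max() + (1.0 - alpha) * hm.mean()
    return max(tb, floor)                   # never drop below a fixed floor

if __name__ == "__main__":
    hm = np.zeros((64, 64), dtype=np.float32)
    hm[30:34, 30:34] = 1.0
    print(round(adaptive_threshold(hm), 3))
```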
Figure 10. Visual comparison of detection results under different configurations. Green and red dots denote ground truth and predictions, respectively. (a) A fixed baseline threshold fails to detect any targets. (b) With adaptive thresholding, multiple false detections appear on a large target. (c) Scale adaptation reduces, but does not eliminate, these errors. (d) The full configuration gives the best result, with significantly reduced fragmentation and only minor residual errors.
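The post-processing correction evaluated in Figure 10 and Table 7 thresholds the heatmap and groups the surviving pixels into individual head locations. The sketch below illustrates this with DBSCAN clustering of above-threshold pixels, reporting cluster centroids as predicted points; the `eps` and `min_samples` values are illustrative assumptions rather than the paper's tuned settings.

```python
# Sketch of extracting predicted head locations from a thresholded heatmap by
# clustering above-threshold pixels with DBSCAN and taking cluster centroids.
import numpy as np
from sklearn.cluster import DBSCAN

def localize_points(heatmap: np.ndarray, threshold: float,
                    eps: float = 2.0, min_samples: int = 3):
    ys, xs = np.where(heatmap >= threshold)
    if len(xs) == 0:
        return []
    coords = np.stack([xs, ys], axis=1).astype(np.float32)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    centers = []
    for label in set(labels):
        if label == -1:                      # noise points are discarded
            continue
        cluster = coords[labels == label]
        centers.append(tuple(cluster.mean(axis=0)))  # (x, y) centroid
    return centers

if __name__ == "__main__":
    hm = np.zeros((64, 64), dtype=np.float32)
    hm[10:13, 10:13] = 0.9
    hm[40:43, 50:53] = 0.8
    print(localize_points(hm, threshold=0.5))
```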
Figure 11. Training curves for Equations (5) and (6): the orange line shows the results of Equation (5), and the blue line shows the results of Equation (6).
Figure 12. Heatmap visualization of the baseline and scale-adapted approaches, where (a,c) represent the baseline heatmaps for large-scale and small-scale cases, respectively, while (b,d) depict the corresponding heatmaps after introducing scale adaptation. The red response indicates the identified person, with darker colors representing higher heatmap values. Blue and other colors represent background responses.
Figure 13. Heatmap comparisons between the baseline and CBAM-enhanced models under low-light conditions. The first column shows the original images with green dots indicating ground truth (GT), the second column shows the baseline heatmaps, and the third column shows the heatmaps after incorporating CBAM. Red responses indicate identified persons, with darker colors representing higher heatmap values; blue and other colors represent background responses.
Table 1. Performance comparison of different networks on the DroneCrowd dataset.
Method | L-mAP | L-AP@10 | L-AP@15 | L-AP@20
CSRNet | 14.40% | 15.13% | 19.77% | 21.16%
STNNet | 40.45% | 42.75% | 50.98% | 55.77%
P2PNet | 29.44% | 17.76% | 38.42% | 52.71%
MFA | 43.43% | 47.14% | 51.58% | 54.02%
STEERER | 38.31% | 41.96% | 46.58% | 49.07%
RFLA | 32.05% | 34.41% | 39.59% | 42.52%
SD-DETR | 48.12% | 52.56% | 57.35% | 60.08%
UAV-Dot | 51.00% | 57.06% | 60.45% | 62.29%
Enhanced UAV-Dot | 53.38% | 59.07% | 63.11% | 64.96%
Table 2. Ablation experiment.
Adaptive Gaussian Kernel | CBAM | Post-Processing Correction | L-mAP | L-AP@10 | L-AP@15 | L-AP@20
– | ✓ | – | 51.92% | 58.22% | 61.16% | 63.02%
✓ | – | – | 52.62% | 58.34% | 61.99% | 63.83%
✓ | ✓ | – | 52.88% | 58.65% | 62.37% | 64.25%
✓ | ✓ | ✓ | 53.38% | 59.07% | 63.11% | 64.96%
Table 3. Parameter analysis and FPS calculation.
Model | Training Parameters | Testing Parameters | Inference FPS | Overall FPS
UAV-Dot | 27.5 M | 110.015 M | 19.63 ± 0.22 | 19.27 ± 0.24
Enhanced UAV-Dot | 27.6 M | 110.314 M | 15.49 ± 0.58 | 15.18 ± 0.16
Table 4. Model analysis under different lighting conditions.
Lighting Conditions | UAV-Dot | UAV-Dot with CBAM | Gain | Enhanced UAV-Dot | Gain
Low-light group | 17.10% | 18.62% | +1.52% | 19.13% | +2.03%
Normal-light group | 52.80% | 53.65% | +0.85% | 55.22% | +2.42%
Table 5. Analysis of CBAM at different positions.
Position of CBAM | L-mAP | Training Parameters | Testing Parameters | FPS
After the effective output layer of the encoder | 51.92% | 27.6 M | 110.209 M | 15.86 ± 0.45
Decoder output | 51.23% | 27.5 M | 110.060 M | 15.15 ± 0.28
Table 6. Analysis of λ values.
λ | L-mAP | L-AP@10 | L-AP@15 | L-AP@20
0.8 | 51.63% | 57.14% | 61.17% | 62.74%
0.9 | 52.01% | 57.68% | 61.25% | 63.37%
1.0 | 52.62% | 58.34% | 61.99% | 63.83%
1.1 | 50.58% | 56.12% | 59.82% | 61.96%
Table 7. Post-processing threshold analysis (after CBAM and adaptive kernel).
Post-Processing Threshold | L-mAP | L-AP@10 | L-AP@15 | L-AP@20 | F1 Score@10 | F1 Score@15 | F1 Score@20
No post-processing | 52.88% | 58.65% | 62.37% | 64.25% | 67.57% | 70.78% | 72.48%
85/255 | 50.81% | 56.33% | 60.01% | 62.27% | 67.16% | 70.40% | 72.33%
80/255 | 51.93% | 57.93% | 61.46% | 63.47% | 67.34% | 70.60% | 72.51%
75/255 | 53.38% | 59.07% | 63.11% | 64.96% | 67.24% | 70.50% | 72.37%
70/255 | 53.87% | 59.27% | 63.53% | 65.92% | 66.83% | 70.07% | 71.91%
