Article

Adaptive Clustering-Guided Multi-Scale Integration for Traffic Density Estimation in Remote Sensing Images

1 School of Computer Technology and Application, Qinghai University, Xining 810016, China
2 College of Mathematics and Computer Science, Yan’an University, Yan’an 716000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(16), 2796; https://doi.org/10.3390/rs17162796
Submission received: 8 July 2025 / Revised: 8 August 2025 / Accepted: 11 August 2025 / Published: 12 August 2025

Abstract

Grading and providing early warning of traffic congestion density is crucial for the timely coordination and optimization of traffic management. However, current traffic density detection methods primarily rely on historical traffic flow data, resulting in ambiguous thresholds for congestion classification. To overcome these challenges, this paper proposes a traffic density grading algorithm for remote sensing images that integrates adaptive clustering and multi-scale fusion. A dynamic neighborhood radius adjustment mechanism guided by spatial distribution characteristics is introduced to ensure consistency between the density clustering parameter space and the decision domain for image cropping, thereby addressing the issues of large errors and low efficiency in existing cropping techniques. Furthermore, a hierarchical detection framework is developed by incorporating a dynamic background suppression strategy to fuse multi-scale spatiotemporal features, thereby enhancing the detection accuracy of small objects in remote sensing imagery. Additionally, we propose a novel method that combines density analysis with pixel-level gradient quantification to construct a traffic state evaluation model featuring a dual optimization strategy. This enables precise detection and grading of traffic congestion areas while maintaining low computational overhead. Experimental results demonstrate that the proposed approach achieves average precision (AP) scores of 32.6% on the VisDrone dataset and 16.2% on the UAVDT dataset.

1. Introduction

As urbanization accelerates, the issue of traffic congestion has become increasingly severe, representing one of the key factors limiting sustainable urban development [1]. Accurate monitoring and analysis of traffic density is crucial for traffic management, urban planning, and environmental protection [2]. By analyzing the spatiotemporal distribution patterns of traffic flow and density, effective early warning information for traffic congestion can be provided, offering scientific support for optimizing road network resource allocation and formulating congestion mitigation policies. Traditional traffic density monitoring methods mainly rely on ground sensors and traffic cameras. While these methods can provide real-time data to some extent, they are limited in terms of large-area coverage and data update speed [3]. With the advancement of drone technology and its deep integration into low-altitude resource utilization, remote sensing image acquisition via drone platforms has emerged as a novel solution for urban traffic management. Remote sensing technology, with its extensive surface coverage, high-frequency data acquisition, and real-time information updating capabilities, significantly enhances traffic state perception [4,5]. By analyzing road features and vehicle distribution in remote sensing imagery, automated estimation of traffic flow and density can be achieved. Traffic density grading is a core focus of traffic flow research. Accelerated urbanization intensifies congestion, demanding more scientific and precise grading systems to assess traffic conditions. Such systems provide a bridge between theoretical research and practical applications, offering critical support for traffic management, control measures, and planning decisions. However, the complexity and diversity of remote sensing images, such as varying sizes of traffic targets and different types of roads, pose new challenges for the accurate classification of traffic density.
Current research predominantly employs traffic density grading models based on image processing methods and machine learning algorithms [6,7,8]. Although these models yield baseline accuracy under standard conditions, they face significant difficulties in representing small targets during remote sensing image interpretation. Small targets account for a large proportion of the objects in such images, and traditional image processing techniques often struggle with the identification and localization of these small targets. To address this issue, existing studies often use image preprocessing techniques to enhance target information, thereby improving the signal-to-noise ratio for small targets [9,10,11]. Among these techniques, image cropping based on spatial domain partitioning is a popular approach. However, the limitations of this cropping method significantly affect the accuracy of traffic density grading. The most common cropping techniques involve rigid grid segmentation [12,13] and density-driven masks [14]. As shown in Figure 1, the former uses equidistant grids to divide the original image uniformly, offering the advantage of low time complexity but severely disrupting spatial continuity, especially in areas with curved roads or dense traffic. In such cases, vehicles at the boundaries of cropped sub-images are incomplete, directly affecting the accuracy of subsequent detection and counting. The latter technique, while capable of generating adaptive masks based on density heatmaps, often requires the additional development of a density prediction network for joint training, which consumes considerable computational resources, reduces computational efficiency, and complicates real-time density detection. Therefore, to improve the accuracy and speed of traffic density grading, it is essential to design a spatial partitioning mechanism that maintains spatial continuity while also being computationally efficient and able to enhance and integrate local image information.
In summary, the use of drone remote sensing image technology can significantly expand the scope of traffic density monitoring in intelligent transportation contexts, providing broader monitoring coverage to address issues such as insufficient detection of congestion areas and limitations in the coverage of detected regions. However, challenges such as low small-target detection accuracy in drone remote sensing images, disruption of spatial continuity caused by existing image cropping techniques, and computational inefficiency represent significant bottlenecks that hinder the improvement of traffic density grading accuracy and real-time performance. Therefore, this paper proposes a remote sensing image-based traffic density grading algorithm that utilizes adaptive density clustering and multi-stage training integration. This method incorporates a multi-stage training and multi-scale inference approach to enhance the detection accuracy of targets in remote sensing images while effectively and efficiently grading traffic density. The main contributions of this paper include the following three points:
(1)
An adaptive density-based clustering and pruning algorithm constrained by dual spatial factors is proposed. By dynamically adjusting the neighborhood radius and aligning the density clustering parameter space with the pruning region, the proposed method enables adaptive enhancement processing of remote sensing imagery. It overcomes the dependence on deep neural network training found in conventional approaches, leading to reduced computational overhead and time consumption.
(2)
A dynamic traffic density zoning technique based on density analysis and pixel-level gradient quantization is proposed. By incorporating precise target localization and a dual-optimization mechanism, a traffic state evaluation model with visual interpretability is developed. This model enhances the accuracy of dense object detection and the interpretability of the decision-making process, enabling a reliable regional congestion assessment strategy.
(3)
A collaborative optimization strategy that incorporates multi-stage training and multi-scale inference is proposed. A hierarchical detection framework connecting raw input data with region-of-interest refinement is established. Through a dynamic fine-tuning strategy for suppressing background interference and a multi-level detection mechanism, this method achieves accurate localization of traffic-related targets.

2. Related Works

This section provides a brief review of the relevant literature, including object detection in remote sensing images, techniques for density map estimation, and the principal existing methods for traffic density grading.

2.1. Target Detection

Object detection in remote sensing imagery represents a critical interdisciplinary field combining remote sensing technology with computer vision, with the goal of automatically identifying and localizing targets such as buildings, roads, vehicles, and vegetation in high-resolution images. Despite its importance, this task presents several challenges. First, the high spatial resolution of remote sensing images leads to large image dimensions and dense target distributions, requiring algorithms with high computational efficiency and rapid inference capabilities [15]. Second, the wide range of object sizes—from tall buildings to small vehicles—poses significant challenges for multi-scale detection. Moreover, complex backgrounds, occlusions, and the tendency of objects to blend into their surroundings further complicate detection tasks. The diverse appearances, shapes, and textures of different object categories also demand robust recognition capabilities from detection algorithms [16,17]. Additionally, the vast data volumes and high annotation costs associated with remote sensing imagery limit the development of large-scale datasets and hinder effective model training.
In terms of methods, early target detection relied primarily on traditional machine learning techniques such as Histogram of Oriented Gradients (HOG) [18], Scale-Invariant Feature Transform (SIFT) [19], and other handcrafted feature extraction methods combined with classifiers such as Support Vector Machine (SVM) [20]. While these methods worked well in simpler scenarios, their performance in complex remote sensing images was limited. In recent years, deep learning approaches—especially target detection methods based on Convolutional Neural Networks (CNN) [21,22] and transformer models [23]—have made remarkable progress. Two-stage detectors such as Faster R-CNN [24] generate candidate regions first, then classify and regress, providing high-precision results. On the other hand, one-stage detectors such as the YOLO series [25], Single-Shot MultiBox Detector (SSD) [26], and RetinaNet [27] (which addresses foreground–background class imbalance via focal loss) enable real-time detection at faster speeds. Additionally, the Efficient DETR model developed by Yao et al. [28] uses self-attention mechanisms for global feature modeling, making it more effective in complex scenarios. Detecting moving objects such as vehicles and ships presents unique challenges in remote sensing, including motion blur, deformation, and complex trajectories. These are often addressed using techniques such as optical flow, Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), or specialized attention mechanisms [29]. Lightweight networks such as MobileNet [30] (proposed by Howard et al.) and EfficientNet [31] (by Tan et al.) help to reduce computational resource consumption while maintaining strong detection performance, making them suitable for resource-limited applications. Zhang et al. [32] proposed a lightweight distillation network called APDNet which enables efficient and accurate aerial personnel detection on edge devices while maintaining a balance between real-time performance and detection accuracy.

2.2. Density Map Estimation

Density map estimation is a crucial computer vision task that generates continuous density maps such as crowd density or vehicle density by analyzing the distribution of targets in images or videos [33]. Early methods for density map estimation primarily relied on traditional algorithms that used manually designed features and thresholds. For example, the thresholding method based on grayscale histograms distinguishes dense and sparse areas by analyzing the grayscale distribution of an image; however, this approach is highly sensitive to lighting variations [34]. The optical flow method estimates density based on the changes in optical flow of moving targets. It is commonly used in the case of dynamic scenes such as crowd movement, but is less effective for static targets or in low-light environments [35]. In terms of spatial interpolation, Kernel Density Estimation (KDE) non-parametrically estimates density using a probability distribution model, although it is computationally intensive. Mean-shift clustering iteratively identifies density peaks to partition regions, but is prone to interference from noise. Traditional machine learning models such as Support Vector Machine (SVM) and random forests rely on handcrafted features such as target area or spacing, resulting in weaker generalization abilities.
With the advent of deep learning, data-driven methods have become the dominant approach. Monocular vision-based density estimation models such as CSRNet utilize multi-scale residual blocks to capture crowd density features, then incorporate pooling operations to extract contextual information [36]. To enhance the performance of density map generation, Zhang et al. [37] introduced geometric adaptation and fixed convolution kernels, while using Gaussian convolution to generate the density maps. In addition, Ding et al. [38] employed the U-Net and Hourglass architectures in their density map generation models, producing density heatmaps that were optimized through pixel-level loss functions. Yang et al. [39] combined density maps with residual estimation methods to improve crowd counting accuracy. Wang et al. [40] further improved the quality of their generated density maps by incorporating sparse constraints inspired by manifold learning.

2.3. Density Zone Grading

Traditional traffic density grading methods mainly rely on vehicle flow and speed. Classic methods such as the traffic flow–density model generally describe traffic conditions by dividing traffic flow into different density ranges based on the relationship between lane capacity, vehicle speed, and traffic flow [41,42,43]. In these methods, traffic density is typically defined as the number of vehicles per unit length of road. Traffic conditions are then classified into multiple levels based on variations in vehicle speed and density.
Although these traditional grading methods based on traffic flow theory provide some insights into traffic conditions, they fail to adequately account for the complexity of traffic networks and the differences between various types of roads. To address this limitation, emerging research increasingly focuses on grading methods that integrate micro-scale simulation and complex network analysis. These advanced approaches enable more precise grading of traffic density by capturing the evolution of traffic flow while considering factors such as road segments, road conditions, and temporal variations. This results in the generation of more granular density levels, providing a better understanding of the microscopic characteristics of traffic flow.

3. Methodology

3.1. Overall Framework

This study proposes a traffic density grading method for remote sensing images based on adaptive density clustering and multi-stage training integration, providing reliable evidence for key scientific challenges in traffic scheduling and safety within intelligent transportation. The core approach of this method involves four stages: data preparation, model training, target localization, and density search and grading (as shown in Figure 2). The entire method implements a traffic density region search and early warning system using the collaborative mechanisms of deep learning and computer vision.
In the data preparation phase, a dynamic cropping mechanism is introduced. The annotated original dataset is fed into an adaptive density clustering cropping algorithm under dual spatial constraints to perform density clustering on original images. By cropping high-density regions to reduce redundant background information, an image sequence of high-density effective regions is constructed into a region-enhanced dataset, which serves as preparation for subsequent model training. In the model training phase, a cascaded training strategy is adopted. The first stage involves pretraining a benchmark model on the original image dataset to obtain an initial model. The second stage implements parameter fine-tuning, in which the cropped enhanced dataset is input to the initial model for adaptive parameter adjustment, ultimately yielding an optimized model. In the target localization phase, a multi-scale joint inference mechanism is constructed. Specifically, the original image is first input into the model to obtain detection results for the original dataset. A homologous density clustering algorithm is then employed to generate a region focus map, while secondary inference is conducted via the neural network model to acquire local detection results for the cropped region’s focus map. Subsequently, spatial distribution modeling is performed on the detection results of the original dataset and local results to construct a spatiotemporally consistent representation space for traffic elements. These two types of detection results are spatially fused to achieve spatial localization of targets, which effectively improves detection accuracy for traffic-dense areas in large-view remote sensing images. The density search and classification phase utilizes a region recognition algorithm based on the novel proposed density search structure and a pixel-level density gradient quantization method, providing a reliable basis for macro-level traffic scheduling and traffic congestion mitigation.

3.2. Adaptive Pruning of Density-Based Clustering

This paper presents an adaptive density clustering pruning algorithm under dual space constraints, with the core processing flow illustrated in Figure 3. The method performs intelligent detection and image optimization of target-dense regions through a three-stage process. First, a dynamic density clustering algorithm with an adaptively adjusted neighborhood radius is employed for analyzing the distribution characteristics of the target. Second, clustering parameters are dynamically adjusted based on local density characteristics to adaptively optimize the clustering process. Finally, the optimal detection region is generated by integrating spatial distribution and density features, improving the accuracy and efficiency of cropping in high-density regions of images.
  • Analysis of target distribution characteristics and fundamental principles of DBSCAN.
We combine the core principles of the DBSCAN algorithm with the k-nearest neighbors (KNN) algorithm to achieve density-based clustering through a set of dual constraints. The following delineates the core philosophy underlying the DBSCAN algorithm [44].
As illustrated in Figure 4, DBSCAN is a density-based clustering method capable of identifying clusters of arbitrary shape and effectively recognizing and ignoring noise points.
The core principle of the algorithm revolves around two parameters: a neighborhood radius $\epsilon$ and a minimum number of points $\mathit{minPts}$. A data point $P$ is classified as a core point if the number of samples within its $\epsilon$-neighborhood is at least $\mathit{minPts}$, as formalized in Equation (1):
$$\mathrm{core}(P) \iff \left| \{\, q \in D \mid \operatorname{dist}(P, q) \le \epsilon \,\} \right| \ge \mathit{minPts} \tag{1}$$
where $D$ is the dataset and $\operatorname{dist}(P, q)$ is the distance between points $P$ and $q$.
Clusters are formed by connecting points based on density reachability. A point $q$ is directly density-reachable from a core point $P$ if it lies within the $\epsilon$-neighborhood of $P$, as expressed in Equation (2):
$$\mathrm{DirectlyDensityReachable}(q, P) \iff \mathrm{core}(P) \land \operatorname{dist}(P, q) \le \epsilon \tag{2}$$
The clustering process expands recursively, as follows:
  • If q is directly density-reachable from a core point P, then q belongs to the same cluster as P.
  • A point $q$ is density-reachable from $P$ if there exists a chain of points $p_1, p_2, \ldots, p_n$ with $p_1 = P$ and $p_n = q$ in which each $p_{i+1}$ is directly density-reachable from $p_i$ and each $p_i$ is a core point.
  • A point $q$ is density-connected to $P$ if there exists a core point $o$ such that both $P$ and $q$ are density-reachable from $o$.
The algorithm starts with an arbitrary unvisited core point, then adds all density-reachable points to a cluster and marks them as visited. This process repeats with the remaining unvisited core points. Any point that is not density-reachable from any core point and is not itself a core point is classified as noise.
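For illustration, a minimal sketch of this clustering step is given below, using scikit-learn's DBSCAN on detected target centers; the function name and the assumption that centers arrive as an (n, 2) array are illustrative choices for the example rather than details taken from the paper.

```python
# Minimal sketch of the DBSCAN clustering step, assuming target centers are
# available as an (n, 2) NumPy array; names are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_target_centers(centers, eps, min_pts=5):
    """Group target centers into density clusters; noise points are dropped."""
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(centers)
    clusters = {}
    for label in np.unique(labels):
        if label == -1:                     # DBSCAN marks noise points as -1
            continue
        clusters[int(label)] = centers[labels == label]
    return clusters
```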
Building upon the core principles of DBSCAN outlined above, we propose an improved dynamic density clustering visual perception optimization algorithm. By dynamically adjusting ϵ and incorporating a dual-space constraint optimization strategy, the proposed algorithm can be utilized more effectively in image cropping tasks.
  • Dynamic adjustment of the clustering parameters.
To ensure that DBSCAN clustering can adapt to various data distributions, it is crucial to select an appropriate ε value. If ε is set too small, the clusters will be excessively fragmented; if ε is set too large, different clusters will be merged. To address the parameter sensitivity issue in traditional DBSCAN for UAV image processing, a dynamic parameter estimation method based on k-nearest neighbor distances is proposed. Let the target center coordinate set be as shown in Equation (3).
$$P = \{\, p_i \mid p_i = (x_i, y_i) \,\}_{i=1}^{n} \tag{3}$$
Next, a distance matrix $D$ is constructed in which $D_{ij}$ denotes the distance between points $p_i$ and $p_j$, as defined in Equation (4).
$$D = \begin{pmatrix} \operatorname{dist}(p_1, p_1) & \operatorname{dist}(p_1, p_2) & \cdots & \operatorname{dist}(p_1, p_n) \\ \operatorname{dist}(p_2, p_1) & \operatorname{dist}(p_2, p_2) & \cdots & \operatorname{dist}(p_2, p_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{dist}(p_n, p_1) & \operatorname{dist}(p_n, p_2) & \cdots & \operatorname{dist}(p_n, p_n) \end{pmatrix} \tag{4}$$
For each point $p_i$, we can use the distance matrix $D$ to determine its distance to the other points and select the $k$ closest points, as shown in Equation (5).
$$D_i^{(k)} = \operatorname{sort}\left( D_{i,1}, D_{i,2}, \ldots, D_{i,n} \right)[:k] \tag{5}$$
For each point $p_i$, this paper takes the distance to its $k$-th nearest neighbor as the maximum neighborhood distance, denoted $d_{i,\max}$, as shown in Equation (6).
$$d_{i,\max} = D_i^{(k)}[k] \tag{6}$$
The dynamic adjustment strategy for the value of $k$ is shown in Equation (7).
$$k = \max\left( k_{\mathrm{base}},\ \beta n \right) \tag{7}$$
In the above equation, $k_{\mathrm{base}}$ is the base parameter, with a default value of 5, $\beta$ is the sample-ratio coefficient, set to 0.05, and $n$ is the total number of targets in the current image. The mean and standard deviation of the maximum neighborhood distances for all points are calculated as shown in Equations (8) and (9):
$$\mu_d = \frac{1}{n} \sum_{i=1}^{n} d_{i,\max} \tag{8}$$
$$\sigma_d = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( d_{i,\max} - \mu_d \right)^2 } \tag{9}$$
where $\mu_d$ is the mean of all maximum neighborhood distances and $\sigma_d$ is the standard deviation. The dynamic neighborhood radius is defined as shown in Equation (10):
$$\varepsilon = \mu_d + \alpha \sigma_d \tag{10}$$
in which α is the dynamic adjustment coefficient, which is set to 0.5 in this algorithm. Figure 5 illustrates the effect of varying the α parameter. Setting α too low results in the failure to identify certain dense regions, while setting α too high preserves excessive background areas, thereby compromising detection performance. Consequently, we selected a moderate value of 0.5. This method computes an image-specific dynamic radius ε based on global density statistics. By automatically adapting ε to the overall density characteristics of each image, the approach enables adaptive clustering of data with varying densities across different images while maintaining spatial uniformity within individual images.
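The computation in Equations (3)–(10) can be summarized in a short routine; the sketch below is an illustrative reading of those equations (the rounding of $\beta n$ to an integer is an implementation assumption), not the authors' released code.

```python
# Illustrative computation of the dynamic neighborhood radius (Eqs. 3-10).
import numpy as np

def dynamic_epsilon(centers, k_base=5, beta=0.05, alpha=0.5):
    n = len(centers)
    k = max(k_base, int(beta * n))                    # Eq. (7), beta*n truncated
    # Pairwise Euclidean distance matrix D (Eq. 4).
    D = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    # Sorted distances per point; index 0 is the point itself (distance 0),
    # so the k-th nearest neighbor sits at index min(k, n - 1)  (Eqs. 5-6).
    d_max = np.sort(D, axis=1)[:, min(k, n - 1)]
    mu_d, sigma_d = d_max.mean(), d_max.std()         # Eqs. (8)-(9)
    return mu_d + alpha * sigma_d                     # Eq. (10)
```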
  • Generating the optimal detection region.
In the region generation phase, a dual spatial constraint mechanism is employed; the geometric constraint requires that the candidate region contain at least a specified number of detection targets, while the size constraint specifies that the minimum side length is 70 pixels, ensuring that the output region satisfies both spatial integrity and detection validity. Actively filtering out regions that fail to meet the aforementioned two criteria helps to ensure output quality. Specifically, utilizing a large number of small low-density regions not only fails to significantly improve detection accuracy but also consumes computational resources during subsequent training. Filtering out these substandard candidate regions guarantees that the final output regions possess detection effectiveness while simultaneously enhancing the efficiency of subsequent analysis. By utilizing the density clustering method, the algorithm adaptively selects the neighborhood radius ε , accurately identifying high-density target regions in the image. The algorithm does not require prior specification of the number of clusters; instead, it automatically adjusts the value based on the target distribution in the image, demonstrating strong robustness and scalability. Subsequently, through image cropping and coordinate mapping operations, the cropped high-density enhanced image is constructed into a new enhanced high-density region dataset, providing effective data support for the subsequent model fine-tuning phase.
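A sketch of the dual spatial constraints is given below; the minimum target count of 3 is a placeholder, since the paper only states that a region must contain at least a specified number of detections, while the 70-pixel minimum side length follows the text.

```python
# Sketch of region generation under the dual spatial constraints; the
# minimum-target threshold of 3 is a placeholder value.
def generate_regions(clusters, min_targets=3, min_side=70):
    regions = []
    for pts in clusters.values():
        if len(pts) < min_targets:            # geometric constraint
            continue
        x0, y0 = pts.min(axis=0)
        x1, y1 = pts.max(axis=0)
        # Size constraint: pad each side up to the 70-pixel minimum.
        if x1 - x0 < min_side:
            pad = (min_side - (x1 - x0)) / 2
            x0, x1 = x0 - pad, x1 + pad
        if y1 - y0 < min_side:
            pad = (min_side - (y1 - y0)) / 2
            y0, y1 = y0 - pad, y1 + pad
        regions.append((float(x0), float(y0), float(x1), float(y1)))
    return regions
```

In practice, the padded box would also be clipped to the image bounds before the corresponding region is cropped.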

3.3. Joint Optimization of Multi-Stage Training and Multi-Scale Inference

The original dataset is augmented using the dynamic density clustering visual perception optimization algorithm discussed earlier. An affine transformation model is established to map the original annotation coordinate system to the sub-region, maintaining coordinate correspondence between the original and locally enhanced datasets. Next, a framework integrating processing of the original data and focused regions is developed for automatic target recognition and localization. Finally, strategies for model training and inference optimization are implemented. These include utilizing the enhanced dataset for training, integrating detections from both the original data and focused regions, and modifying the Non-Maximum Suppression (NMS) algorithm to prevent duplicate detections. The specific structure of the methodology is shown in Figure 6.
In the model training phase, a validated deep neural network architecture can be selected to optimize the parameters of the initial training set. Several neural network models with excellent performance have emerged in the field of object detection, such as Faster R-CNN, the YOLO series, and RetinaNet. Due to the YOLO model’s strong adaptability, high real-time performance, and superiority in small object detection, in this paper we select the YOLO series as the base model. Initially, the model is trained sufficiently on the original annotated dataset to achieve preliminary parameter convergence. Then, the locally enhanced dataset constructed using the density clustering pruning algorithm is introduced. This dataset is generated using a multi-scale cropping data augmentation strategy. A two-phase training strategy is employed; in the first phase, we initialize the backbone network with pretrained weights (e.g., weights pretrained on the COCO dataset) and train it on the original dataset until the model fits the data. After this initial training, the backbone network parameters are frozen. In the second phase, we use the high-density region-enhanced dataset (cropped image dataset) as input and unfreeze all network layers to perform fine-tuning based on the already fitted weights from the first phase. Notably, experimental results show that the fine-tuning process in the second phase achieves rapid convergence (within approximately 20 training epochs), significantly accelerating the overall training efficiency compared to training from scratch.
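As a concrete illustration, the two-phase schedule can be written with the Ultralytics training interface; the dataset YAML names are placeholders, and this sketch only mirrors the hyperparameters reported in Section 4.2 rather than reproducing the authors' training scripts.

```python
# Hedged sketch of the two-phase training schedule; dataset YAMLs are placeholders.
from ultralytics import YOLO

model = YOLO("yolo11l.pt")  # COCO-pretrained weights

# Phase 1: fit the model on the original annotated dataset.
model.train(data="visdrone_original.yaml", epochs=250, batch=16, lr0=0.001)

# Phase 2: fine-tune on the cropped high-density dataset with all layers
# trainable, continuing from the phase-1 weights held by `model`.
model.train(data="visdrone_cropped.yaml", epochs=20, batch=16, lr0=0.001)
```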
In the model inference phase, the trained detection model is first applied to the original input dataset to obtain object detection results at the global scale through forward propagation. Then, the same density clustering cropping algorithm is applied dynamically to the original data. During this application process, a set of enhanced sub-images focusing on critical local areas is generated. The pretrained model is then deployed on these dynamically generated local sub-images for second-pass inference, yielding refined local detection results.
In the target detection result fusion phase, the original dataset images and locally augmented images are matched by image name; based on the image coordinate information, the detection boxes from the locally augmented images are then mapped to the coordinate system of the original dataset images, achieving fusion of the detection results across data sources. This process is illustrated in Equation (11):
$$\begin{pmatrix} \mathrm{orig}_{\mathrm{left},i} \\ \mathrm{orig}_{\mathrm{top},i} \end{pmatrix} = \begin{pmatrix} \mathrm{new}_{\mathrm{left},i} \\ \mathrm{new}_{\mathrm{top},i} \end{pmatrix} + \begin{pmatrix} \mathrm{left}_i \\ \mathrm{top}_i \end{pmatrix} \tag{11}$$
where the coordinates of the top-left corner of the cropped region are $(\mathrm{left}_i, \mathrm{top}_i)$, while $\mathrm{new}_{\mathrm{left},i}$ and $\mathrm{new}_{\mathrm{top},i}$ represent the new coordinates relative to this corner. To prevent duplicate detection of the same object, the Non-Maximum Suppression (NMS) algorithm is applied for deduplication. The core of the NMS algorithm is calculation of the Intersection over Union (IoU) for each pair of detection boxes, and its calculation formula is shown in Equation (12):
$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{12}$$
where $A$ and $B$ represent two detection boxes, with $|A \cap B|$ denoting the intersection area of the two boxes and $|A \cup B|$ representing their union area. If the IoU value between the two boxes exceeds the set threshold (0.7 in this study), they are considered duplicates and the box with the lower confidence score is removed. This approach prevents duplicate detection of the same object, helping to ensure the accuracy of the results.
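A compact sketch of this fusion and deduplication step (Equations (11) and (12)) is given below; the box format (x1, y1, x2, y2, score) and the helper names are illustrative.

```python
# Sketch of detection fusion: map boxes from cropped sub-images back to the
# original image coordinates (Eq. 11) and remove duplicates with IoU-based
# suppression (Eq. 12). Each local box is paired with the top-left offset
# (left_i, top_i) of the crop it came from.

def map_to_original(box, left, top):
    x1, y1, x2, y2, score = box
    return (x1 + left, y1 + top, x2 + left, y2 + top, score)

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def fuse_detections(global_boxes, local_boxes, crop_offsets, iou_thr=0.7):
    boxes = list(global_boxes)
    for box, (left, top) in zip(local_boxes, crop_offsets):
        boxes.append(map_to_original(box, left, top))
    # Greedy NMS: keep the highest-scoring box, drop overlaps above threshold.
    boxes.sort(key=lambda b: b[4], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b, k) <= iou_thr for k in kept):
            kept.append(b)
    return kept
```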

3.4. Maximum Density Region Search and Congestion Grading Methodology

To address issues such as inaccurate congestion area recognition and unclear congestion level boundaries as well as the need for postprocessing and visualization of detection results, this paper proposes an adaptive rectangular generation method based on regional compactness metrics. This method is optimized for the interpretability of detection results through a multi-stage filtering strategy. The core structure of the algorithm is illustrated in Figure 7:
The target linking module constructs a weighted complete graph $G = (V, E)$ in which the vertex set $V$ represents the detection targets and the edge weight $d_{ij}$ corresponds to the Euclidean distance between the centers of targets $i$ and $j$, as shown in Equation (13):
$$d_{ij} = \sqrt{ (x_i - x_j)^2 + (y_i - y_j)^2 } \tag{13}$$
A greedy strategy is employed for target linking. The target with the smallest global distance sum is initially selected as the seed node, and the minimum spanning tree is constructed using an improved version of Prim’s algorithm. The nearest neighbor nodes are iteratively expanded to form the linking sequence.
First, an initial node $r \in V$ is selected and the minimum spanning tree $T$ is constructed using the MST algorithm. In the region generation phase, a dynamic threshold mechanism is introduced. When the number of linked nodes reaches a predefined threshold (set to 30% of the total detections in this paper), the calculation of the minimum bounding rectangle is triggered.
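The linking procedure can be sketched as follows, assuming the detection centers are given as an (n, 2) array; the 30% linking ratio follows the text, while the tie-breaking and stopping details are simplified for illustration.

```python
# Sketch of greedy target linking: seed at the detection with the smallest
# total distance to all others, then repeatedly add the nearest unlinked
# detection (Prim-style expansion) until 30% of detections are linked.
import numpy as np

def link_targets(centers, ratio=0.3):
    n = len(centers)
    D = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)  # Eq. 13
    seed = int(D.sum(axis=1).argmin())          # smallest global distance sum
    linked = [seed]
    target_count = min(n, max(2, int(np.ceil(ratio * n))))
    while len(linked) < target_count:
        remaining = [i for i in range(n) if i not in linked]
        # Nearest remaining detection to the currently linked set.
        nearest = min(remaining, key=lambda j: D[linked, j].min())
        linked.append(nearest)
    # Axis-aligned minimum bounding rectangle of the linked detections.
    pts = centers[linked]
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    return linked, (float(x0), float(y0), float(x1), float(y1))
```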
This rectangle is defined as the smallest axis-aligned bounding box that encloses the selected nodes. For each candidate window $W_k$, its compactness is assessed using two distinct criteria:
Area Ratio ($R_p$):
$$R_p = \frac{\sum_{k=1}^{M} A_k}{W \times H} \tag{14}$$
This metric quantifies the density of detection boxes within the candidate window. It represents the ratio of the total area covered by all detection boxes to the area of the bounding box, where:
  • $A_k$ is the area of the $k$-th detection box (detected target area).
  • $W$ and $H$ respectively denote the width and height of the bounding box (density region area identified by the search).
Unique Pixel Rate ($R_u$):
$$R_u(W_k) = \frac{\left| \bigcup_{B_i \in W_k} \mathrm{Mask}(B_i) \right|}{S(W_k)} \tag{15}$$
This metric evaluates the spatial distribution characteristics of the target region by measuring the proportion of distinct target pixels:
  • Numerator $\left| \bigcup_{B_i \in W_k} \mathrm{Mask}(B_i) \right|$: Total pixels in the union of detection masks within $W_k$. Overlapping regions are eliminated through pixel-wise Boolean operations to avoid double-counting, thereby representing actual target coverage.
  • Denominator $S(W_k)$: Total pixels in the candidate window, corresponding to the maximum theoretical coverage area.
The $R_u$ metric quantifies the compactness of the valid target region within the window. A value approaching 1 indicates a denser distribution of targets. In the optimal window selection mechanism, $R_u$ and the pixel fill ratio $R_p$ together form a multi-objective optimization function, as expressed in Equation (16):
$$\mathrm{Score}(W_k) = R_p(W_k) + \lambda R_u(W_k) \tag{16}$$
where $R_p$ represents the overall coverage of the target within the window, calculated as the ratio of the sum of the areas of all detection boxes to the window’s area without accounting for overlap effects. The weighting coefficient $\lambda$ (set to 0.8 in this paper) governs the balance between the two metrics, with the algorithm prioritizing the uniqueness of the target distribution over mere area accumulation. This design is based on the prevalence of overlapping targets in dense scenes, where a higher $\lambda$ helps to suppress false positive detections caused by target stacking.
Finally, the score is quantified into ten levels, with a color scale from green to red generated via linear interpolation to represent varying degrees of congestion. This method effectively identifies target regions with spatial clustering characteristics and provides interpretable decision support for dense object detection scenarios.
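A sketch of the scoring and grading steps (Equations (14)–(16)) is shown below; detection boxes are assumed to be clipped to the candidate window, and normalizing the score by its nominal maximum of $1 + \lambda$ before quantization is an assumption, since the paper does not state how scores are mapped onto the ten levels.

```python
# Sketch of the compactness scoring (Eqs. 14-16) and ten-level color grading.
import numpy as np

def score_window(boxes, W, H, lam=0.8):
    """boxes: (x1, y1, x2, y2) detections, assumed clipped to the W x H window."""
    # R_p (Eq. 14): summed box areas over window area; overlaps count twice.
    r_p = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes) / (W * H)
    # R_u (Eq. 15): union of box masks over window area; overlaps count once.
    mask = np.zeros((int(H), int(W)), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = True
    r_u = mask.sum() / (W * H)
    return r_p + lam * r_u                               # Eq. (16)

def congestion_level(score, lam=0.8):
    """Quantize the score into ten levels with a green-to-red color ramp.
    Normalizing by (1 + lam) as the score's upper bound is an assumption."""
    level = int(np.clip(np.ceil(score / (1.0 + lam) * 10), 1, 10))
    t = (level - 1) / 9.0                                # 0 = green, 1 = red
    return level, (int(255 * t), int(255 * (1 - t)), 0)  # (R, G, B)
```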

4. Experiments and Results

4.1. Dataset Introduction

In terms of datasets, large remote sensing image datasets such as VisDrone [45], UAVDT [46], and UAV-BD [47] provide critical data for the development and evaluation of object detection algorithms [48]. The VisDrone [45] and UAVDT [46] datasets are widely utilized due to their extensive collection of traffic-related remote sensing images and the diverse set of detection targets characterized by varying sizes and categories. Figure 8 and Figure 9 illustrate the distribution of pixel sizes for traffic targets and the number of different categories. To demonstrate the algorithm’s effectiveness, this paper conducts experiments using these two datasets.
VisDrone [45]: This dataset serves as a benchmark. It consists of 10,209 static images sourced from various drone-mounted cameras. It covers a broad range of conditions, including different cities (spanning fourteen cities across thousands of kilometers), various environments including urban and rural settings, and a wide variety of objects such as pedestrians, vehicles, bicycles, and others. In addition, it contains scenes of varying densities, including both sparse and crowded environments. We used 6471 images from this dataset for training, 548 for validation, and 3190 for testing.
UAVDT [46]: This dataset includes a large number of aerial images. It is a challenging large-scale dataset specifically designed for drone videos. The images were captured by drones in various complex scenarios, focusing on vehicles. The dataset contains three categories (cars, trucks, and buses), with a resolution of approximately 1024 × 540 pixels. In this study, 29,789 images were used for training and 10,620 for testing.

4.2. Experimental Settings and Evaluation Indicators

For selection of the benchmark model, this study utilizes the YOLOv11 model, which has demonstrated excellent performance in small object detection. The YOLOv11l model pretrained on the MS COCO dataset [49] is used as the base, with YOLOv8n serving as the initial weights. The batch size per GPU is set to 16, with an initial learning rate of 0.001. The model is trained for 250 epochs on the original dataset and 20 additional epochs on the augmented dataset.
The experimental environment was configured as follows: the Central Processing Unit (CPU) was an Intel Xeon G6242 and the Graphics Processing Unit (GPU) was an NVIDIA A100 with 80 GB of onboard memory. The experiments were conducted using Python 3.8 and CUDA 11.7. This hardware and software configuration provided the necessary computational resources to support the effective execution of our experiments.
Consistent with mainstream evaluation standards, this study adopts the MS COCO evaluation metrics, utilizing six evaluation metrics: AP (average precision), AP50, AP75, APsmall, APmedium, and APlarge. AP represents the average precision across multiple IoU thresholds ranging from 0.50 to 0.95, with a step size of 0.05. Additionally, the AP values are computed separately for different object categories in order to assess the model’s performance across targets of varying scales.

4.3. Experimental Results

In this section, we analyze the object detection results and visualization performance of our proposed traffic density grading algorithm for remote sensing images based on adaptive density clustering and multi-stage training fusion using the VisDrone and UAVDT datasets. The visualization results across different scenes are shown in Figure 10 and Figure 11. As shown in Figure 11, the first row displays ground-truth density heatmaps generated with the kernel density estimation (KDE) function of Seaborn 0.13.2 applied to the ground-truth coordinates. The second row shows heatmaps created by processing the original COCO annotations. In this process, image dimensions and category information were retrieved, the center points of all bounding boxes were extracted, and a zero matrix matching the image dimensions was created. A value of 1 was assigned to each center point location, and the resulting map was then smoothed using a Gaussian filter and normalized to yield density values between 0 and 1. The third row presents the density regions and their stratification generated by our complete algorithm, demonstrating close alignment with the ground truth. The fourth row shows the visualization results obtained by using area as the sole evaluation metric. This approach yields relatively imprecise density region detection; furthermore, significant object overlaps can lead to false assessments of uncongested regions as dense, resulting in the threshold frequently being exceeded. The fifth row visualizes the results using object count as the evaluation metric. While this method achieves more precise density region detection, inconsistent object sizes and viewing perspectives make stratification assessments challenging. Figure 12 shows the cropping results of the adaptive density-based clustering pruning algorithm under dual spatial constraints when applied to the dataset images.
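For reference, the annotation-driven heatmap in the second row can be reproduced with a short routine of the kind sketched below; the Gaussian kernel width (sigma = 15 pixels) is an assumed value, as the paper does not report it.

```python
# Sketch of the annotation-based density heatmap: mark box centers, smooth
# with a Gaussian filter, and normalize to [0, 1]. Sigma is an assumed value.
import numpy as np
from scipy.ndimage import gaussian_filter

def density_heatmap(boxes_xywh, img_h, img_w, sigma=15):
    heat = np.zeros((img_h, img_w), dtype=np.float32)
    for x, y, w, h in boxes_xywh:                 # COCO-style [x, y, w, h]
        cx, cy = int(x + w / 2), int(y + h / 2)   # bounding-box center
        if 0 <= cy < img_h and 0 <= cx < img_w:
            heat[cy, cx] = 1.0
    heat = gaussian_filter(heat, sigma=sigma)     # Gaussian smoothing
    if heat.max() > 0:
        heat /= heat.max()                        # normalize to [0, 1]
    return heat
```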
Table 1 shows the AP results of different methods on the VisDrone dataset. It is evident that the model proposed in this paper consistently outperforms other detection models, achieving significant improvements across various backbone networks. Specifically, the proposed algorithm achieves an AP of 32.6% on the YOLOv11-based model, demonstrating excellent performance and significantly surpassing both the classic DMNet [14] and the original YOLO model. Furthermore, the AP at an Intersection over Union (IoU) threshold of 0.75 (AP75) shows an improvement of nearly five points compared to both DMNet and the original YOLO model, highlighting our algorithm’s robustness at higher IoU thresholds. This result confirms the effectiveness of the proposed model in high-precision tasks. Additionally, improvements in AP for small objects (APsmall) exceed four points at different scales, further emphasizing the significant impact of the proposed density map cropping strategy in small object detection.
As shown in Table 2, the proposed algorithm performs exceptionally well across all ten VisDrone categories when detecting targets of various scales and types. Compared to other methods, it achieves notable improvements in several categories. For example, the AP value in the pedestrian category reaches 31.5%, representing an eight-point increase over DMNet. In the car category, the AP value is 61.5%, a five-point improvement over DMNet. The model also exhibits strong detection performance for smaller targets such as tricycles and motorcycles, achieving AP values of 24.6% and 30.7%, respectively. These results underscore the model’s advantages in multi-scale object detection.
Table 3 presents the results from various methods on the UAVDT dataset. It is clear that general object detectors fall short of the results discussed in Table 2. Similar to the findings on VisDrone, the proposed algorithm significantly outperforms other benchmark detectors on UAVDT, achieving an AP of 16.2%. This further validates the efficacy of the proposed algorithm.
Regarding inference speed, the proposed model was evaluated on a single NVIDIA GeForce RTX 4090 GPU (Santa Clara, CA, USA). The inference latency for the YOLOv8L and YOLOv11L backbone models was measured at 26.2 ms/img and 16.2 ms/img, respectively.

4.4. Ablation Experiment

In this ablation experiment, multiple sets of tests were performed on the YOLOv11 benchmark model using the VisDrone2019 dataset to assess the impact of different training datasets and the application of data fusion on model performance. The results are presented in Table 4.
Although it introduces additional computational overhead, the dual-branch architecture significantly enhances object detection accuracy by effectively fusing global image information with localized details from cropped regions. Experimental results on the VisDrone2019 dataset demonstrate that this design achieves superior detection accuracy compared to density-guided neural network-based cropping methods while reducing inference time by approximately 4 ms (excluding the training overhead associated with density networks). When benchmarked against the baseline model, the proposed scheme delivers a relative accuracy improvement of 39.32% at a marginal latency cost of 8 ms.
The experimental conditions involved three scenarios: the original dataset, the augmented dataset, and the combination of the original and augmented datasets. Comparisons were made under both fusion and non-fusion conditions. Meanwhile, different cropping algorithms were added for comparison. The results, shown in Table 4, reveal that without data fusion, the model achieved an AP of 23.4% on the original dataset and 22.1% on the augmented dataset. Although the original dataset marginally outperformed the augmented dataset, the difference was minimal. However, when data fusion was implemented, the AP on the original dataset increased to 28.5% and the AP on the augmented dataset increased to 30.7%, demonstrating the significant enhancement in model performance due to data fusion, particularly on the augmented dataset. As shown in Figure 13, the incremental addition of each module contributes significantly to enhancing system performance.
Additionally, when the original and augmented datasets are combined for training, the model’s AP reaches 32.6%, representing the optimal performance across all experimental configurations. This demonstrates that utilizing both datasets together enables the model to learn from a broader range of samples, resulting in enhanced detection accuracy. This outcome further underscores the superiority of the proposed algorithm.

5. Discussion

This study presents an adaptive traffic density grading framework integrating dynamic density clustering and multi-scale feature fusion. The effectiveness of the proposed framework is validated on the VisDrone and UAVDT datasets. The proposed method achieves notable improvements in object detection performance, particularly on VisDrone, where the YOLOv11 model with combined original and augmented datasets yields a 32.6% AP. This outperforms existing methods such as DMNet and YOLOv8, especially in detecting small objects (APsmall reaching 24.3%), addressing key challenges in drone-captured imagery such as varying target scales and complex backgrounds.
This superior performance stems from the synergistic effects of core components. First, adaptive density clustering prunes irrelevant backgrounds to enhance focus on high-density traffic regions. This enhances small target localization, as confirmed by our ablation experiments showing advantages over simple equalization or MCNN-based clustering. Data augmentation and multi-stage fusion further boost generalization, with combined datasets expanding sample diversity and improving cross-category results (e.g., 8% higher AP for pedestrians than DMNet). Even on UAVDT, with its vehicular focus, the 16.2% AP and improved AP75 highlight enhanced adaptability and precise localization for closely spaced targets. The core algorithm code is available at https://github.com/xingmengen/Core_Algorithm.git (accessed on 10 August 2025).
Notably, the proposed framework balances performance and efficiency, with YOLOv11L achieving a 16.2 ms/img inference speed that is suitable for real-time applications. Limitations include untested performance in extreme weather or low-light conditions as well as a focus on static images rather than video. Future work will address these issues by expanding datasets to more challenging conditions, developing video-based extensions for temporal analysis, and integrating with cloud computing for large-scale traffic monitoring with the aim of advancing drone-based detection in intelligent transportation.

6. Conclusions

This study has proposed an adaptive traffic density grading framework integrating dynamic density clustering and multi-scale feature fusion. The proposed framework effectively addresses the limitations of geometric discontinuity and computational inefficiency inherent in traditional image segmentation methods. By constructing a synergistic multi-stage training and multi-scale inference architecture, the model’s feature representation capability in complex scenarios is significantly enhanced. Building upon this foundation, our dynamic density clustering algorithm achieves substantial computational savings compared to neural network-based segmentation methods. This is accomplished by establishing an intelligent mapping relationship between density clusters and optimal cropping regions. Furthermore, by incorporating pixel-level gradient quantization and density analysis techniques, the developed traffic state assessment model exhibits advantages in visual interpretability.
Next, we plan to develop and implement an innovative video-based traffic detection method with the aims of achieving fine-grained and lane-specific warnings for localized traffic flow, providing precise predictive information for autonomous driving, and accurately calculating post-congestion queue lengths to support traffic guidance. In the future, we will integrate this technology into existing traffic management systems, combining big data analytics and cloud computing to achieve comprehensive traffic condition monitoring and intelligent control. Tailored solutions will also be developed for various scenarios, including urban traffic, highways, and rural roads.
Through these efforts, we aim to promote the widespread application of video-based traffic detection technology in intelligent transportation, thereby contributing to the modernization and intelligence of traffic management. This will provide crucial technical support for alleviating traffic congestion and improving traffic safety and efficiency. We will continue to refine and innovate our detection methods and systems, striving for greater breakthroughs and achievements in intelligent transportation.

Author Contributions

Conceptualization, X.L. (Xin Liu) and Q.M.; methodology, X.L. (Xin Liu), S.L. and Q.M.; software, X.L. (Xin Liu), X.Z. and X.L. (Xinli Li); validation, X.L. (Xin Liu) and X.L. (Xinli Li); resources, X.L. (Xin Liu), Q.M. and X.Z.; data curation, X.L. (Xin Liu), S.L. and Q.M.; writing—original draft preparation, X.L. (Xin Liu), Q.M. and X.Z.; writing—review and editing, X.L. (Xin Liu) and Q.M.; visualization, X.L. (Xin Liu) and X.L. (Xinli Li). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Qinghai Province under grant number 2023-ZJ-989Q.

Data Availability Statement

The data used in this paper can be obtained from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shaygan, M.; Meese, C.; Li, W.; Zhao, X.G.; Nejad, M. Traffic prediction using artificial intelligence: Review of recent advances and emerging opportunities. Transp. Res. Part C Emerg. Technol. 2022, 145, 103921. [Google Scholar] [CrossRef]
  2. Zhang, K.; Chu, Z.; Xing, J.; Zhang, H.; Cheng, Q. Urban Traffic Flow Congestion Prediction Based on a Data-Driven Model. Mathematics 2023, 11, 4075. [Google Scholar] [CrossRef]
  3. Tang, F.; Fu, X.; Cai, M.; Lu, Y.; Zeng, Y.; Zhong, S.; Huang, Y.; Lu, C. Multilevel Traffic State Detection in Traffic Surveillance System Using a Deep Residual Squeeze-and-Excitation Network and an Improved Triplet Loss. IEEE Access 2020, 8, 114460–114474. [Google Scholar] [CrossRef]
  4. Macioszek, E.; Kurek, A. Extracting Road Traffic Volume in the City before and during COVID-19 through Video Remote Sensing. Remote Sens. 2021, 13, 2329. [Google Scholar] [CrossRef]
  5. Bisio, I.; Garibotto, C.; Haleem, H.; Lavagetto, F.; Sciarrone, A. A Systematic Review of Drone Based Road Traffic Monitoring System. IEEE Access 2022, 10, 101537–101555. [Google Scholar] [CrossRef]
  6. Ünlüleblebici, S.; Taşyürek, M.; Öztürk, C. Traffic Density Estimation using Machine Learning Methods. Int. J. Artif. Intell. Data Sci. 2021, 1, 136–143. [Google Scholar]
  7. Chakraborty, D.; Dutta, D.; Jha, C.S. Remote Sensing and Deep Learning for Traffic Density Assessment. In Geospatial Technologies for Resources Planning and Management; Jha, C.S., Pandey, A., Chowdary, V., Singh, V., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 611–630. [Google Scholar] [CrossRef]
  8. Nam, D.; Lavanya, R.; Jayakrishnan, R.; Yang, I.; Jeon, W.H. A Deep Learning Approach for Estimating Traffic Density Using Data Obtained from Connected and Autonomous Probes. Sensors 2020, 20, 4824. [Google Scholar] [CrossRef]
  9. Li, Y.; Ma, L.; Yang, S.; Fu, Q.; Sun, H.; Wang, C. Infrared Image-Enhancement Algorithm for Weak Targets in Complex Backgrounds. Sensors 2023, 23, 6215. [Google Scholar] [CrossRef]
  10. Singh, M.P.; Gayathri, V.; Chaudhuri, D. A Simple Data Preprocessing and Postprocessing Techniques for SVM Classifier of Remote Sensing Multispectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7248–7262. [Google Scholar] [CrossRef]
  11. Guo, Y.; Wu, C.; Du, B.; Zhang, L. Density Map-based vehicle counting in remote sensing images with limited resolution. ISPRS J. Photogramm. Remote Sens. 2022, 189, 201–217. [Google Scholar] [CrossRef]
  12. Unel, F.O.; Ozkalayci, B.O.; Cigla, C. The Power of Tiling for Small Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019; pp. 582–591. [Google Scholar] [CrossRef]
  13. Zhang, X.; Feng, Y.; Zhang, S.; Wang, N.; Mei, S. Finding Nonrigid Tiny Person With Densely Cropped and Local Attention Object Detector Networks in Low-Altitude Aerial Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4371–4385. [Google Scholar] [CrossRef]
  14. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density Map Guided Object Detection in Aerial Images. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 737–746. [Google Scholar] [CrossRef]
  15. Mei, S.; Chen, X.; Zhang, Y.; Li, J.; Plaza, A. Accelerating Convolutional Neural Network-Based Hyperspectral Image Classification by Step Activation Quantization. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  16. Zhang, X.; Izquierdo, E.; Chandramouli, K. Dense and Small Object Detection in UAV Vision Based on Cascade Network. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 118–126. [Google Scholar] [CrossRef]
  17. Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  18. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 21–23 September 2005; pp. 886–893. [Google Scholar] [CrossRef]
  19. Mu, K.; Hui, F.; Zhao, X. Multiple Vehicle Detection and Tracking in Highway Traffic Surveillance Video Based on SIFT Feature Matching. J. Inf. Process. Syst. 2016, 12, 183–195. [Google Scholar] [CrossRef]
  20. Greeshma, K.; Gripsy, J.V. Image classification using HOG and LBP feature descriptors with SVM and CNN. Int. J. Eng. Res. Technol. 2020, 8, 1–4. [Google Scholar]
  21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, Proceedings of the NeurIPS 2012, Lake Tahoe, NV, USA, 3–8 December 2012; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: New York, NY, USA, 2012; Volume 25. [Google Scholar]
  22. Mei, S.; Jiang, R.; Ma, M.; Song, C. Rotation-Invariant Feature Learning via Convolutional Neural Network With Cyclic Polar Coordinates Convolutional Layer. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5600713. [Google Scholar] [CrossRef]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems, Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  25. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  26. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  27. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar] [CrossRef]
  28. Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient DETR: Improving End-to-End Object Detector with Dense Prior. arXiv 2021, arXiv:2104.01318. [Google Scholar]
  29. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Darrell, T.; Saenko, K. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar] [CrossRef]
  30. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  31. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; Volume 97, pp. 6105–6114. [Google Scholar]
  32. Zhang, X.; Feng, Y.; Zhang, S.; Wang, N.; Lu, G.; Mei, S. Robust Aerial Person Detection With Lightweight Distillation Network for Edge Deployment. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  33. Li, J.; Guo, S.; Yi, S.; He, R.; Jia, Y. DMCTDet: A density map-guided composite transformer network for object detection of UAV images. Signal Process. Image Commun. 2025, 136, 117284. [Google Scholar] [CrossRef]
  34. Shorewala, S.; Ashfaque, A.; Sidharth, R.; Verma, U. Weed Density and Distribution Estimation for Precision Agriculture Using Semi-Supervised Learning. IEEE Access 2021, 9, 27971–27986. [Google Scholar] [CrossRef]
  35. Wu, Q.; Zhou, Y.; Wu, X.; Liang, G.; Ou, Y.; Sun, T. Real-time running detection system for UAV imagery based on optical flow and deep convolutional networks. IET Intell. Transp. Syst. 2020, 14, 278–287. [Google Scholar] [CrossRef]
  36. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. arXiv 2018, arXiv:1802.10062. [Google Scholar] [CrossRef]
  37. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597. [Google Scholar] [CrossRef]
  38. Ding, X.; He, F.; Lin, Z.; Wang, Y.; Guo, H.; Huang, Y. Crowd Density Estimation Using Fusion of Multi-Layer Features. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4776–4787. [Google Scholar] [CrossRef]
  39. Yang, L.; Guo, Y.; Sang, J.; Wu, W.; Wu, Z.; Liu, Q.; Xia, X. A crowd counting method via density map and counting residual estimation. Multimed. Tools Appl. 2022, 81, 43503–43512. [Google Scholar] [CrossRef]
  40. Wang, Y.; Zou, Y.X.; Chen, J.; Huang, X.; Cai, C. Example-based visual object counting with a sparsity constraint. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 June 2016; pp. 1–6. [Google Scholar] [CrossRef]
  41. Kumar, K.N.; Roy, D.; Suman, T.A.; Vishnu, C.; Mohan, C.K. TSANet: Forecasting traffic congestion patterns from aerial videos using graphs and transformers. Pattern Recognit. 2024, 155, 110721. [Google Scholar] [CrossRef]
  42. Jiang, S.; Feng, Y.; Zhang, W.; Liao, X.; Dai, X.; Onasanya, B.O. A New Multi-Branch Convolutional Neural Network and Feature Map Extraction Method for Traffic Congestion Detection. Sensors 2024, 24, 4272. [Google Scholar] [CrossRef]
  43. Li, L.; Coskun, S.; Wang, J.; Fan, Y.; Zhang, F.; Langari, R. Velocity Prediction Based on Vehicle Lateral Risk Assessment and Traffic Flow: A Brief Review and Application Examples. Energies 2021, 14, 3431. [Google Scholar] [CrossRef]
  44. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, Portland, OR, USA, 2–4 August 1996; AAAI Press: Washington, DC, USA, 1996; pp. 226–231. [Google Scholar]
  45. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
  46. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  47. Wang, J.; Guo, W.; Pan, T.; Yu, H.; Duan, L.; Yang, W. Bottle Detection in the Wild Using Low-Altitude Unmanned Aerial Vehicles. In Proceedings of the 21st International Conference on Information Fusion (FUSION), Cambridge, UK, 10–13 July 2018; pp. 439–444. [Google Scholar] [CrossRef]
  48. Waqas Zamir, S.; Arora, A.; Gupta, A.; Khan, S.; Sun, G.; Shahbaz Khan, F.; Zhu, F.; Shao, L.; Xia, G.S.; Bai, X. iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar] [CrossRef]
  49. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]
Figure 1. The left-hand image shows the rigid grid segmentation technique, while the image on the right shows the generated density field.
Figure 2. Overall framework of the remote sensing image traffic density grading algorithm based on adaptive clustering and multi-scale fusion.
Figure 3. Overall framework of the traffic density grading algorithm for remote sensing images based on adaptive density clustering and multi-level training fusion.
Figure 4. (a) A set of data points, with the red point x representing a core point and its ϵ-neighborhood; (b) the density reachability between points y and z.
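As a point of reference for the clustering terminology in Figure 4, the following minimal Python sketch (using scikit-learn's DBSCAN [44], with illustrative eps and min_samples values rather than the parameters used in this work) shows how core, border, and noise points are distinguished for a set of 2-D vehicle-centre coordinates.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative 2-D points (e.g., vehicle centre coordinates in image space).
points = np.array([[10, 12], [11, 13], [12, 11],
                   [50, 52], [51, 53], [90, 10]], dtype=float)

# eps is the neighbourhood radius (the epsilon in Figure 4); min_samples is the
# minimum number of neighbours (including the point itself) for a core point.
db = DBSCAN(eps=5.0, min_samples=3).fit(points)

core_mask = np.zeros(len(points), dtype=bool)
core_mask[db.core_sample_indices_] = True

for i, (label, is_core) in enumerate(zip(db.labels_, core_mask)):
    role = "core" if is_core else ("noise" if label == -1 else "border")
    print(f"point {i}: cluster {label}, role {role}")
```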
Figure 5. (a,b) show two different scenes. Threshold values of α = 0.3, 0.5, and 0.7 were assessed to compare the segmentation effect. As shown in the figure, α = 0.5 achieves the best balance, effectively preserving dense regions while removing extraneous background areas.
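To make the role of the threshold α in Figure 5 concrete, the sketch below thresholds a min-max-normalised density map at α to keep dense regions and discard background. It is an illustrative simplification, not the paper's exact segmentation procedure.

```python
import numpy as np

def keep_dense_regions(density_map: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Return a binary mask of dense regions.

    The density map is min-max normalised to [0, 1] and thresholded at alpha,
    mirroring the comparison of alpha = 0.3 / 0.5 / 0.7 in Figure 5.
    """
    d = density_map.astype(float)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # normalise to [0, 1]
    return d >= alpha

# Example: a synthetic 4x4 density map with one dense corner.
demo = np.array([[0.9, 0.8, 0.1, 0.0],
                 [0.7, 0.6, 0.1, 0.0],
                 [0.1, 0.1, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0]])
print(keep_dense_regions(demo, alpha=0.5).astype(int))
```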
Figure 6. Collaborative optimization framework integrating multi-stage training with multi-scale inference.
Figure 7. Overall framework of maximum density region search and congestion grading algorithm.
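The framework in Figure 7 searches for the densest region and then grades congestion. A minimal sliding-window version of that idea is sketched below; the window size, stride, and grading thresholds are hypothetical and only illustrate the principle.

```python
import numpy as np

def max_density_window(centres, img_w, img_h, win=256, stride=64):
    """Slide a win x win window over the image and return the window
    (x, y, count) containing the most vehicle centres."""
    centres = np.asarray(centres, dtype=float)
    best = (0, 0, 0)
    for y in range(0, max(img_h - win, 0) + 1, stride):
        for x in range(0, max(img_w - win, 0) + 1, stride):
            inside = ((centres[:, 0] >= x) & (centres[:, 0] < x + win) &
                      (centres[:, 1] >= y) & (centres[:, 1] < y + win))
            count = int(inside.sum())
            if count > best[2]:
                best = (x, y, count)
    return best

def grade_congestion(count, light=10, heavy=30):
    """Map a vehicle count to a congestion level; thresholds are hypothetical."""
    if count < light:
        return "free-flow"
    if count < heavy:
        return "moderate"
    return "congested"

x, y, n = max_density_window([(120, 80), (130, 90), (135, 85), (600, 400)], 1024, 768)
print(x, y, n, grade_congestion(n))
```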
Figure 8. Scatter plot of combined pixel size distribution for traffic targets from VisDrone and UAVDT.
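A scatter plot such as Figure 8 can be reproduced from the dataset annotations by plotting bounding-box width against height; the sizes in the sketch below are placeholders, not values from VisDrone or UAVDT.

```python
import matplotlib.pyplot as plt

# Placeholder bounding-box sizes in pixels (width, height), e.g. parsed from
# VisDrone/UAVDT annotation files.
sizes = [(14, 10), (22, 18), (35, 27), (60, 45), (9, 7), (120, 80)]

widths, heights = zip(*sizes)
plt.scatter(widths, heights, s=8, alpha=0.6)
plt.xlabel("Bounding-box width (px)")
plt.ylabel("Bounding-box height (px)")
plt.title("Pixel size distribution of traffic targets")
plt.show()
```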
Figure 9. Distribution of category counts in the VisDrone dataset.
Figure 10. (a) Detection results of YOLOv8; (b) detection results of YOLOv11; (c) detection results of the proposed algorithm. The green bounding boxes highlight the advantages of the proposed method.
Figure 11. (a,b) depict the ground-truth density heat maps generated from annotated data; (c) illustrates the high-density regions identified by the proposed algorithm; (d) presents the density regions and their stratification derived solely from area metrics; (e) shows the density regions obtained from count metrics.
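Ground-truth density heat maps like those in Figure 11a,b are commonly generated by placing a Gaussian kernel at each annotated target centre, as in MCNN-style counting [37]. The sketch below illustrates this standard construction; the kernel width sigma is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_heatmap(centres, img_h, img_w, sigma=4.0):
    """Place a unit impulse at each annotated centre and blur with a Gaussian,
    a common way to build ground-truth density maps (sigma is illustrative)."""
    heat = np.zeros((img_h, img_w), dtype=float)
    for x, y in centres:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < img_h and 0 <= xi < img_w:
            heat[yi, xi] += 1.0
    return gaussian_filter(heat, sigma=sigma)

heat = density_heatmap([(40, 30), (42, 33), (200, 150)], img_h=256, img_w=256)
print(heat.sum())  # approximately the number of annotated targets
```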
Figure 12. The first row shows the cropping effect of the proposed algorithm, and the second row shows the cropping effect after density clustering via MCNN [37]; the blue area represents the high-density region, i.e., the optimal cropping region. Background areas with few targets are removed, and a local dataset is then constructed from the high-density regions for model fine-tuning.
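The cropping illustrated in Figure 12 keeps the high-density (blue) region and discards sparse background. A minimal sketch of such a crop, taking the bounding rectangle of a binary high-density mask with a small padding margin (the padding value is illustrative), is given below.

```python
import numpy as np

def crop_dense_region(image: np.ndarray, dense_mask: np.ndarray, pad: int = 16):
    """Crop the axis-aligned bounding rectangle of the high-density mask,
    with a small padding margin."""
    ys, xs = np.where(dense_mask)
    if len(ys) == 0:
        return image  # no dense region found; keep the full image
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad + 1, image.shape[0])
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad + 1, image.shape[1])
    return image[y0:y1, x0:x1]

img = np.zeros((512, 512, 3), dtype=np.uint8)
mask = np.zeros((512, 512), dtype=bool)
mask[100:200, 150:300] = True
print(crop_dense_region(img, mask).shape)  # (132, 182, 3) with pad=16
```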
Figure 13. Visualization of the results from the ablation experiment.
Table 1. Comparison of detection results on the VisDrone dataset.
Benchmark Model | Training Dataset | Whether to Merge | AP (%) | AP50 (%) | AP75 (%) | APsmall (%) | APmid (%) | APlarge (%)
--- | --- | --- | --- | --- | --- | --- | --- | ---
FastRCNN | – | – | 18.6 | 33.6 | 17.9 | 10.1 | 28.8 | 40.3
DINO | – | – | 30.5 | 52.4 | 29.8 | 22.0 | 41.0 | 48.9
RTMDet | – | – | 15.4 | 26.2 | 15.6 | 7.4 | 24.5 | 33.6
RetinaNet | – | – | 10.7 | 17.6 | 10.5 | 5.9 | 18.0 | 23.7
VFNet | – | – | 27.5 | 43.3 | 29.0 | 18.1 | 39.8 | 56.5
DETR | – | – | 19.5 | 34.9 | 18.9 | 11.4 | 28.8 | 45.2
DMNet | – | – | 26.8 | 43.9 | 29.6 | 19.6 | 38.7 | 50.9
YOLOv8 | original | N | 20.9 | 32.1 | 22.7 | 11.1 | 34.3 | 57.5
YOLOv8 | original | Y | 21.7 | 33.6 | 23.3 | 11.7 | 35.2 | 54.3
YOLOv8 | original+Enhancement | N | 30.0 | 46.2 | 32.4 | 22.4 | 40.7 | 57.2
YOLOv8 | original+Enhancement | Y | 31.2 | 47.9 | 33.2 | 23.4 | 42.2 | 55.2
YOLOv11 | original | N | 23.4 | 35.3 | 25.0 | 13.6 | 36.4 | 50.9
YOLOv11 | original+Enhancement | Y | 32.6 | 49.7 | 35.1 | 24.3 | 44.4 | 56.9
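The AP, AP50, AP75, and size-specific AP columns in Tables 1–3 follow COCO-style evaluation [49]. Assuming ground truth and detections are exported in COCO JSON format (the file names below are placeholders), these metrics can be computed with pycocotools as follows.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO-format ground truth and detection results.
coco_gt = COCO("visdrone_val_coco.json")
coco_dt = coco_gt.loadRes("detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, and AP for small/medium/large objects
```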
Table 2. Comparison of detection results on each category of the VisDrone dataset.
Benchmark Model | Training Dataset | Whether to Merge | Pedestrian (%) | People (%) | Bicycle (%) | Car (%) | Van (%) | Truck (%) | Tricycle (%) | Awning-Tricycle (%) | Bus (%) | Motor (%)
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
FastRCNN | – | – | 18.1 | 10.0 | 5.5 | 48.2 | 23.2 | 22.1 | 10.1 | 5.2 | 27.7 | 15.1
DINO | – | – | 30.1 | 20.6 | 14.6 | 60.7 | 38.2 | 29.5 | 22.0 | 13.1 | 47.4 | 28.9
RTMDet | – | – | 10.6 | 6.9 | 2.1 | 48.8 | 24.5 | 14.7 | 9.6 | 7.2 | 28.8 | 11.7
RetinaNet | – | – | 8.9 | 4.2 | 2.3 | 43.2 | 15.6 | 9.3 | 3.3 | 2.7 | 10.3 | 9.3
VFNet | – | – | 24.6 | 13.6 | 11.8 | 58.0 | 36.2 | 29.6 | 20.6 | 11.0 | 45.9 | 24.1
DETR | – | – | 19.9 | 13.2 | 7.2 | 47.3 | 24.7 | 19.6 | 12.3 | 5.9 | 24.9 | 19.7
DMNet | – | – | 23.3 | 15.5 | 13.8 | 56.8 | 37.5 | 29.7 | 22.4 | 12.4 | 46.9 | 24.7
YOLOv8 | original | N | 17.4 | 11.1 | 5.6 | 51.3 | 26.2 | 22.6 | 12.0 | 8.8 | 36.5 | 16.8
YOLOv8 | original | Y | 18.7 | 11.6 | 7.1 | 51.8 | 24.3 | 25.1 | 13.6 | 7.5 | 39.6 | 17.6
YOLOv8 | original+Enhancement | N | 29.9 | 19.5 | 12.7 | 60.9 | 36.6 | 31.1 | 19.9 | 12.7 | 49.1 | 27.3
YOLOv8 | original+Enhancement | Y | 30.7 | 20.3 | 14.7 | 61.3 | 35.1 | 33.4 | 22.2 | 12.7 | 52.5 | 28.5
YOLOv11 | original | N | 20.1 | 10.6 | 7.4 | 55.2 | 29.3 | 25.5 | 15.5 | 7.1 | 42.1 | 20.8
YOLOv11 | original+Enhancement | Y | 31.5 | 20.7 | 15.9 | 61.5 | 39.4 | 35.6 | 24.6 | 13.6 | 52.1 | 30.7
Table 3. Comparison of detection performance on the UAVDT dataset.
Benchmark Model | Training Dataset | Whether to Merge | AP (%) | AP50 (%) | AP75 (%) | APsmall (%) | APmid (%) | APlarge (%)
--- | --- | --- | --- | --- | --- | --- | --- | ---
FastRCNN | – | – | 6.2 | 16.8 | 2.7 | 4.5 | 14.6 | 10.2
DINO | – | – | 14.2 | 24.7 | 15.7 | 8.6 | 23.2 | 31.8
RTMDet | – | – | 14.7 | 25.3 | 14.1 | 8.7 | 20.3 | 32.1
RetinaNet | – | – | 15.4 | 26.6 | 14.4 | 9.2 | 21.5 | 31.2
VFNet | – | – | 15.4 | 27.0 | 17.6 | 10.3 | 22.4 | 37.8
DETR | – | – | 13.4 | 24.1 | 13.7 | 8.9 | 19.9 | 32.3
DMNet | – | – | 12.6 | 22.8 | 14.9 | 8.1 | 24.1 | 34.7
YOLOv8 | original | N | 14.0 | 23.7 | 14.9 | 8.2 | 24.7 | 30.2
YOLOv8 | original | Y | 15.0 | 26.5 | 15.2 | 8.6 | 26.4 | 30.6
YOLOv8 | original+Enhancement | N | 12.5 | 21.5 | 13.5 | 7.4 | 22.9 | 33.9
YOLOv8 | original+Enhancement | Y | 16.1 | 27.6 | 17.2 | 10.1 | 26.7 | 34.4
YOLOv11 | original | N | 14.6 | 25.7 | 15.2 | 10.6 | 26.3 | 33.2
YOLOv11 | original+Enhancement | Y | 16.2 | 26.5 | 17.7 | 10.7 | 25.6 | 39.7
Table 4. Impact of different modules on accuracy.
Benchmark Model | Cropping Method | Training Dataset | Whether to Merge | AP (%) | APsmall (%) | APmid (%) | APlarge (%) | Processing Speed (ms/img)
--- | --- | --- | --- | --- | --- | --- | --- | ---
YOLOv11 | Image equalization (four pieces) | original+Enhancement | Y | 19.9 | 12.9 | 28.4 | 39.9 | 15.0
YOLOv11 | Image equalization (six pieces) | original+Enhancement | Y | 19.4 | 14.9 | 26.3 | 37.3 | 15.0
YOLOv11 | MCNN | original+Enhancement | Y | 17.5 | 10.9 | 24.9 | 40.7 | 20.5
YOLOv11 | The cropping algorithm proposed in this paper | original | N | 23.4 | 13.6 | 36.4 | 50.9 | 8.0
YOLOv11 | The cropping algorithm proposed in this paper | Enhancement | N | 22.1 | 11.3 | 36.7 | 56.3 | 8.0
YOLOv11 | The cropping algorithm proposed in this paper | original | Y | 28.5 | 20.8 | 38.5 | 50.6 | 16.2
YOLOv11 | The cropping algorithm proposed in this paper | Enhancement | Y | 30.7 | 23.3 | 43.1 | 55.1 | 16.2
YOLOv11 | The cropping algorithm proposed in this paper | original+Enhancement | Y | 32.6 | 24.3 | 44.4 | 56.9 | 16.2
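The processing-speed column in Table 4 is reported in milliseconds per image. A generic way to measure such a figure, assuming a callable that runs the full pipeline on one image (the dummy pipeline and inputs below are placeholders), is sketched here.

```python
import time

def average_latency_ms(process_image, images, warmup=5):
    """Average per-image processing time in milliseconds.

    process_image: any callable that runs the full pipeline on one image.
    A few warm-up runs are discarded to avoid start-up overhead.
    """
    for img in images[:warmup]:
        process_image(img)
    start = time.perf_counter()
    for img in images:
        process_image(img)
    return 1000.0 * (time.perf_counter() - start) / len(images)

# Example with a dummy pipeline and dummy "images".
print(f"{average_latency_ms(lambda img: sum(img), [list(range(1000))] * 50):.2f} ms/img")
```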