1. Introduction
Remote sensing technology is playing an increasingly vital role in modern science and technology, with applications spanning various domains, such as earth observation [1], environmental monitoring [2], and resource management [3]. Remote sensing target identification, as a core application of remote sensing technology, is essential for the extraction and analysis of target information from remote sensing images. With the rapid growth of remote sensing data and improvements in data processing capabilities, efficiently and accurately identifying and locating targets in large-scale remote sensing images has become a pressing challenge for researchers. In the application of uncrewed aerial vehicle (UAV) remote sensing images, Pan et al. [4] collected multispectral images acquired by UAVs and extracted spatial, spectral, and textural features. They then employed the Support Vector Machine (SVM) algorithm to classify pixels or regions in the images as potholes, cracks, or undamaged road surfaces. Bejiga et al. [5] proposed a method utilizing UAVs equipped with visual cameras to support search-and-rescue (SAR) operations after avalanches: images captured by the UAVs were processed by a pre-trained convolutional neural network (CNN) for feature extraction, followed by a linear SVM for detecting objects of interest.
In recent years, breakthroughs have been achieved in computer vision, particularly with advances in deep learning. Target-detection algorithms based on CNNs have achieved remarkable results in various domains. For instance, attention mechanisms, multi-scale fusion, and cross-domain object relationship modeling can further improve the performance of detectors on unmanned aerial vehicle images. Zhang et al. [6] proposed the Depthwise-separable Attention-Guided Network (DAGN), a framework designed for real-time vehicle detection in UAV remote sensing images. Li et al. [7] modified YOLOv5 by removing the second convolutional layer in the BottleNeckCSP structure and directly merging three-quarters of its input channels with the output of the first branch, reducing the number of model parameters; adding a 3 × 3 max-pooling layer in the spatial pyramid pooling (SPP) module also enlarged the model's receptive field.
UAVs have revolutionized remote sensing by providing versatile platforms for data acquisition in various domains. Among the many tasks facilitated by UAV-captured imagery, pedestrian detection stands out as a crucial task with profound implications for safety, security, and surveillance in diverse scenarios [8,9]. However, the integration of pedestrian detection capabilities into UAV systems is beset by challenges stemming from the unique characteristics of UAV remote sensing imaging technology.
One of the primary challenges of UAV remote sensing imaging technology lies in the inherent variability of data acquisition conditions. Unlike fixed surveillance cameras, UAVs capture images from varying altitudes, angles, and distances, leading to significant geometric distortions and scale variations in the acquired imagery [10]. These variations pose substantial challenges for pedestrian detection algorithms, which must adapt to the diverse spatial resolutions and perspectives encountered in UAV imagery. Hong et al. [11] proposed a scale selection pyramid network (SSPNet) for detecting tiny objects, focusing on identifying small-scale persons in large-scale images obtained from UAVs. SSPNet consists of three main components: a context attention module for contextual information integration, a scale enhancement module to highlight features at specific scales, and a scale selection module for appropriate feature sharing between deep and shallow layers.
Shao [12] proposed another solution, Aero-YOLO, a lightweight UAV image target-detection method based on YOLOv8. Aero-YOLO employs GSConv and C3 modules to reduce the number of model parameters while extending the receptive field. Additionally, coordinate attention (CA) [13] and shuffle attention (SA) [14] are introduced to improve feature extraction, which is particularly beneficial for detecting small vehicles from a UAV perspective.
Furthermore, the dynamic nature of the environments in which UAVs operate introduces additional complexities for pedestrian detection. UAVs are often deployed in urban, rural, and natural environments characterized by diverse lighting conditions, weather phenomena, and terrain features. These environmental factors can adversely affect the visibility and appearance of pedestrians in the captured imagery, hindering detection accuracy. Wang [15] therefore proposed a pedestrian detection method for UAVs in low-light environments that merges visible and infrared images using a U-type GAN. A convolutional block attention module was introduced to enhance pedestrian information in the fused images, alongside spatial- and channel-domain attention mechanisms for generating detailed color fusion images. A YOLOv3 model with SPP and transfer learning was then trained on the fused images. Moreover, the limited payload capacity and power constraints of UAVs impose restrictions on the sensing modalities and hardware configurations that can be deployed for pedestrian detection tasks.
Balancing the trade-off between computational efficiency, detection accuracy, and energy consumption is a critical consideration in designing effective pedestrian detection systems for UAVs. Giovanna [16] introduced a lightweight CNN designed for crowd counting in UAV-captured images. The network utilizes a multi-input model trained on both real-world images and corresponding synthetic "crowd heatmaps" to focus on crucial image components. During inference, the synthetic input path is disregarded, resulting in a single-view model optimized for resource-constrained devices.
The existing methods, although notable, exhibit limitations in effectively detecting small-scale pedestrians in UAV remote sensing images, primarily due to challenges in spatial resolution and scale variation. Moreover, they struggle to adapt to the dynamic environmental conditions inherent in UAV operations, impacting the detection accuracy across diverse lighting and weather scenarios. Resource constraints pose further challenges, with existing lightweight architectures often failing to optimize the balance between detection accuracy, computational efficiency, and energy consumption. Additionally, shortcomings persist in feature representation and fusion techniques, particularly in low-light environments, hindering pedestrian detection accuracy across varying conditions.
Therefore, an improved lightweight YOLOv5-based method for remote sensing pedestrian detection is proposed in this paper to overcome the shortcomings of traditional algorithms and enhance the accuracy and efficiency of target detection. This research focuses on two key issues: first, enhancing the YOLOv5 algorithm to adapt to the characteristics of remote sensing images, thus improving the performance and robustness of remote sensing target identification; and second, designing a lightweight model that reduces the number of model parameters and the computational load while maintaining recognition accuracy, thus meeting the real-time and efficiency requirements of practical applications.
In summary, our main contributions are as follows:
GhostNet is integrated with YOLOv5 to reduce the number of model parameters, achieving a lightweight model.
To enhance the model's detection of small objects in UAV remote sensing images, traditional strided convolutions and pooling layers in the CNN are replaced with SPD-Conv.
An attention mechanism module is added to the model to assist in accurately locating targets in the image.
The remainder of this paper is organized as follows: Section 1 introduces the application of remote sensing technology and the challenges of pedestrian detection in UAV remote sensing images. Section 2 provides a brief overview of the development of object detection techniques, focusing on small object detection and lightweight networks. Section 3 constructs a lightweight detection network based on YOLOv5 and explains the improvements in detail. Section 4 systematically demonstrates the effectiveness of the proposed enhancements through experiments. Section 5 discusses the results and compares the proposed model with other mainstream models. Finally, we summarize our work and outline prospects for future research.
3. Materials and Methods
3.1. GSC-YOLO Network Framework
To address the large parameter count and low recognition accuracy of object detection models for UAV remote sensing images, we propose a lightweight model called GSC-YOLO (GhostNet, SPD-Conv, and coordinate attention YOLOv5), a fusion of YOLOv5s and GhostNet. The structure of GSC-YOLO is illustrated in Figure 1.
All CBS and C3 modules in the YOLOv5 network (Figure 2) are replaced with GhostCBS and C3Ghost modules to reduce the parameter count (as shown in Figure 3). A space-to-depth step is then incorporated after GhostCBS in the backbone to enhance the network's feature extraction capability. Finally, an attention mechanism is introduced, allowing the model to focus on important areas of the image by learning attention weights. These changes enable GSC-YOLO to detect and localize objects in complex scenes more accurately than YOLOv5s.
3.2. GhostNet
GhostNet [49] is a lightweight CNN architecture designed to provide efficient and accurate image classification. It is built around the Ghost Module (as shown in Figure 4), which reduces network computation and parameter count via low-cost operations, so that GhostNet retains strong model capacity while remaining lightweight. The Ghost Module, as the core component of GhostNet, first compresses the number of channels in the input feature map using regular convolutions. It then generates additional feature maps through cheap linear transformations and concatenates the two sets of feature maps to form the output. This operation significantly reduces the computational cost of the model. Specifically, it works as follows:
If there is an input feature map of size h × w × c and we want to obtain an output of size h′ × w′ × n using a convolutional kernel of size k × k, then the computational cost of one regular convolution is

cost₁ = h′ × w′ × n × k × k × c  (1)

In Equation (1), cost₁ represents the number of computations required to complete one convolution. h and w denote the height and width of the input feature map, respectively, while c represents the number of input channels. Similarly, h′ and w′ represent the height and width of the output feature map, respectively, and n represents the number of output feature channels. k represents the size of the convolutional kernel.
If the linear operation within GhostConv is set as a depthwise separable convolution with a d × d kernel, then the computational cost required for performing this convolution operation is

cost₂ = (n/s) × h′ × w′ × k × k × c + (s − 1) × (n/s) × h′ × w′ × d × d  (2)

where 1/s represents the proportion of channels that undergo regular convolution relative to the total number of output feature channels in the entire convolution operation, and h′ and w′ represent the height and width, respectively, of the output feature map after GhostConv.
If we replace regular convolution with GhostConv, the compression ratio of computational cost between these two operations is theoretically

r = cost₁ / cost₂ = (n × h′ × w′ × k × k × c) / ((n/s) × h′ × w′ × k × k × c + (s − 1) × (n/s) × h′ × w′ × d × d) ≈ (s × c) / (c + s − 1)  (3)

r ≈ s  (4)

In Equation (3), r represents the compression ratio of computational cost between regular convolution and GhostConv; the approximation holds because the kernel size d of the linear operation is comparable to k. Equation (4) provides the simplified compression ratio when c is much larger than s.
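As a concrete illustration of Equations (1)–(4), the following is a minimal PyTorch sketch of a Ghost convolution; the class name GhostConv, the BatchNorm/SiLU choices, and the default parameters are illustrative assumptions rather than the exact implementation used in GSC-YOLO.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Minimal Ghost convolution: n/s 'intrinsic' feature maps from a regular
    convolution, plus cheap d x d depthwise 'ghost' maps (ratio s)."""
    def __init__(self, c_in, c_out, k=1, s=2, d=3):
        super().__init__()
        c_primary = c_out // s                # channels from the regular conv (n/s)
        c_ghost = c_out - c_primary           # channels from the cheap operation
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_primary), nn.SiLU())
        # d x d depthwise convolution as the cheap linear operation
        self.cheap = nn.Sequential(
            nn.Conv2d(c_primary, c_ghost, d, padding=d // 2,
                      groups=c_primary, bias=False),
            nn.BatchNorm2d(c_ghost), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)                          # intrinsic feature maps
        return torch.cat([y, self.cheap(y)], dim=1)  # concatenate ghost maps

x = torch.randn(1, 64, 80, 80)
print(GhostConv(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```

With s = 2, half of the output channels are produced by the cheap depthwise operation, which is the source of the roughly s-fold cost reduction in Equation (4).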
GhostNet is composed of several Ghost Bottlenecks (Figure 5), which use DWConv2d (depthwise convolution) and GhostConv as the basic convolution units. Ghost Bottlenecks are further classified into two structures: stride = 1 and stride = 2.
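Building on the GhostConv sketch above, the following is a corresponding sketch of a Ghost Bottleneck; the normalization placement and the composition of the shortcut branch are assumptions modeled on the two variants in Figure 5.

```python
import torch
import torch.nn as nn
# reuses the GhostConv class from the previous sketch

class GhostBottleneck(nn.Module):
    """Ghost Bottleneck: GhostConv expand -> optional stride-2 depthwise
    conv (DWConv2d) -> GhostConv project, with a shortcut connection."""
    def __init__(self, c_in, c_out, c_mid, stride=1):
        super().__init__()
        layers = [GhostConv(c_in, c_mid)]
        if stride == 2:  # the stride = 2 variant downsamples with a depthwise conv
            layers += [nn.Conv2d(c_mid, c_mid, 3, stride=2, padding=1,
                                 groups=c_mid, bias=False),
                       nn.BatchNorm2d(c_mid)]
        layers += [GhostConv(c_mid, c_out)]
        self.main = nn.Sequential(*layers)
        # identity shortcut when shapes match; otherwise DW conv + 1x1 conv
        self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out else
                         nn.Sequential(
                             nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1,
                                       groups=c_in, bias=False),
                             nn.BatchNorm2d(c_in),
                             nn.Conv2d(c_in, c_out, 1, bias=False),
                             nn.BatchNorm2d(c_out)))

    def forward(self, x):
        return self.main(x) + self.shortcut(x)
```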
3.3. K-Means Algorithm
In UAV remote sensing imagery, pedestrians may exhibit significant variations in size, pose, and viewpoint. Employing the K-means algorithm enables the adaptive selection of suitable anchor sizes based on the actual distribution of target scales and shapes within the dataset. This facilitates better coverage of pedestrians with diverse scales and shapes.
The algorithm proceeds through a series of systematic steps. First, a training dataset is prepared, with comprehensive annotated bounding box data for the object detection task. Next, salient attributes such as width, height, and aspect ratio are extracted from each bounding box. Clustering begins by initializing K cluster centers, either by randomly selecting K bounding boxes or by deliberate initialization guided by prior knowledge. Each sample is then assigned to the cluster whose center is nearest, based on its distances to the K cluster centers. The cluster centers are iteratively refined by computing the average of all bounding boxes within each cluster, and this process repeats until the centers fluctuate negligibly or a predetermined iteration threshold is reached. Finally, representative cluster centers are selected based on the clustering outcome and serve as the prior boxes.
In the YOLOv5 architecture investigated here, each detection head has three candidate bounding boxes; with three detection heads, this yields nine Anchor Boxes in total (Table 1). This architectural constraint led us to set K = 9 when applying the K-means algorithm, so that the clustering yields three Anchor Boxes per detection head, tailored for object detection.
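To make these steps concrete, the following is a minimal NumPy sketch of anchor clustering with K = 9, assuming the boxes are given as (width, height) pairs; plain Euclidean distance on width/height is used for simplicity, whereas YOLO implementations often use a 1 − IoU distance instead.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs of annotated boxes into k anchor boxes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]  # init from k boxes
    for _ in range(iters):
        # assign every box to its nearest cluster center
        dists = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each center as the mean of the boxes assigned to it
        new_centers = np.array([wh[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # stop once centers settle
            break
        centers = new_centers
    # sort anchors by area so they can be split across the three detection heads
    return centers[np.argsort(centers.prod(axis=1))]

# hypothetical usage: wh would hold the annotated box sizes from the dataset
wh = np.abs(np.random.randn(500, 2)) * 40 + 10
print(kmeans_anchors(wh).round(1))
```

The nine sorted anchors are then split into three groups of three, one group per detection head, from the smallest scale to the largest.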
3.4. SPD-Conv
Traditional object detection networks face challenges in effectively detecting small objects due to inherent limitations.
Firstly, convolutional operations like stride convolutions and pooling layers often lead to critical information loss, primarily caused by downsampling and decreased spatial resolution. Secondly, small objects’ limited spatial coverage contrasts with traditional networks’ fixed receptive fields, hindering the capture of necessary contextual information and leading to incomplete understanding and localization difficulties. Lastly, traditional networks’ constrained spatial context and reduced resolution may inadequately learn discriminative features for small objects, resulting in suboptimal performance in distinguishing them from larger objects or complex backgrounds.
SPD-Conv is a novel building module for CNN architectures that aims to enhance the performance of object detection models. It replaces the stride convolutions and pooling operations commonly used in typical CNN architectures and consists of two main components: the SPD layer and a non-strided convolution layer. The SPD layer downsamples the feature map without losing information in the channel dimension: it rescales the feature map while preserving all information, thereby eliminating the need for stride convolutions and pooling operations. A non-strided convolution layer is added after the SPD layer to obtain the desired number of channels. The advantages of SPD-Conv include the preservation of fine-grained information, improved feature representation, and scalability. By avoiding stride convolutions and pooling operations, SPD-Conv retains the fine-grained details of the original feature map, which is crucial for the accurate detection of small objects. It enhances the representation of small objects and low-resolution images, leading to more precise detection. Additionally, SPD-Conv can be easily scaled or adjusted to accommodate different applications or hardware requirements. Assuming we have an intermediate feature map X of size S × S × C1, the SPD structure performs slicing operations on this map to obtain the following sub-features in sequence:

f_{i,j} = X[i : S : scale, j : S : scale]  (5)

where scale is a scaling factor that determines the downsampling rate applied to the input feature map X. The notation f_{i,j} represents the sub-feature maps obtained through the iterated slicing operations, where i and j each increase from 0 to (scale − 1). Each sub-feature map f_{i,j} has a size of S/scale × S/scale × C1, where S represents the spatial dimension of the input feature map and C1 represents the number of channels. For instance, when scale = 2 (Figure 6), the input feature map is divided into four sub-feature maps, each with dimensions S/2 × S/2 × C1. These sub-feature maps are then concatenated along the channel dimension, resulting in a new feature map X′ with dimensions S/2 × S/2 × (4C1). Finally, a 1 × 1 convolutional kernel with C2 output channels is applied to transform the concatenated feature map X′ into the desired feature map size.
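To illustrate Equation (5), the following is a minimal PyTorch sketch of an SPD-Conv block (space-to-depth slicing followed by a non-strided 1 × 1 convolution); the class name SPDConv and the BatchNorm/SiLU choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth slicing (Eq. 5) followed by a non-strided 1x1 conv."""
    def __init__(self, c1, c2, scale=2):
        super().__init__()
        self.scale = scale
        # after slicing, channels grow from C1 to scale^2 * C1
        self.conv = nn.Sequential(
            nn.Conv2d(c1 * scale * scale, c2, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(c2), nn.SiLU())

    def forward(self, x):
        s = self.scale
        # f_{i,j} = X[i : S : scale, j : S : scale]; four slices when scale = 2
        slices = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        x = torch.cat(slices, dim=1)   # S/2 x S/2 x (4*C1) when scale = 2
        return self.conv(x)            # 1x1 conv maps to C2 channels

x = torch.randn(1, 64, 80, 80)
print(SPDConv(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```

No information is discarded in the slicing step: every pixel of the input reappears in one of the sub-feature maps, so downsampling is achieved without the information loss of strided convolution or pooling.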
We apply the method described above to the backbone and neck parts of YOLO-Ghost (as shown in Section 4.2.2), in which SPD-Conv replaces the downsampling layers (Figure 7). In the backbone, YOLO-Ghost has five downsampling layers, i.e., 2⁵ = 32 times downsampling of the feature map; in the neck, there are two downsampling layers, i.e., a further 2² = 4 times downsampling of the feature map. Using 5 (backbone only) and 7 (both backbone and neck) SPD-Conv modules, we demonstrate that this method is more advantageous for small target detection.
3.5. Coordinate Attention
The CA mechanism is an improvement over the squeeze-and-excitation (SE) [50] attention mechanism, embedding positional information into channel attention (Figure 8). The SE attention mechanism enhances the importance of different channels by learning channel weights through global average pooling (Equation (6)) and fully connected layer operations:

z_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)  (6)

It focuses on inter-channel information interaction while overlooking the significance of positional information. In contrast, CA applies two 1D global average pooling operations to capture global information along the horizontal and vertical directions of the feature map, and then uses two fully connected layers to learn the feature weights. As a result, CA can capture long-range dependencies in the feature map while preserving precise positional information. By encoding positional information, it generates a pair of direction-aware and position-sensitive attention maps, thereby improving the representational capability for the objects of interest.
The main operations of CA are as follows. First, the input feature map undergoes two 1D global average pooling operations, one along each spatial direction (Equations (7) and (8)). Specifically, given input X, we encode each channel separately along the horizontal and vertical axes with two pooling kernels of spatial extent (H, 1) and (1, W):

z_c^h(h) = (1 / W) Σ_{0 ≤ i < W} x_c(h, i)  (7)

z_c^w(w) = (1 / H) Σ_{0 ≤ j < H} x_c(j, w)  (8)

This enables global information to be captured separately in the two spatial directions.
The features obtained from the two global average pooling operations are concatenated, reduced in dimensionality using a shared 1 × 1 convolution, and passed through batch normalization and a non-linear activation:

f = δ(F₁([z^h, z^w]))  (9)

where [·, ·] denotes the concatenation operation along the spatial dimension, δ represents the non-linear activation function, and F₁ denotes the shared 1 × 1 convolutional transformation.
After the computation outlined in Equation (9), we obtain the intermediate feature map shown in Equation (10):

f ∈ R^(C/r × (H + W))  (10)

where r is the reduction ratio that controls the block size.
Subsequently, f is split into two separate tensors along the spatial dimension (as shown in Equations (11) and (12)):

f^h ∈ R^(C/r × H)  (11)

f^w ∈ R^(C/r × W)  (12)

Then, two 1 × 1 convolutional transformations, F_h and F_w, are utilized to respectively transform f^h and f^w into tensors with the same number of channels as the input X:

g^h = σ(F_h(f^h))  (13)

g^w = σ(F_w(f^w))  (14)

where σ is the sigmoid activation function.
Finally, the outputs g^h and g^w from the previous step are expanded and used as attention weights. The output Y of the coordinate attention block can be represented in the form of Equation (15):

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)  (15)
From the derivation of the above equations, it can be observed that CA introduces positional information in addition to channel attention, enabling a more comprehensive capture of spatial relationships.
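To tie Equations (6)–(15) together, the following is a minimal PyTorch sketch of a coordinate attention block; the class name CoordAtt and the reduction default are illustrative assumptions rather than the exact implementation used in GSC-YOLO.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention: directional pooling (Eqs. 7-8), shared 1x1
    transform (Eq. 9), split (Eqs. 11-12), and gating (Eqs. 13-15)."""
    def __init__(self, c, r=32):
        super().__init__()
        mid = max(8, c // r)  # C/r channels, as in Eq. (10)
        self.f1 = nn.Sequential(nn.Conv2d(c, mid, 1),
                                nn.BatchNorm2d(mid), nn.ReLU())  # F1 and delta
        self.fh = nn.Conv2d(mid, c, 1)  # F_h in Eq. (13)
        self.fw = nn.Conv2d(mid, c, 1)  # F_w in Eq. (14)

    def forward(self, x):
        _, _, H, W = x.shape
        zh = x.mean(dim=3, keepdim=True)  # (H, 1) pooling -> N x C x H x 1, Eq. (7)
        zw = x.mean(dim=2, keepdim=True)  # (1, W) pooling -> N x C x 1 x W, Eq. (8)
        # concatenate along the spatial dimension and apply the shared transform
        f = self.f1(torch.cat([zh, zw.transpose(2, 3)], dim=2))  # Eq. (9)
        fh, fw = f.split([H, W], dim=2)                          # Eqs. (11)-(12)
        gh = torch.sigmoid(self.fh(fh))                   # N x C x H x 1
        gw = torch.sigmoid(self.fw(fw.transpose(2, 3)))   # N x C x 1 x W
        return x * gh * gw  # Eq. (15) via broadcasting

x = torch.randn(1, 64, 32, 32)
print(CoordAtt(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```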