1. Introduction
Pedestrian detection is a prominent research topic in computer vision, with widespread applications in autonomous driving, security, intelligent transportation, robotics, and other fields. With the continuous emergence of large shopping malls and urban complexes, person detection and counting have become key technological issues in public safety and emergency management. Detecting people in real time allows public spaces to be monitored more effectively, reducing risks such as crowd crushes and improving emergency response. For instance, in the “12.31” crowd crush at The Bund in Shanghai, surveillance cameras were present, but the video data could not be analyzed well enough to assess crowd size and safety risks in time, and the incident was not prevented. Research on computer vision-based pedestrian detection therefore holds considerable practical and application value.
To date, research on person detection has yielded substantial results, which can be broadly categorized into three approaches: (1) detection based on hand-crafted features [
1]; (2) deep learning-based object detection [
2]; and (3) transformer-based detection methods. In the domain of manually extracted features, representative detection models include Viola–Jones [
3], histogram of oriented gradients (HOG) [
4], and deformable part model (DPM) [
5]. Lei H et al. employed features extracted from HOG and local binary patterns (LBP) for rapid pedestrian detection in still images [
6]. Hongzhi Zhou and colleagues proposed using HOG for feature extraction followed by SVM classification for pedestrian detection [
7]. Although these traditional methods incur lower computational costs than deep learning-based object-detection algorithms, their accuracy is relatively low and difficult to improve further.
With the continuous development and maturation of deep learning technologies, deep learning-based object detection has gradually replaced traditional methods. Current mainstream deep learning object-detection algorithms are mainly divided into three categories: two-stage algorithms (such as RCNN [
8], Faster-RCNN [
9], Mask-RCNN [
10]), single-stage algorithms (such as the SSD series [
11] and YOLO series [
12]), and end-to-end detection approaches [
13]. While two-stage algorithms generally achieve higher detection accuracy, they are hampered by lower prediction efficiency and slower processing speeds. In contrast, single-stage algorithms feature simpler architectures and higher efficiency. Given the high mobility of people and the complexity of scenes in public spaces, single-stage algorithms are more suitable for person detection. To improve the detection accuracy of single-stage algorithms, researchers have extensively improved the YOLO series models by adopting deeper network architectures, automatic anchor box learning, improved loss functions, multi-scale training, and data augmentation [
14]. Jiwoong, for example, modeled bounding boxes with Gaussian parameters on the basis of YOLOv3 and proposed a new localization prediction algorithm that significantly enhanced detection accuracy [
15]. Sha Cheng optimized both the model and loss function of YOLOv3 for pedestrian detection in campus surveillance videos, achieving indoor pedestrian detection [
16], although the results were heavily affected by lighting conditions. Fanxin Yu integrated the CBAM attention module into the backbone and neck of the YOLOv5s network to enhance pedestrian features and improve detection accuracy [
17]. Lihu Pan and colleagues addressed the challenge of pedestrian detection on roadways by introducing an advanced model, HF-YOLO, which alleviates the complexity of detecting pedestrians in intricate traffic scenes [
18]. Although these approaches achieved promising detection results, few studies have focused on densely populated environments with severe occlusions.
Transformer-based person detection leverages the transformer architecture for human–object localization and recognition, representing a cutting-edge computer vision paradigm. To tackle accuracy bottlenecks arising from diverse pedestrian poses and complex backgrounds, Maxime Oquab et al. developed DINOv2 [
19], a transformer-based method enabling real-time person detection. Building on transformer networks, Wenyu Lv and co-workers proposed RT-DETRv2 [
20]. These methods employ global self-attention and end-to-end design to achieve high-precision detection in complex scenarios without the need for manually designed anchor boxes or post-processing. However, they entail high computational complexity and hardware demands, exhibit limited performance on small objects, and thus struggle in highly crowded settings.
The aforementioned methods primarily address issues of low detection accuracy. However, in densely populated public spaces, occlusion between pedestrians is a critical factor limiting detection accuracy. Some researchers have therefore begun to detect pedestrians by focusing on partial features of the human body. Since surveillance cameras in public areas are usually installed at elevated positions, the human head has naturally become a focal point of this research. In densely populated public spaces, head detection faces several severe challenges. First, heads in crowded scenes often occupy only a few pixels, making it difficult to extract detailed features. Second, frequent occlusion and overlap blur the boundaries between heads or hide them completely, increasing localization difficulty. Third, variations in camera mounting height and viewing angle lead to drastic changes in head scale, so a single receptive field cannot accommodate all targets. Moreover, the complex and varied backgrounds of settings such as shopping malls and subway stations are easily confused with head textures, producing false positives. Finally, factors such as lighting conditions and compression noise further weaken small-object signals, undermining model stability. Luo Jie proposed an improved algorithm based on YOLOv5s for head detection and counting on the SCUT_HEAD dataset [
21]. Chen Yong introduced a multi-feature-fusion pedestrian detection method that combines head information with the overall appearance of the pedestrian [
22].
Existing studies have largely focused on optimizing individual modules, and few have explored the potential of multi-module collaborative enhancement for YOLOv7. Notably, multi-task collaborative training mechanisms, such as the BWIC-TIMC [
23] system and the SWIC-TMC [
24] approach, have been shown to resolve subtask conflicts in end-to-end person search by dynamically balancing the learning weights across tasks, thereby significantly improving model performance. This paradigm offers valuable inspiration for multi-module collaborative optimization in dense scenarios. This paper optimizes the YOLOv7 model in the following ways:
Adding a CBAM attention module in the neck part, so that the network focuses more on target features while suppressing irrelevant regions;
Introducing a Gaussian receptive field-based label-assignment strategy at the junction between the feature-fusion module and the detection head, so that positive and negative samples for small targets are assigned in accordance with the network’s effective receptive field;
Replacing the original spatial pyramid pooling module (SPPCSPC) with the SPPFCSPC module to improve inference efficiency while maintaining the same receptive field.
In our method design, we chose YOLOv7 as the base model because it maintains high accuracy while achieving state-of-the-art real-time detection performance through a series of bag-of-freebies and bag-of-specials techniques, making it highly valuable for engineering applications. We integrated the CBAM attention module primarily due to its lightweight structure, which dynamically enhances critical information along both channel and spatial dimensions, significantly improving feature representation for tiny head regions. We adopted a Gaussian receptive field-based label-assignment strategy (RFLAGauss) because it adaptively assigns positive and negative sample weights according to the distance from an object’s center—ideal for handling extremely small, pixel-sparse head targets. Finally, we introduced the SPPFCSPC multi-scale fusion module because it substantially reduces computational complexity while preserving the same receptive field as traditional SPP, further boosting inference speed and effectively integrating multi-scale information. The main contributions of this paper are as follows:
Construction of a head-detection database: A head-detection dataset was built, comprising images from various environments such as classrooms, shopping malls, internet cafes, offices, airport security checkpoints, and staircases.
Optimized YOLOv7 network design for head detection: An improved YOLOv7 network was proposed for head detection. Specifically, the CBAM attention module was introduced in the neck section; a Gaussian receptive field-based label-assignment strategy was implemented between the feature-fusion module and the detection head; and finally, the SPPFCSPC module was employed to replace the original SPPCSPC spatial pyramid pooling module.
2. YOLOv7 Model Optimization
2.1. Overall Model Framework
The YOLOv7 model [
25] was proposed in 2022 and consists of four main modules: the input, backbone, neck, and head. The input module scales the input images to a uniform size to meet the network training requirements. The backbone is used for extracting features from the target images, the neck fuses features extracted at different scales, and the head adjusts the image channels of these multi-scale features and transforms them into bounding boxes, class labels, and confidence scores to achieve detection across various scales.
Although YOLOv7 shows significant improvements in overall prediction accuracy and operational efficiency compared to its predecessors, there remain some noteworthy issues in small-object detection. First, small objects generally occupy few pixels in an image, resulting in relatively vague feature representations; this makes it difficult for the model to capture sufficient detailed information during feature extraction, leading to missed or false detections. Second, since small objects are easily confused with the background or noise, the model might struggle to adequately differentiate between target and background during training, thereby affecting detection accuracy. Moreover, the existing network architecture may not fully address the feature representation requirements of both large and small objects during multi-scale feature fusion, which can lead to the attenuation of small object features and reduced detection performance. Finally, in real-world scenarios, small objects often exhibit diverse shapes and severe occlusion, further increasing detection challenges. Therefore, despite YOLOv7’s excellent overall performance, structural and strategic optimizations are urgently needed to enhance the accuracy and robustness of small-object detection.
In this study, we enhance the original YOLOv7 architecture to better address small-object detection, as illustrated in
Figure 1. First, we embed the CBAM attention mechanism into the neck module (yellow region) to more effectively capture the fine-grained features of small objects. Next, we insert a Gaussian receptive field-based label-assignment strategy, RFLAGauss, between the standard feature-fusion module and the detection head (orange region), using the CBAM-refined features as input to perform precise boundary- and shape-aware label allocation. Finally, we replace the conventional spatial pyramid pooling with the SPPFCSPC module (brown region), which aggregates multi-scale contextual information while preserving the receptive field of the original module, thereby preventing information loss during feature fusion.
2.2. CBAM
YOLOv7 employs convolutional neural networks (CNNs) for feature extraction. However, traditional convolution operations are fixed in nature and cannot dynamically adjust their focus based on varying input content. Designed with an emphasis on computational efficiency and fast inference, YOLOv7 may overlook subtle yet critical information within feature maps, especially when dealing with complex backgrounds or small-scale targets. In dense scenarios, small objects like heads often appear with blurred features due to sparse pixels and background interference. Conventional convolutional layers lack adaptive focusing ability, which increases the likelihood of missed detections.
Since the vanilla YOLOv7 network does not incorporate any attention mechanism, it struggles to capture the faint features of small targets. To address this, we embed a convolutional block attention module (CBAM) into the neck module. By applying both channel-wise and spatial attention, CBAM more precisely focuses on small-object regions, thereby significantly improving recall for these targets.
CBAM is a lightweight yet effective attention module designed for feedforward convolutional neural networks [
26]. It consists of two sequential sub-modules: the channel attention module (CAM) and the spatial attention module (SAM). CAM helps the network concentrate more on foreground and semantically meaningful regions, while SAM allows the model to emphasize spatially informative areas rich in contextual information across the entire image. The structure of the CBAM module is illustrated in
Figure 2.
The core concept of the channel attention mechanism is consistent with that of the squeeze-and-excitation networks (SENet) [
27]. As shown in
Figure 2, the input feature map
F(
H ×
W ×
C) is first subjected to both global max pooling and global average pooling along the spatial dimensions (height
H and width
W), producing two feature descriptors of size 1 × 1 ×
C. These two descriptors are then passed through a shared-weight two-layer multilayer perceptron (MLP) to learn inter-channel dependencies. A reduction ratio
r is applied between the two layers to reduce dimensionality. Finally, the outputs from the MLP are combined via element-wise addition, followed by a sigmoid activation function to generate the final channel attention weights, denoted as
Mc. The formulation is as follows:
Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))
where
F is the input feature map;
AvgPool(
F),
MaxPool(
F) are the global average pooling and max pooling outputs;
MLP is a two-layer fully connected network with a bottleneck structure using a reduction ratio
r;
σ denotes the sigmoid activation function;
Mc is the channel attention weight vector.
The spatial attention mechanism takes the feature map F output from the channel attention module as its input [
28]. First, it applies global max pooling and average pooling along the channel dimension to generate two feature maps of size
H ×
W × 1. These two maps are then concatenated along the channel axis.
A convolution operation with a 7 × 7 kernel is then applied to reduce the two concatenated maps to a single-channel spatial attention map. Finally, a sigmoid activation function normalizes this map, capturing the spatial dependencies among elements and producing the spatial attention weights
Ms [
20]. The formulation is as follows:
Ms(F) = σ(f7×7([AvgPool(F); MaxPool(F)]))
where
F is the input feature map; [.;.] indicates concatenation along the channel axis;
f7×7 represents a convolution operation with a 7 × 7 kernel;
σ denotes the sigmoid activation function;
Ms is the learned spatial attention map.
The channel attention mechanism dynamically adjusts the weight of each channel based on the global information of each channel in the image, allowing the model to focus more on feature channels with higher information content. Spatial attention, on the other hand, helps the model focus on the location of the target object by emphasizing significant spatial features, rather than being distracted by the background or noise.
First, the channel attention module processes each feature channel by applying global average pooling and max pooling to extract global information. These are then passed through a shared multilayer perceptron (MLP) to generate channel-wise attention weights. This step enables the network to automatically identify the feature channels that are more important for the detection task while suppressing those with irrelevant or noisy information, thus enhancing the quality of feature representation.
Next, the spatial attention module processes the channel-weighted feature maps by aggregating channel information (e.g., via average pooling and max pooling) to generate a spatial descriptor, followed by a convolution layer to generate the spatial attention map. This mechanism clearly identifies the regions in the image that require more attention, thereby strengthening the prominence of target regions during the feature-fusion phase.
For small-object detection, this detailed attention mechanism is particularly critical because small objects often occupy a small area in the image and have weak features, making them prone to being ignored or confused with the background during traditional convolution operations. By incorporating the CBAM module, YOLOv7 is able to more effectively focus on these subtle regions, enhancing the feature expression of small objects and consequently improving detection accuracy and robustness.
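To make the mechanism concrete, the following is a minimal PyTorch sketch of a CBAM block as described above, with channel attention applied first and spatial attention second. The reduction ratio r = 16 and the 7 × 7 spatial kernel follow the description in the text, but the class names and exact layer configuration are illustrative assumptions rather than the implementation used in this work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global avg/max pooling -> shared MLP -> sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))             # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))              # global max pooling branch
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)   # channel attention weights Mc
        return x * w

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise avg/max maps -> 7x7 conv -> sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)              # H x W x 1 average map
        mx = x.amax(dim=1, keepdim=True)               # H x W x 1 max map
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial weights Ms
        return x * w

class CBAM(nn.Module):
    """Sequential CAM followed by SAM, as in the module described above."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention()

    def forward(self, x):
        return self.sam(self.cam(x))
```

In this arrangement the channel-refined feature map feeds the spatial module, matching the sequential CAM-then-SAM order described above.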
2.3. Receptive Field Enhancement
YOLOv7 employs a label-assignment strategy based on fixed intersection over union (IoU) thresholds. However, this fixed approach often struggles to correctly classify borderline samples as positive or negative, and it proves limited when dealing with targets of varying scales or irregular shapes. In this study, we introduce a label-assignment strategy based on Gaussian receptive field (RFLAGauss), which is applied between the feature-fusion module and the detection head. This strategy dynamically adjusts the assignment weights such that anchor points closer to the target center receive higher importance, thereby enabling a more reasonable allocation of positive and negative samples.
The Gaussian receptive field adapts well to different object sizes. It concentrates more around the center for small objects, while expanding suitably for larger ones to capture broader features. This dynamic mechanism also helps mitigate the imbalance between positive and negative samples, preventing excessive negative samples from suppressing the learning process. Moreover, it enhances the quality of positive samples, enabling the detection head to learn object features more accurately.
The underlying principle of the Gaussian receptive field-based label-assignment strategy, RFLAGauss, is illustrated in
Figure 3. The process begins with feature extraction, followed by convolution with a Gaussian kernel. Finally, the features are aggregated into a single feature point, resulting in an effective receptive field. Label assignment is then performed based on this adaptive receptive field to better accommodate objects of various sizes.
Traditional label-assignment methods often rely on fixed or simplistic rules to determine which regions should be regarded as positive samples. However, such methods tend to struggle with small-object detection due to their limited pixel representation and indistinct features, resulting in inaccurate label allocation and degraded detection performance. The RFLAGauss strategy addresses this issue by introducing a Gaussian-based weighting mechanism within the receptive field, enabling more fine-grained and adaptive label assignment.
Specifically, this strategy is applied between the feature-fusion module and the detection head, where a Gaussian weighting function is computed for each candidate region or feature point. Regions closer to the object center are assigned higher weights, while those farther away receive lower ones. This Gaussian receptive field-based allocation mechanism captures core object features more effectively and helps suppress background noise during training. For small objects, where only a limited number of pixels are available, such precise label distribution is particularly critical: it ensures that even minimal object regions are given sufficient positive sample weight, thereby enhancing the model’s sensitivity and accuracy in detecting small targets.
Furthermore, the Gaussian receptive field approach alleviates issues in multi-scale feature alignment. Conventional methods often lose critical details of small targets during scale transformations. In contrast, RFLAGauss dynamically adjusts the weights, guiding the detection head to focus more on the central and adjacent regions of the target across different scales. This improves the localization accuracy and confidence scores for small objects. As a result, this refined label-assignment strategy not only boosts the recall rate for small-object detection but also reduces false positives, ultimately enhancing the model’s overall robustness and detection performance.
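As a rough illustration of the idea, the snippet below sketches a simplified Gaussian weighting of feature points against ground-truth centers, with the highest-weighted points taken as positive samples. It is a schematic sketch only: the function name, the use of box size as the Gaussian scale, and the top-k selection rule are assumptions made for illustration and do not reproduce the exact RFLAGauss assignment used in this work.

```python
import torch

def gaussian_label_weights(points, gt_boxes, num_pos=10):
    """Simplified Gaussian label assignment (illustrative only).

    points:   (N, 2) feature-point centers (x, y) on the image plane
    gt_boxes: (M, 4) ground-truth boxes (cx, cy, w, h)
    Returns a (M, N) weight matrix and a boolean positive-sample mask.
    """
    centers = gt_boxes[:, :2]                        # (M, 2) object centers
    sigmas = gt_boxes[:, 2:] / 2.0 + 1e-6            # (M, 2) Gaussian scale from box size
    diff = points[None, :, :] - centers[:, None, :]  # (M, N, 2) offsets to each point
    # 2-D Gaussian score: points closer to the center (relative to size) get higher weight
    weights = torch.exp(-0.5 * ((diff / sigmas[:, None, :]) ** 2).sum(dim=-1))
    # take the top-k highest-weighted points per object as positive samples
    topk = weights.topk(min(num_pos, weights.shape[1]), dim=1).indices
    pos_mask = torch.zeros_like(weights, dtype=torch.bool)
    pos_mask.scatter_(1, topk, True)
    return weights, pos_mask
```

The key property this sketch shares with the strategy described above is that assignment weights decay smoothly with distance from the object center, scaled by object size, so even very small heads retain a few well-placed positive samples.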
2.4. SPPFCSPC
In the previous section, the label-assignment strategy between the feature-fusion module and the detection head was optimized, improving the efficiency of feature utilization and the rationality of sample distribution. However, in object-detection tasks, the size of the receptive field is equally critical for capturing multi-scale targets and global contextual information.
To address this, the YOLOv7 network employs a spatial pyramid pooling module with cross-stage partial connections (SPPCSPC) to expand the receptive field and enhance the model’s adaptability to input images of varying resolutions. The underlying mechanism is illustrated in
Figure 4a. The pooling operation within the SPPCSPC module is computed using the following formula:
where
R denotes the output;
represents the result of tensor concatenation;
F represents the input feature map;
k represents the kernel size used in the pooling;
p denotes the padding applied to the input feature map.
The SPPCSPC structure, however, increases the computational complexity of the network [
29] and suffers from severe information loss when detecting small objects. In this work, we draw on the fast spatial pyramid pooling of SPPF and the feature split–merge–fusion concept of CSP to design the SPPFCSPC module. The SPPFCSPC structure is an optimization of the SPPCSPC structure, incorporating the design principles of the SPPF module. The structure of the SPPFCSPC module is shown in
Figure 4b.
The pooling part of the SPPFCSPC module is calculated as shown in Equations (4)–(7). By cascading three pooling layers with a small kernel, the module reproduces the result that a single layer with a larger kernel would produce; this reduces the computational load and accelerates processing while keeping the receptive field unchanged [
30].
where
R1 denotes the pooling layer result with the smallest pooling kernel;
R2 represents the pooling layer result with a medium pooling kernel;
R3 indicates the pooling layer result with the largest pooling kernel;
R4 represents the final output;
denotes the tensor concatenation.
The SPPFCSPC module, which replaces the original spatial pyramid pooling, combines the advantages of fast spatial pyramid pooling (SPPF) and the cross-stage partial (CSP) structure. It expands the receptive field through multi-scale pooling operations to capture information at different scales in the image, while using the CSP structure to split and fuse feature maps efficiently. This reduces redundant computation and improves gradient-flow efficiency.
In traditional multi-scale pyramid pooling, although multi-scale features can be obtained, there may be limitations in information integration and detail preservation, especially in small-object-detection scenarios, where key information is often lost. The SPPFCSPC module, however, effectively aggregates global features while better preserving local and fine-grained information, enabling the model to more sensitively capture small object features.
As a result, when the detection head of YOLOv7 receives feature maps processed by the SPPFCSPC module, the information passed through is richer and more accurate. This not only improves the recall rate for small-object detection but also effectively reduces false-positive rates, thereby enhancing overall detection performance and robustness.
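The sketch below outlines an SPPFCSPC-style block in PyTorch, assuming three cascaded 5 × 5 max-pooling layers (equivalent in receptive field to 5, 9, and 13 kernels) inside a CSP-style split. Channel widths, the convolution stack, and class names are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1):
    """Conv + BN + SiLU, the basic block used throughout YOLO-style networks."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class SPPFCSPC(nn.Module):
    """SPPF-style cascaded pooling inside a CSP split (simplified sketch)."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_out // 2
        self.branch_main = conv_bn_act(c_in, c_hidden, 1)   # branch that is pooled
        self.branch_skip = conv_bn_act(c_in, c_hidden, 1)   # CSP shortcut branch
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.fuse_pool = conv_bn_act(4 * c_hidden, c_hidden, 1)
        self.fuse_out = conv_bn_act(2 * c_hidden, c_out, 1)

    def forward(self, x):
        y = self.branch_main(x)
        p1 = self.pool(y)        # effective 5x5 receptive field
        p2 = self.pool(p1)       # effective 9x9
        p3 = self.pool(p2)       # effective 13x13
        y = self.fuse_pool(torch.cat([y, p1, p2, p3], dim=1))
        return self.fuse_out(torch.cat([y, self.branch_skip(x)], dim=1))
```

Reusing one small kernel three times in sequence is what allows the block to keep the receptive field of the larger parallel pools in SPPCSPC while performing less computation per forward pass.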
4. Conclusions
This study focuses on personnel detection in dense environments and proposes detecting human heads to address issues such as occlusion. The YOLOv7 model was selected as the base network for head detection. Based on the characteristics of the dataset used, several improvements were made to the YOLOv7 network. First, the CBAM attention module was added. Second, a Gaussian receptive field-based label-assignment strategy was applied at the connection between the feature-fusion module and the detection head. Finally, the SPPFCSPC module was used to replace the original spatial pyramid pooling module (SPPCSPC). The experimental comparative analysis led to the following conclusions.
On this dataset, the YOLOv7 model with the CBAM attention mechanism achieved a precision of 93.7%, a recall of 93.0%, and an mAP@0.5 of 0.906. Its precision exceeded that of the SE, ECA, and GAM attention mechanisms; its recall exceeded that of SE, CA, and ECA; and its mAP exceeded that of SE, CA, ECA, and GAM.
The three improvements made to YOLOv7 enhanced the model’s precision and inference capability. Precision increased from 92.4% to 94.4%, recall improved from 90.5% to 93.9%, and inference speed rose from 87.2 FPS to 94.2 FPS. Compared with single-stage object-detection models such as YOLOv8, the improved YOLOv7 model achieved better accuracy and inference speed. Its inference speed was also significantly higher than that of two-stage detectors such as Faster R-CNN and Mask R-CNN, as well as transformer-based models such as DINOv2 and RT-DETRv2.
Future research will focus on improving the efficiency and generalization ability of the model. On the one hand, further expanding the dataset to cover more diverse scenes and lighting conditions will enhance the model’s adaptability and robustness. On the other hand, more efficient network architectures and optimization algorithms will be explored to further improve detection performance. It may also be beneficial to develop adaptive confidence-threshold algorithms that adjust detection sensitivity in real time according to the risk level of the scene, balancing efficiency and accuracy. Finally, future studies may consider deploying the model on different devices for testing to improve its practical application value.