Research on a Lightweight YOLOv9 Object Detection Algorithm Fused with Adaptive Gated Coordinate Attention

Lv, Condong; Zhou, Wenjie; Li, Yi; Song, Yupeng; Zhang, Xiaodong

doi:10.3390/math14101738


by Condong Lv *, Wenjie Zhou, Yi Li, Yupeng Song and Xiaodong Zhang

School of Computer Science, Nanjing Audit University, Nanjing 211815, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(10), 1738; https://doi.org/10.3390/math14101738

Submission received: 9 April 2026 / Revised: 13 May 2026 / Accepted: 15 May 2026 / Published: 19 May 2026


Abstract

Safety gear detection in complex industrial environments faces challenges such as strong background interference, multi-scale spatial perturbations, and the loss of small target features. Furthermore, existing attention-based object detection methods often struggle to balance fine-grained feature retention with background noise suppression. To address these issues, this paper proposes AGCA-YOLOv9, a lightweight object detection model (9.77 M parameters and 39.6 GFLOPs). The core contribution is the Adaptive Gated Coordinate Attention (AGCA) module integrated into the GELAN backbone. Unlike standard coordinate attention mechanisms, AGCA employs a dual-path hybrid pooling strategy combined with an adaptive gated weight fusion mechanism. This design dynamically regulates the synergy between global semantic information and local salient textures, differentiating it from traditional linear feature aggregation. Consequently, it effectively suppresses false detections caused by visually isomorphic backgrounds, such as dense steel frames, while enhancing the representation of distant tiny targets. Validation on the Safety Helmet and Reflective Jacket dataset and the Helmet-Vest-Belt dataset shows that, compared to the YOLOv9s baseline, AGCA-YOLOv9 increases the mAP@50:95 on the Safety Helmet and Reflective Jacket dataset by 0.6% (reaching 80.9%) and the recall rate by 0.4% (reaching 91.9%). Specifically, the mAP@50:95 for the safety helmet category improves by 0.8%. On the Helmet-Vest-Belt dataset, the mAP@50:95 increases by 1.5% (reaching 60.5%). The single-image inference time is 4.6 ms. These results indicate that the proposed algorithm achieves a practical trade-off between detection accuracy and real-time processing speed, demonstrating its potential for safety compliance monitoring in industrial scenarios.

Keywords: object detection; YOLOv9; safety gear detection; adaptive gated attention; dual-path hybrid pooling; complex background suppression

MSC: 68T45; 68T07

1. Introduction

China boasts a large-scale industrial production system with booming infrastructure construction. In high-risk industries such as construction, mining, and precision manufacturing, the operating environment at job sites is complex and changeable. Wearing protective equipment such as safety helmets and reflective vests correctly is the last line of defense to safeguard the lives of workers [1]. Failure to strictly implement safety gear regulations will directly lead to serious consequences in the event of an accident. At present, safety supervision at construction sites mainly relies on manual inspections and traditional closed-circuit television monitoring, which has drawbacks such as low efficiency, supervision blind spots, and an inability to meet the demand for real-time early warning. Compared with traditional methods, deep learning-based object detection technology has the advantages of strong real-time performance and all-weather operation, becoming the mainstream technology for industrial intelligent supervision. Object detection algorithms are mainly divided into two categories: one-stage detectors (e.g., YOLO [2], RetinaNet [3], RT-DETR [4]) and two-stage detectors (e.g., Faster R-CNN [5], Mask R-CNN [6]). In comparison, one-stage object detection has a faster detection speed and higher real-time performance, making it particularly suitable for construction safety monitoring scenarios that require fast and accurate target recognition. Therefore, this paper selects the one-stage detector as the basic framework for research.

To address the complex environmental challenges in industrial sites, recent studies on safety gear detection can be broadly categorized into two methodological approaches. The first group focuses on attention-based feature enhancement to address occlusion and multi-scale issues. For instance, researchers [7,8] integrated Coordinate Attention (CA) to capture weak features of dense small targets. Similarly, other studies [9,10] employed modules like DWR, AKConv, and CBAM to optimize feature fusion paths in unstructured backgrounds. The second group focuses on lightweight network architectures designed for resource-constrained edge devices. Works in this category [11,12] utilized lightweight components such as Ghost convolution variants and VoVGSCSP modules to achieve a reduction in both parameters and computational complexity.

However, these approaches remain limited when confronted with the combined challenges of industrial monitoring. First, current attention-based methods are insufficient for suppressing complex backgrounds. Standard attention mechanisms often rely on linear aggregation and lack the dynamic gating needed to distinguish highly unstructured backgrounds (e.g., dense steel frames and crisscrossing pipelines) that are geometrically and visually similar to safety helmets [13]. Second, current lightweight methods sacrifice fine-grained feature retention. The aggressive downsampling used to reduce parameters severely sparsifies the spatial geometric features of tiny non-rigid targets (e.g., workers viewed from high-angle surveillance cameras), leaving the models vulnerable to extreme spatial perturbations and metallic highlight reflections, which cause feature submergence [14]. To bridge this gap, this study proposes AGCA-YOLOv9, an object detection model designed for robustness in complex industrial scenarios. Its core is a redesigned lightweight backbone, AGCA-GELAN. Rather than simply appending attention modules to the neck, this study integrates attention deeply into the backbone: the adaptive gated coordinate attention (AGCA) mechanism is embedded in the core computational flow of the YOLOv9 GELAN architecture, so that features are calibrated as they propagate through the backbone. A dual-path hybrid pooling strategy captures salient texture features, mitigating the loss of tiny target features during downsampling, while an adaptive gated fusion mechanism dynamically suppresses isomorphic background noise. While remaining lightweight, this architecture improves feature discrimination under complex interference, providing technical support for safety supervision on smart construction sites.

2. Overview of YOLOv9

In this study, YOLOv9s [15] is selected as the baseline model. Within the YOLO series, YOLOv9s provides a suitable architecture for resource-constrained edge computing. Specifically, the original YOLOv9s model utilizes 9.74 M parameters and 39.6 GFLOPs while maintaining robust feature representation capabilities. Furthermore, while recent iterations such as YOLOv10 and YOLOv11 have introduced architectural shifts focusing on NMS-free designs and generalized efficiency blocks, they primarily optimize the macro-level inference pipeline. In contrast, the YOLOv9s architecture, particularly its GELAN backbone, offers a more straightforward gradient path. This structural characteristic makes it an appropriate baseline for isolating and validating the micro-level spatial feature calibration provided by our proposed attention mechanism, mitigating potential interference from the architecture.

The network components most relevant to our proposed modifications are concentrated in the backbone. The YOLOv9s backbone utilizes the Generalized Efficient Layer Aggregation Network (GELAN) architecture, with the RepNCSPELAN4 module as its core building block. This module integrates the gradient path advantages of CSPNet with the inference efficiency of ELAN [16], ensuring robust feature extraction with a reduced parameter count.

To address the challenges of strong background interference and tiny target feature loss in industrial environments, this study modifies the architectural blocks within the GELAN backbone. Specifically, the Adaptive Gated Coordinate Attention (AGCA) module is integrated into the internal branches of the deep RepNCSPELAN4 modules. GELAN is determined to be the most appropriate location for this integration because the deep nodes of the backbone possess high-level semantic receptive fields [17]. Embedding the attention mechanism at this stage allows the network to proactively filter out highly unstructured background noise (such as dense steel frames) and preserve the salient geometric textures of tiny targets during the downsampling process. Compared to appending attention modules in the neck network, this deep integration strategy within GELAN prevents feature submergence more effectively, ensuring that high-quality, noise-filtered representations are passed to the subsequent feature pyramid and detection head.

3. Research Methods

In safety gear compliance detection at industrial production sites, targets in monitoring images exhibit multi-scale spatial perturbations and complex background interference. Specifically, in high-angle surveillance, workers typically appear as small-scale targets (under $32 \times 32$ pixels), making their edge features susceptible to attenuation during network downsampling. Furthermore, dense steel structures and metal highlights share visual isomorphism with safety helmets and reflective vests, which increases the probability of false positive detections. In practical industrial deployments, addressing these challenges requires adhering to specific quantitative constraints: maintaining detection accuracy while constraining parameter size (e.g., <10 M) and achieving real-time inference speeds (<5 ms per image). Driven by these task specifications, this section analyzes and reconstructs the feature extraction network of YOLOv9. The original GELAN backbone has two shortcomings. First, it lacks a mechanism for maintaining long-range dependencies of weak features during deep feature transmission, so the semantic information of distant small targets gradually becomes sparse. Second, it applies undifferentiated convolution operations to all channel features and lacks dynamic saliency screening, so it cannot effectively suppress the redundant noise introduced by complex industrial backgrounds. To address the loss of tiny target features, this study reconstructs the backbone network (AGCA-GELAN) and introduces a dual-path hybrid pooling strategy designed to extract salient texture features. To address false detections induced by isomorphic backgrounds and highlight reflections, an adaptive gated fusion mechanism is designed to dynamically suppress redundant noise and separate foreground features. The following subsections first discuss the overall architecture design and multi-scale deployment strategy of AGCA-GELAN, then analyze the dual-path hybrid pooling principle and gated fusion mechanism of the core AGCA module, and finally verify the parameter efficiency of the design.

3.1. AGCA-GELAN Deeply Integrated Backbone Network Architecture

The original GELAN backbone architecture of YOLOv9 integrates the gradient path advantages of CSPNet [18] and the inference efficiency of ELAN [19], and has good general feature extraction capability. However, to address the problems of feature sparsification and background confusion, the AGCA-GELAN backbone network designed in this study (shown in the dashed box in Figure 1) reconstructs it at the underlying level. In the feature extraction stage, instead of following the simple idea of only “attaching” the attention mechanism to the end of the network as in traditional algorithms, the network adopts a deep integration strategy. Referring to the embedding paradigm of Coordinate Attention [20] in lightweight networks, AGCA-GELAN embeds and couples the Adaptive Gated Coordinate Attention (AGCA) module into the core computational flow. This design is proposed to enhance the model’s resistance to unstructured noise and discrimination capability for tiny targets from the physical level while maintaining the advantage of efficient inference [21]. The improvement strategies of its core architecture are mainly reflected in the following two aspects.

First, AGCA-GELAN designs a deeply integrated feature extraction unit, namely the RepNCSP-AGCA module (the structure is shown in Figure 2). This design aims to solve the problem of noise accumulation in the feature extraction process. Specifically, this study embeds the AGCA module into the key path between the RepNCSP convolution operation and the feature concatenation layer. The reparameterized convolution structure adopted by the RepNCSP module draws on the design idea of RepVGG [22], so that after the feature map undergoes multi-branch nonlinear transformation and fusion, it is dynamically calibrated in the spatial and channel dimensions. If the input feature is defined as $X_{in}$, and the input of the computational branch after splitting by $1 \times 1$ convolution is $X_{split}$, the improved single-branch feature extraction process can be formally expressed as:

$$Y_{branch} = F_{conv}(M_{agca}(F_{csp}(X_{split}))) \tag{1}$$

where $F_{csp}$ represents the convolution transformation of the RepNCSP module, $M_{agca}$ represents the adaptive gated attention modulation operator proposed in this paper, and $F_{conv}$ is the mapping operation at the end of the branch. Based on this computing paradigm, the network establishes a layer-by-layer calibration mechanism inside the backbone, aiming to dynamically suppress redundant background textures generated by steel frames and pipelines during downsampling, and to ensure the purity of features flowing into the next level. In addition, since $M_{agca}$ is located on the main path of gradient backpropagation, deep semantic signals can directly guide shallow convolution kernels to focus on the highlight areas of reflective vests, realizing end-to-end optimization of feature extraction.

Second, AGCA-GELAN adopts a full-scale enhanced deployment strategy to cope with the severe target scale changes in industrial scenarios. Considering that different levels of the feature pyramid carry differentiated semantic information, this study deeply integrates the improved unit at three key scales P3, P4 and P5 of the Backbone. At the shallow layer P3, AGCA is used to focus on and extract the helmet contours of distant workers, alleviating the problem of geometric feature loss caused by downsampling; at the deep layer P5, global semantic information is used to suppress metal highlight interference similar to reflective vests, aiming to mitigate the problem of false detections in complex industrial backgrounds. Through this multi-scale collaborative design, the model is structurally optimized to enhance its robustness to complex industrial environments while maintaining a lightweight design. The detailed architecture and computational cost parameters of the modified AGCA-GELAN backbone network are summarized in Table 1. As shown in the table, the deep integration of the AGCA module introduces a negligible computational overhead (with an increment of only 30,268 parameters and 0.0105 GFLOPs), verifying its lightweight nature.
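The overhead figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the baseline parameter count is the rounded value of 9.74 M, so the percentages are approximate, and the function name is chosen here for illustration.

```python
# Figures quoted in the text. The baseline parameter count is a rounded
# value (9.74 M), so the derived percentages are approximate.
BASELINE_PARAMS = 9.74e6
BASELINE_GFLOPS = 39.6
AGCA_EXTRA_PARAMS = 30_268
AGCA_EXTRA_GFLOPS = 0.0105


def overhead_summary():
    """Return totals and relative increases after adding the AGCA modules."""
    total_params = BASELINE_PARAMS + AGCA_EXTRA_PARAMS
    total_gflops = BASELINE_GFLOPS + AGCA_EXTRA_GFLOPS
    return {
        "total_params_m": round(total_params / 1e6, 2),
        "param_increase_pct": round(100 * AGCA_EXTRA_PARAMS / BASELINE_PARAMS, 2),
        "total_gflops": round(total_gflops, 1),
        "gflops_increase_pct": round(100 * AGCA_EXTRA_GFLOPS / BASELINE_GFLOPS, 3),
    }


summary = overhead_summary()
print(summary)
```

Under these assumptions the totals round to 9.77 M parameters and 39.6 GFLOPs, matching the values reported for AGCA-YOLOv9, with a relative parameter increase of roughly 0.3%.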

3.2. Dual-Path Hybrid Pooling Feature Aggregation Strategy

In real industrial environments, steel frame structures, pipeline facilities, and complex illumination make the background of monitoring images highly unstructured, and the highlight areas of reflective vests are easily confused with specular reflections from surrounding metal equipment. As a result, the model has difficulty separating foreground from background during feature extraction, which leads to missed and false detections. To balance retention of the overall semantic information of targets against accurate capture of salient texture features, this study introduces a dual-path hybrid pooling strategy into the AGCA module, designed to enhance the model's representation of key features.

The dual-path feature aggregation structure of the AGCA module is shown in Figure 3. The design inspiration of this module comes from the Dual Coordinate Attention Feature Extraction (DCAFE) in Flora-NET [23] and the dual-path pooling idea of CBAM [24]. In processing industrial monitoring images, a single pooling operation often has limitations: using only average pooling will smooth the edges of reflective vests, and using only max pooling will lose the posture context of workers. This one-way loss of frequency band information is easily amplified by the network downsampling process when facing high-noise backgrounds with a high degree of visual isomorphism, covering up real targets and thus inducing feature submergence and false detections. For this reason, the AGCA module constructs parallel average pooling (Mean-Stream) and max pooling (Max-Stream) branches along the spatial coordinate axes to facilitate complementary extraction of features in different frequency bands.

Specifically, the AGCA module performs feature aggregation along the horizontal (X-axis) and vertical (Y-axis) directions for the input feature map X, respectively.

For the average pooling branch (Mean-Stream), its main function is to extract the overall spatial position distribution and contextual background information of the target. The feature aggregation process is defined as:

$$z_{c,mean}^{h}(h) = \frac{1}{W} \sum_{0 \leq i < W} x_{c}(h, i) \tag{2}$$

$$z_{c,mean}^{w}(w) = \frac{1}{H} \sum_{0 \leq j < H} x_{c}(j, w) \tag{3}$$

where $z_{c,mean}^{h}$ and $z_{c,mean}^{w}$ represent the average aggregated features of the c-th channel at height h and width w, respectively.

For the max pooling branch (Max-Stream), its main function is to capture high-frequency detail features represented by the highlight areas of reflective vests and the hard edges of safety helmets. The feature aggregation process is defined as:

$$z_{c,max}^{h}(h) = \max_{0 \leq i < W} x_{c}(h, i) \tag{4}$$

$$z_{c,max}^{w}(w) = \max_{0 \leq j < H} x_{c}(j, w) \tag{5}$$

where max represents the max pooling operation.
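The four directional descriptors in Equations (2)-(5) can be illustrated on a single channel. The following is a minimal sketch in plain Python, not the authors' implementation; the function name and the toy feature map are assumptions made for illustration.

```python
def directional_pool(channel):
    """Pool one H x W channel (list of rows) along each spatial axis."""
    height = len(channel)
    width = len(channel[0])
    columns = list(zip(*channel))  # transpose: W columns of length H
    return {
        "mean_h": [sum(row) / width for row in channel],   # Eq. (2)
        "mean_w": [sum(col) / height for col in columns],  # Eq. (3)
        "max_h": [max(row) for row in channel],            # Eq. (4)
        "max_w": [max(col) for col in columns],            # Eq. (5)
    }


# Toy 3 x 4 channel with one bright pixel, standing in for a small highlight.
x_c = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 8.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
pooled = directional_pool(x_c)
print(pooled["mean_h"], pooled["max_h"])
```

In this toy case the isolated response of 8.0 is diluted to 2.0 by the row-wise mean but preserved by the row-wise max, which is the complementarity the two streams are meant to exploit.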

To establish dependencies between channels and prevent a sharp increase in the number of model parameters, the AGCA module introduces a parameter-sharing strategy in the feature transformation stage. The spatial direction features generated by the above two branches are concatenated separately and input into a parameter-shared $1 \times 1$ convolution layer for feature transformation and dimensionality reduction to generate an intermediate feature map F. Taking the average pooling branch as an example, its feature transformation process is as follows:

$$F_{mean} = \delta(\mathrm{Conv}_{shared}(\mathrm{Concat}([z_{mean}^{h}, z_{mean}^{w}]))) \tag{6}$$

where $\mathrm{Concat}$ represents the concatenation operation along the spatial dimension, $\mathrm{Conv}_{shared}$ is the parameter-shared convolution transformation layer, and $\delta$ is the non-linear activation function.
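A small numerical sketch may help make Equation (6) concrete. A $1 \times 1$ convolution applies the same channel-mixing matrix at every position, so it can be written as a matrix product over the concatenated descriptor of length H + W. The code below is an illustration under stated assumptions: ReLU stands in for the unspecified activation $\delta$, bias and normalization are omitted, and the weight values and function names are hypothetical.

```python
def relu(value):
    return max(0.0, value)


def shared_transform(z_h, z_w, weights):
    """Apply Concat -> 1x1 conv -> activation, as in Eq. (6).

    z_h: C x H and z_w: C x W directional descriptors.
    weights: C_r x C matrix of the 1x1 convolution (no bias, assumed).
    Returns a C_r x (H + W) intermediate feature map.
    """
    # Concatenate along the spatial axis, channel by channel.
    concat = [list(h_part) + list(w_part) for h_part, w_part in zip(z_h, z_w)]
    length = len(concat[0])
    # A 1x1 convolution mixes channels independently at each position.
    return [
        [relu(sum(wt * concat[c][p] for c, wt in enumerate(row)))
         for p in range(length)]
        for row in weights
    ]


# Hypothetical weights reducing C = 2 channels to C_r = 1.
W_SHARED = [[0.5, -1.0]]

f_mean = shared_transform([[1.0, 2.0], [0.0, 1.0]],
                          [[3.0, 0.0, 1.0], [1.0, 1.0, 0.0]], W_SHARED)
# Parameter sharing: the Max-Stream reuses exactly the same weights.
f_max = shared_transform([[2.0, 4.0], [0.0, 2.0]],
                         [[4.0, 1.0, 2.0], [2.0, 2.0, 0.0]], W_SHARED)
print(f_mean, f_max)
```

Because both streams pass through `W_SHARED`, adding the second pooling path costs no extra convolution weights at this stage, which is the point of the sharing strategy.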

By introducing the dual-path hybrid pooling strategy, the model is structured to perceive low-frequency information such as the contour of workers’ torsos, while aiming to enhance its response capability to high-frequency information such as the texture of safety equipment. This design seeks to mitigate the problem of feature blurring caused by a single pooling operation.

3.3. Adaptive Gated Weight Fusion Mechanism

After constructing the dual-path hybrid pooling feature descriptor, the method of fusing the global semantic information contained in average pooling and the salient texture features extracted by max pooling influences the anti-interference performance of the attention mechanism. Existing feature fusion methods mostly adopt linear superposition or channel concatenation strategies. This static fusion method assumes that features from all sources have equal importance and lacks adaptive discriminative ability for varying input content. To overcome this limitation, adaptively weighting the outputs of multiple pooling operations has been utilized as an alternative to simple concatenation to optimize attention mechanisms [25]. Similar adaptive strategies have also been explored in other domains, such as graph adaptive pooling based on information bottlenecks [26] and complementary fusion networks for complex scene perception [27]. This adaptive capability is especially critical in industrial monitoring scenarios, where target features are frequently coupled with severe background noise (such as dense steel frame textures). In such environments, undifferentiated feature superposition risks amplifying the noise, thereby inducing false detections of targets like safety helmets. Consequently, the AGCA module constructs an Adaptive Gating Unit, which aims to dynamically assign confidence weights to the two feature branches according to the global contextual information of the input feature map.

The unit first performs global average pooling on the original input feature X, compresses the two-dimensional feature map into a global context vector, and then establishes a nonlinear mapping relationship between channels through a lightweight network containing two layers of convolution. Drawing on the context aggregation idea of the Squeeze-and-Excitation (SE) mechanism [28], the feature encoding process can be expressed as:

$$S = W_{2}(\delta(W_{1}(\mathrm{GAP}(X)))) \tag{7}$$

where $\mathrm{GAP}$ represents the global average pooling operation, $W_{1}$ and $W_{2}$ represent the weight parameters of the two convolution layers, respectively, $\delta$ is the ReLU activation function, and S is the generated intermediate feature vector.

To normalize the weights and make the two branches compete with each other, this study refers to the multi-branch feature fusion strategy in the Selective Kernel Network (SKNet) [29]. The model uses the Softmax function to generate the final gated coefficients $\alpha_{mean}$ and $\alpha_{max}$:

$$[\alpha_{mean}, \alpha_{max}] = \mathrm{Softmax}(S) \tag{8}$$

where $\alpha_{mean}$ and $\alpha_{max}$ are the confidence weights corresponding to the average pooling branch and the max pooling branch, respectively, and satisfy $\alpha_{mean} + \alpha_{max} = 1$.

Finally, the AGCA module uses the generated gated coefficients to perform weighted fusion on the dual-path attention maps (denoted as $A_{mean}$ and $A_{max}$, respectively) generated in the previous stage to generate the final enhanced feature map $A_{out}$:

$$A_{out} = \alpha_{mean} \cdot A_{mean} + \alpha_{max} \cdot A_{max} \tag{9}$$

Through the adaptive gating mechanism, the network is structured to perform content-based dynamic feature screening: when the illumination is insufficient in the detection scene and target features are weak, the mechanism is designed to increase $\alpha_{max}$ to prioritize the highlight features of reflective vests; conversely, when there is strong texture interference in the background, it aims to increase $\alpha_{mean}$ to utilize contextual information for suppressing false textures. This nonlinear fusion strategy is formulated to improve the algorithm's robustness in complex industrial environments by avoiding the premature amplification of background noise.
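The gating pipeline of Equations (7)-(9) can be traced with a small numerical sketch. The text does not state whether the gate produces one pair of coefficients per sample or per channel; the code below assumes a single scalar pair, omits biases, and uses hypothetical weights and function names, so it should be read as an illustration and not as the authors' implementation.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(x, w1, w2):
    """x: C x H x W nested lists; w1: C_r x C; w2: 2 x C_r (no biases)."""
    # GAP(X): one scalar per channel.
    gap = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # First layer followed by ReLU.
    hidden = [max(0.0, sum(w * g for w, g in zip(row, gap))) for row in w1]
    # Second layer yields the two logits S of Eq. (7).
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    alpha_mean, alpha_max = softmax(logits)  # Eq. (8)
    return alpha_mean, alpha_max


def fuse(a_mean, a_max, alpha_mean, alpha_max):
    """Eq. (9) applied element-wise to flattened attention maps."""
    return [alpha_mean * m + alpha_max * x for m, x in zip(a_mean, a_max)]


x = [[[1.0, 3.0]], [[0.0, 2.0]]]  # C = 2, H = 1, W = 2
alpha_mean, alpha_max = gate_weights(x, w1=[[1.0, -1.0]], w2=[[1.0], [-1.0]])
a_out = fuse([0.2, 0.8], [1.0, 0.0], alpha_mean, alpha_max)
print(alpha_mean, alpha_max, a_out)
```

Because the coefficients come from a softmax over two logits, they always sum to one, so the fused map is a convex combination of the two attention maps and stays within their element-wise range.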

4. Experiments

4.1. Dataset

To evaluate the performance of the proposed improved YOLOv9 model, this study utilizes two safety equipment detection datasets. The configurations and split ratios of both datasets are summarized in Table 2.

Images in these datasets were captured in various industrial environments, including construction sites, factories, and mining areas. The data collection involved diverse environmental backgrounds and camera perspectives. Additionally, the datasets include real-world scenarios with factors such as object occlusion and overlapping. These datasets are utilized for model training, validation, and testing.

The visualizations in Figure 4 and Figure 5 present the Safety Helmet and Reflective Jacket and Helmet-Vest-Belt datasets, respectively. In subfigure (A), the types and corresponding label information are displayed (specifically, Safety Helmet and Reflective Jacket for Figure 4; Helmet, Vest, and Belt for Figure 5). Subfigure (B) illustrates the dimensions of the label boxes, while subfigure (C) shows the distribution of center-point locations. Subfigure (D) provides information on the distribution of object sizes, and subfigure (E) assigns details to the labels.

4.2. Experimental Environment

This study is based on the open-source machine learning framework PyTorch 2.5.1, using a 64-bit Linux operating system, a 12-core Intel Xeon Platinum 8352V CPU @2.10 GHz, and an NVIDIA vGPU with 32 GB of VRAM. CUDA 12.1 is utilized for GPU acceleration. The Python version used is 3.12. Regarding the experimental settings, the hyperparameters of the proposed AGCA-YOLOv9 and the compared baseline models were primarily initialized using the standard configurations recommended by the YOLOv9 architecture. To adapt to our specific dataset, an empirical fine-tuning method was applied. The detailed training and inference parameters are set as shown in Table 3.

4.3. Evaluation Metrics

Six commonly used object detection evaluation metrics are adopted to evaluate model performance: Precision (P), Recall (R), mean Average Precision at an Intersection over Union (IoU) threshold of 0.5 (mAP@0.5), mean Average Precision over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95), number of parameters (Params), and giga floating-point operations (GFLOPs).

Among the above metrics, Precision (P) is the proportion of correctly predicted samples among all samples predicted as positive, calculated as in Formula (10); it measures the accuracy of the model's positive predictions, and the closer the value is to 1, the better. Recall (R) is the proportion of correctly predicted samples among all real positive samples, calculated as in Formula (11); it reflects how rarely the model misses targets, and the closer the value is to 1, the better.

$$P = \frac{TP}{TP + FP} \tag{10}$$

$$R = \frac{TP}{TP + FN} \tag{11}$$

where $TP$ is the number of correctly identified positive samples; $FP$ is the number of negative samples incorrectly judged as positive samples; and $FN$ is the number of positive samples misjudged as negative samples.
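Formulas (10) and (11) translate directly into code. The counts in the example are hypothetical and serve only to show the calculation; returning 0.0 for an empty denominator is a convention assumed here, not something specified in the text.

```python
def precision_recall(tp, fp, fn):
    """Compute precision (Formula 10) and recall (Formula 11) from counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 misses.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)
```

With these counts the detector is right in 90% of its predictions but finds only 75% of the true targets, which shows why the two metrics are reported together.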

mAP@0.5 is used to evaluate the comprehensive detection performance of the model under a loose threshold; mAP@0.5:0.95 is used to evaluate the comprehensive detection performance of the model under a strict threshold. The larger the mAP value, the better the overall performance of the detection model. The Average Precision (AP) of each category is calculated first, and the AP values of all categories are then averaged. The calculation method is as follows:

$$AP = \int_{0}^{1} P(R) \, dR \tag{12}$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \tag{13}$$

where $P(R)$ represents the precision-recall (PR) curve function with R as the abscissa and P as the ordinate; N is the total number of target categories in the detection task ($N = 2$ in this study).
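In practice the integral in Formula (12) is evaluated numerically from a finite set of PR points. The text does not say which interpolation is used, so the sketch below assumes the common all-point interpolation with a monotonically non-increasing precision envelope; the PR points and per-class AP values are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the PR curve; recalls must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Envelope: precision at recall r becomes the best precision at any
    # recall >= r, which removes the zigzag of the raw curve.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Formula (13): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


ap = average_precision(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
m_ap = mean_average_precision([ap, 0.25])
print(ap, m_ap)
```

For mAP@0.5:0.95, the same computation is typically repeated at IoU thresholds from 0.5 to 0.95 (commonly in steps of 0.05) and the resulting values are averaged.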

Params and GFLOPs are lightweight evaluation metrics of the model. Among them, Params is used to measure the model complexity and its memory occupation of the device; GFLOPs is the number of billions of floating-point operations that the model needs to perform during inference. The smaller the value, the lower the model’s demand for computing resources, which is more conducive to deployment in industrial sites.

4.4. Ablation Experiments

In object detection, ablation experiments are often used to evaluate the effectiveness and interpretability of individual module improvements. While the generalization capability of the model is validated across multiple datasets in the main experiments, the ablation studies in this section are conducted in a controlled, single-dataset setting using the Safety Helmet and Reflective Jacket dataset. This isolates the performance impact of internal structural changes and avoids redundant multi-dataset computation during the mechanism verification phase. All experiments share the same experimental environment and take YOLOv9s as the baseline model (Experiment 0). The key components inside the AGCA module are denoted A, B, and C, where A represents the average pooling branch, B represents the max pooling branch, and C represents the adaptive gating mechanism. Based on these definitions, this study successively constructs a single-branch model with only A and C introduced, a linear superposition model with A and B introduced, and the final complete AGCA-YOLOv9 model containing A, B, and C. The complete ablation experiment combinations and performance evaluation results are shown in Table 4.

Combining the quantitative data from Table 4 and the qualitative visualization results from Figure 6, the impact of the dual-path hybrid pooling strategy introduced in the AGCA module can be evaluated. Comparing Experiment 1 and Experiment 3, the integration of the max-pooling branch raises the model's precision from 0.934 to 0.937. Concurrently, the mAP@0.5:0.95, a metric reflecting high-precision localization capability, improves by 0.2 percentage points. The heatmap comparisons in Figure 6 are consistent with these findings: Experiment 1 exhibits a relatively weak response to minute or edge features, whereas the high-response regions in Experiment 3 are noticeably sharper. This suggests that while the average pooling branch (A) maintains the global posture context, the max-pooling branch (B) compensates for high-frequency information, capturing the salient texture features of reflective vests and safety helmets.

Regarding the feature fusion methodology, a comparison between Experiment 2 and Experiment 3 reveals that removing the adaptive gating mechanism in favor of linear superposition yields a recall of 0.929, but causes precision to drop to 0.926. This indicates that simply superimposing the contextual features (from A) and the high-frequency features (from B) without differentiation introduces background noise interference. The heatmap comparisons in Figure 6 are consistent with this: the model in Experiment 2 suffers from attention dispersion in certain complex scenarios, inadvertently highlighting irrelevant background regions. Conversely, with the introduction of the adaptive gating mechanism in Experiment 3, the precision improves by 1.1 percentage points (from 0.926 to 0.937). Furthermore, the heatmaps illustrate that the model's high-response areas converge more effectively on core target regions, such as reflective vests. This suggests that the adaptive gating unit (C) acts as a mediator: by dynamically allocating weights, it prevents the high-frequency noise potentially captured by B from overwhelming the global context provided by A, facilitating the suppression of complex industrial background noise.

Moreover, the ablation study results reveal the underlying synergy of the complete AGCA module: the dual-path pooling (A and B) generates a comprehensive feature descriptor containing both contextual and edge information, while the gating mechanism (C) acts as a content-aware filter that dynamically balances these two complementary signal streams. It is this coordinated interaction, rather than the isolated addition of individual modules, that prevents feature submergence and enables the final model to achieve gains of 0.6 and 0.4 percentage points in mAP@50:95 and Recall (R), respectively, compared to the baseline YOLOv9s. These findings support the structural logic and effectiveness of the proposed algorithm in complex industrial scenarios.

4.5. Comparative Experiments

To validate the effectiveness of the proposed enhancements, AGCA-YOLOv9 is evaluated against mainstream lightweight object detection models (YOLOv5s, YOLOv8s, YOLOv10s, and YOLOv11s) and the baseline YOLOv9s. The performance of these algorithms is analyzed using the standard evaluation metrics. A detailed quantitative comparison of these models is summarized in Table 5.

According to Table 5, on Dataset 1 the precision of the improved model is 93.7%, which is 1.9, 0.9, 1.2, and 0.9 percentage points higher than that of YOLOv5s, YOLOv8s, YOLOv10s, and YOLOv11s, respectively, and 0.1 percentage points lower than that of YOLOv9s. The mAP@0.5 is 96.7%, which is 1.1, 1.2, 1.0, and 0.6 percentage points higher than that of YOLOv5s, YOLOv8s, YOLOv10s, and YOLOv11s, respectively, and 0.2 percentage points lower than that of YOLOv9s. The mAP@0.5:0.95, which reflects the high-precision localization capability, is 80.9%, which is 2.3, 1.2, 0.6, 0.7, and 0.7 percentage points higher than that of YOLOv5s, YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s, respectively.

Similarly, on Dataset 2, the precision of the improved model is 89.4%, which is 2.9, 1.1, and 0.3 percentage points higher than that of YOLOv8s, YOLOv10s, and YOLOv11s, respectively, and 0.9 and 0.1 percentage points lower than that of YOLOv5s and YOLOv9s, respectively. The mAP@0.5 is 87.6%, which is 1.4, 1.0, 0.7, 2.5, and 1.8 percentage points higher than that of YOLOv5s, YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s, respectively. The mAP@0.5:0.95 is 60.5%, which is 5.0, 2.3, 1.5, 2.5, and 1.9 percentage points higher than that of YOLOv5s, YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s, respectively. Overall, the improved model achieves the highest mAP@0.5:0.95 among the six models on both datasets, while its precision stays within 0.1 percentage points of the YOLOv9s baseline, indicating more accurate localization of safety gear.
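The percentage-point gaps quoted for Dataset 1 can be reproduced directly from the rows of Table 5. The short Python sketch below does this; the dictionary layout and the helper name `gaps_vs` are illustrative choices, and only the numbers are taken from the table.

```python
# Dataset 1 rows of Table 5: (P, R, mAP@0.5, mAP@0.5:0.95), all in percent.
DATASET_1 = {
    "YOLOv5s": (91.8, 92.0, 95.6, 78.6),
    "YOLOv8s": (92.8, 91.1, 95.5, 79.7),
    "YOLOv9s": (93.8, 91.5, 96.9, 80.3),
    "YOLOv10s": (92.5, 91.0, 95.7, 80.2),
    "YOLOv11s": (92.8, 91.9, 96.1, 80.2),
    "AGCA-YOLOv9": (93.7, 91.9, 96.7, 80.9),
}
METRICS = ("P", "R", "mAP@0.5", "mAP@0.5:0.95")


def gaps_vs(table, ours="AGCA-YOLOv9"):
    """Return ours-minus-other gaps in percentage points, rounded to 0.1."""
    reference = table[ours]
    return {
        name: {
            metric: round(reference[i] - row[i], 1)
            for i, metric in enumerate(METRICS)
        }
        for name, row in table.items()
        if name != ours
    }


gaps = gaps_vs(DATASET_1)
for name, gap in gaps.items():
    print(f"{name:9s}", gap)
```

A positive value means AGCA-YOLOv9 scores higher than the compared model on that metric; a negative value means it scores lower.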

In terms of lightweight design, the improved model has 9.77 M parameters, which is more than YOLOv5s (7.03 M), YOLOv10s (8.06 M), and YOLOv11s (9.42 M), but 12.3% fewer than YOLOv8s (11.14 M). Compared to the baseline YOLOv9s (9.74 M parameters and 39.6 GFLOPs), the improved model adds only 0.03 M parameters and keeps the overall computation at approximately 39.6 GFLOPs. Although this GFLOPs value is higher than that of YOLOv5s, YOLOv8s, YOLOv10s, and YOLOv11s, the computational overhead introduced by the AGCA module is negligible relative to its baseline; the measured latency in Table 5 rises from 4.1–4.2 ms to 4.6 ms. Therefore, the improved model retains the lightweight profile of its baseline while achieving better detection results, which favors deployment in environments with limited computing power at industrial sites and supports safety gear detection under complex backgrounds.
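The overhead figures in this paragraph follow from the per-layer increments in Table 1 and the totals in Table 5. The Python sketch below reproduces the arithmetic; the variable names are illustrative, and all numbers are copied from the two tables.

```python
# Per-layer AGCA increments from Table 1 (layers 4, 6, and 8).
AGCA_PARAM_INCREMENTS = {4: 5492, 6: 9748, 8: 15028}
AGCA_GFLOP_INCREMENTS = {4: 0.0063, 6: 0.0029, 8: 0.0013}

# Totals from Table 5.
BASELINE_PARAMS_M = 9.74   # YOLOv9s, millions of parameters
BASELINE_GFLOPS = 39.6     # YOLOv9s
AGCA_PARAMS_M = 9.77       # AGCA-YOLOv9
YOLOV8S_PARAMS_M = 11.14   # YOLOv8s

extra_params = sum(AGCA_PARAM_INCREMENTS.values())
extra_gflops = sum(AGCA_GFLOP_INCREMENTS.values())

# Baseline plus increments should reproduce the reported model size.
total_params_m = BASELINE_PARAMS_M + extra_params / 1e6

# Relative growth over the baseline, in percent.
param_growth_pct = 100.0 * extra_params / (BASELINE_PARAMS_M * 1e6)
gflop_growth_pct = 100.0 * extra_gflops / BASELINE_GFLOPS

# Parameter saving relative to YOLOv8s, in percent.
saving_vs_v8s_pct = 100.0 * (YOLOV8S_PARAMS_M - AGCA_PARAMS_M) / YOLOV8S_PARAMS_M

print(f"extra parameters: {extra_params} ({param_growth_pct:.2f}% of baseline)")
print(f"extra GFLOPs:     {extra_gflops:.4f} ({gflop_growth_pct:.3f}% of baseline)")
print(f"total parameters: {total_params_m:.2f} M")
print(f"saving vs YOLOv8s: {saving_vs_v8s_pct:.1f}%")
```

The three AGCA insertions together add about 0.3% to the baseline parameter count and well under 0.1% to its GFLOPs, which is why the GFLOPs column in Table 5 is unchanged at one decimal place.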

Combining the results of the ablation and comparative experiments, the relationship between the performance improvement and the module design can be analyzed. The introduction of the AGCA module mainly improves the comprehensive localization accuracy (mAP@0.5:0.95) while keeping precision within 0.1 percentage points of the baseline, which stems from the collaborative design of its dual-path hybrid pooling and adaptive gating mechanism. The parallel processing of max pooling and average pooling enables the model to extract the texture features of safety helmets and the salient contours of reflective vests in a targeted manner, so that they are not masked by complex industrial backgrounds. At the same time, the adaptive gating unit suppresses metal highlight interference in the background through dynamic weight allocation. Together, these two components reduce false detections.

In addition to the comprehensive comparison above, to evaluate the stability and reliability of the proposed AGCA-YOLOv9, we conducted repeated experiments across multiple runs. Specifically, both the baseline YOLOv9s and the proposed model were trained and evaluated three independent times under identical hardware settings on both the Safety Helmet and Reflective Jacket dataset and the Helmet-Vest-Belt dataset. The statistical variations are reported using the mean and standard deviation (Mean ± Std), as presented in Table 6. The results indicate that AGCA-YOLOv9 achieves higher mAP@0.5:0.95 scores than the baseline across both datasets. Furthermore, both models exhibit standard deviations below 0.35%, indicating that the integration of the AGCA module improves model accuracy without compromising training stability or introducing significant variance.
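The two statements drawn from Table 6 (all standard deviations below 0.35%, and a higher mean mAP@0.5:0.95 for AGCA-YOLOv9 on both datasets) can be checked with a few lines of Python. The nested dictionary is an illustrative layout; the (mean, std) pairs are copied from the table.

```python
# (mean, std) pairs from Table 6; all values are percentages.
TABLE_6 = {
    "Safety Helmet and Reflective Jacket": {
        "YOLOv9s": {"mAP@0.5": (96.93, 0.06), "mAP@0.5:0.95": (80.37, 0.06)},
        "AGCA-YOLOv9": {"mAP@0.5": (96.77, 0.12), "mAP@0.5:0.95": (80.80, 0.10)},
    },
    "Helmet-Vest-Belt": {
        "YOLOv9s": {"mAP@0.5": (87.07, 0.29), "mAP@0.5:0.95": (59.13, 0.15)},
        "AGCA-YOLOv9": {"mAP@0.5": (87.53, 0.31), "mAP@0.5:0.95": (60.37, 0.12)},
    },
}

# Largest standard deviation over every dataset, model, and metric.
largest_std = max(
    std
    for models in TABLE_6.values()
    for metrics in models.values()
    for _, std in metrics.values()
)

# Gain in mean mAP@0.5:0.95 of AGCA-YOLOv9 over the baseline, per dataset.
mean_gain = {
    dataset: round(
        models["AGCA-YOLOv9"]["mAP@0.5:0.95"][0]
        - models["YOLOv9s"]["mAP@0.5:0.95"][0],
        2,
    )
    for dataset, models in TABLE_6.items()
}

print("largest std:", largest_std)
for dataset, gain in mean_gain.items():
    print(f"{dataset}: +{gain} percentage points")
```

On both datasets the gain in the mean exceeds the reported standard deviations of either model, which is consistent with the improvement not being an artifact of run-to-run variance across the three runs.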

4.6. Limitations and Future Work

Although the proposed AGCA-YOLOv9 architecture has improved the detection accuracy for industrial safety equipment, several limitations must be addressed prior to large-scale practical deployment. First, regarding dataset diversity, the current samples exhibit potential scene biases. The data primarily consists of regular daylight conditions, lacking representation of extreme meteorological variations (such as heavy rain, snow, or dense fog) and ultra-dense scenarios involving hundreds of workers simultaneously. These factors could compromise the model’s feature extraction capabilities in highly complex and dynamic real-world environments. Second, although the integration of the AGCA module introduces negligible computational overhead, the overall inference time and hardware requirements of the model could still pose challenges for deployment on ultra-low-power embedded edge platforms.

Addressing these issues is particularly critical because deploying models prone to missed detections or false alarms in high-risk industrial environments poses severe safety hazards. For instance, failing to detect a worker operating without a safety helmet due to severe occlusion or poor visibility could lead to delayed safety interventions, directly risking fatal injuries.

To overcome these limitations, future research will focus on the following directions. On the data front, we plan to systematically collect diverse scenario data and develop robust augmentation strategies tailored to extreme weather and high-density industrial sites [30]. On the application front, the detection categories will be expanded to include finer-grained personal protective equipment (PPE), such as safety goggles and insulated gloves [31]. Furthermore, regarding lightweight optimization, techniques such as model quantization and knowledge distillation will be explored to reduce the computational and memory demands of the model while maintaining high detection accuracy [32]. Together, these directions target the limitations identified above and are intended to broaden the model’s applicability in real-world industrial safety monitoring tasks.

5. Conclusions

This study proposes AGCA-YOLOv9, a lightweight object detection algorithm designed to address background interference and small target feature loss in industrial safety gear detection. By deeply integrating the Adaptive Gated Coordinate Attention (AGCA) module into the GELAN architecture, the model leverages dual-path hybrid pooling and adaptive gating to improve multi-scale feature fusion.

Experimental evaluations confirm that AGCA-YOLOv9 improves high-precision localization performance (mAP@0.5:0.95) while maintaining precision and recall comparable to the baseline YOLOv9s. Furthermore, comparative analyses show that the proposed method achieves the highest mAP@0.5:0.95 among the compared mainstream models, balancing detection accuracy and parameter efficiency. These metric improvements are experimentally demonstrated; the expected practical benefit is a reduction in missed and false detection rates for multi-scale targets in complex, real-world industrial scenarios. Considering its computational load of 39.6 GFLOPs, the model is theoretically estimated to achieve an inference speed of approximately 30 FPS on mainstream industrial edge-computing devices (e.g., NVIDIA Jetson Xavier NX) with TensorRT acceleration. Such performance aligns with the real-time processing demands of video-based safety compliance monitoring, suggesting its potential for practical application in industrial construction scenarios.
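For readers who want to relate the latency column of Table 5 to the frame-rate figure above, the following Python sketch converts between the two units. It is only a unit conversion under stated assumptions: the 4.6 ms value is the single-image inference time measured on the experimental hardware, the 30 FPS value is the theoretical edge-device estimate quoted above, and the helper names are illustrative. The sketch does not predict performance on any specific device.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second if each frame takes latency_ms to process."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def frame_budget_ms(target_fps: float) -> float:
    """Per-frame time budget, in milliseconds, for a target frame rate."""
    if target_fps <= 0:
        raise ValueError("target FPS must be positive")
    return 1000.0 / target_fps


AGCA_LATENCY_MS = 4.6  # Table 5, AGCA-YOLOv9, experimental hardware
TARGET_FPS = 30.0      # edge-device estimate quoted in the text

measured_fps = fps_from_latency_ms(AGCA_LATENCY_MS)
budget_ms = frame_budget_ms(TARGET_FPS)

# How many times slower than the experimental hardware a device could be
# while still meeting the target frame rate (inference time only).
slowdown_allowed = budget_ms / AGCA_LATENCY_MS

print(f"throughput at 4.6 ms per image: {measured_fps:.1f} FPS")
print(f"budget per frame at 30 FPS:     {budget_ms:.1f} ms")
print(f"tolerable slowdown factor:      {slowdown_allowed:.2f}x")
```

The comparison covers model inference time only; any additional per-frame cost in a deployed pipeline would reduce the available margin.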

The current validation is limited by the sample-collection scenarios of the utilized dataset, which predominantly consist of workshop and construction site environments under conventional illumination. Future research will expand the dataset to include extreme environmental conditions (such as night, rain, fog, and strong backlight) to systematically verify and optimize the model’s generalization capability. Additionally, subsequent efforts will focus on deploying the algorithm on embedded edge devices to advance its practical application in the safety monitoring systems of smart factories.

Author Contributions

Conceptualization, C.L. and W.Z.; methodology, C.L.; software, C.L.; validation, C.L., W.Z. and Y.L.; formal analysis, C.L.; investigation, C.L.; resources, Y.S.; data curation, Y.L.; writing—original draft preparation, C.L.; writing—review and editing, W.Z. and X.Z.; visualization, Y.L.; supervision, X.Z.; project administration, Y.S.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program Project, grant number 2024YFC3307901.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from Kaggle and are available at https://www.kaggle.com/datasets/niravnaik/safety-helmet-and-reflective-jacket (accessed on 14 May 2026).

Acknowledgments

We thank the editors and the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Huang, L.; Fu, Q.; He, M.; Jiang, D.; Hao, Z. Detection algorithm of safety helmet wearing based on deep learning. Concurr. Comput. Pract. Exp. 2021, 33, e6234.
2. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
3. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
4. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
5. Chen, Y.; Wang, H.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Scale-aware domain adaptive faster r-cnn. Int. J. Comput. Vis. 2021, 129, 2223–2243.
6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
7. Han, K.; Zhang, T.; Peng, B.; Zhong, L.; Wu, S. Safety helmet detection algorithm based on improved YOLOv5. Mod. Electron. Tech. 2024, 47, 85–92. (In Chinese)
8. Hou, G.; Chen, Q.; Yang, Z.; Zhang, Y.; Zhang, D.; Li, H. Helmet detection method based on improved YOLOv5. Chin. J. Eng. 2024, 46, 329–342. (In Chinese)
9. Xiao, Z.; Yan, S.; Qu, H. Safety helmet detection method in complex environments based on multi-mechanism optimized YOLOv8. Comput. Eng. Appl. 2024, 60, 172–182. (In Chinese)
10. Lei, Y.; Zhu, W.; Liao, H. Improved YOLOv8n safety helmet wearing detection algorithm in complex scenes. Softw. Eng. 2023, 26, 46–51. (In Chinese)
11. Han, B.; Zhang, J.; Lu, Z. FEV-YOLOv8n: Lightweight safety helmet wearing detection method. Comput. Meas. Control 2025, 33, 69–77. (In Chinese)
12. Hu, L.; Ren, J. YOLO-LHD: An enhanced lightweight approach for helmet wearing detection in industrial environments. Front. Built Environ. 2023, 9, 1288445.
13. Han, G.; Zhu, M.; Zhao, X.; Gao, H. Method based on the cross-layer attention mechanism and multiscale perception for safety helmet-wearing detection. Comput. Electr. Eng. 2021, 95, 107458.
14. Xiao, J.; Guo, H.; Yao, Y.; Zhang, S.; Zhou, J.; Jiang, Z. Multi-scale object detection with the pixel attention mechanism in a complex background. Remote Sens. 2022, 14, 3969.
15. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21.
16. Pan, H.; Wei, Z.; Lei, X.; Yao, C.; Jiang, Z.; Zhang, L. CoordEF-YOLOv9t-based personnel behavior recognition in underground coal mines. Ind. Mine Autom. 2025, 51, 59–66. (In Chinese)
17. Liu, W.; Zhang, D. Pavement distress detection model based on improved YOLOv9. China Meas. Test 2025, 51, 19–29. (In Chinese)
18. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
19. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H. Designing network design strategies through gradient path analysis. arXiv 2022, arXiv:2211.04800.
20. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
21. Zhang, X.; Wang, H.; Zhang, Y. LRM-YOLO: A lightweight safety helmet wearing detection method for industrial scenes. J. Saf. Environ. 2026, 26, 151–159. (In Chinese)
22. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742.
23. Gupta, S.; Tripathi, A.K. Flora-NET: Integrating dual coordinate attention with adaptive kernel based convolution network for medicinal flower identification. Comput. Electron. Agric. 2025, 230, 109834.
24. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
25. Li, W.; Liu, K.; Zhang, L.; Cheng, F. Object detection based on an adaptive attention mechanism. Sci. Rep. 2020, 10, 11307.
26. Cao, Z.; Xu, L.; Zhang, R.; Zhang, J.; Pei, H.; Zhou, D.; Qiu, J. ADP: Graph Adaptive Pooling based on Edge Understanding with Graph Pooling Information Bottleneck. IEEE Trans. Consum. Electron. 2025, 72, 692–704.
27. Zhang, J.; Zhang, R.; Cao, Z.; Xu, L.; Chen, X.; Xu, M. It Takes Two: Multi-frequency Perception with Complementary Fusion Network for Complex Scene Segmentation. IEEE Trans. Circuits Syst. Video Technol. 2025, 36, 5288–5300.
28. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
29. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–21 June 2019; pp. 510–519.
30. Gupta, H.; Kotlyar, O.; Andreasson, H.; Lilienthal, A.J. Robust object detection in challenging weather conditions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 7523–7532.
31. Malaikrisanachalee, S.; Wongwai, N.; Kowcharoen, E. ESPCN-YOLO: A high-accuracy framework for personal protective equipment detection under low-light and small object conditions. Buildings 2025, 15, 1609.
32. Setyanto, A.; Sasongko, T.B.; Fikri, M.A.; Ariatmanto, D.; Agastya, I.M.A.; Rachmanto, R.D.; Ardana, A.; Kim, I.K. Knowledge distillation in object detection for resource-constrained edge computing. IEEE Access 2025, 13, 18200–18214.

Figure 1. Network architecture of AGCA-YOLOv9.

Figure 2. Architecture of RepNCSP-AGCA.

Figure 3. Architecture of AGCA.

Figure 4. Safety Helmet and Reflective Jacket dataset information visualization.

Figure 5. Helmet-Vest-Belt dataset information visualization.

Figure 6. Grad-CAM visualization results of the baseline YOLOv9s and proposed models with various AGCA components.

Table 1. Architectural details and computational overhead increment of AGCA-YOLOv9.

Layer	Module/Component	Input Tensor ( $C \times H \times W$ )	Output Tensor ( $C \times H \times W$ )	Kernel Size/Stride	Params Inc.	GFLOPs Inc.
0–3	Stem & ELAN-1	$3 \times 640 \times 640$	$128 \times 80 \times 80$	Various	–	–
4	RepNCSP_AGCA	$128 \times 80 \times 80$	$128 \times 80 \times 80$	$1 \times 1$ , $3 \times 3$ /1	+5492	+0.0063
5	AConv (Downsample P4)	$128 \times 80 \times 80$	$192 \times 40 \times 40$	$3 \times 3$ /2	–	–
6	RepNCSP_AGCA	$192 \times 40 \times 40$	$192 \times 40 \times 40$	$1 \times 1$ , $3 \times 3$ /1	+9748	+0.0029
7	AConv (Downsample P5)	$192 \times 40 \times 40$	$256 \times 20 \times 20$	$3 \times 3$ /2	–	–
8	RepNCSP_AGCA	$256 \times 20 \times 20$	$256 \times 20 \times 20$	$1 \times 1$ , $3 \times 3$ /1	+15,028	+0.0013
9–29	Neck & Detect Head	$256 \times 20 \times 20$	Various	Various	–	–
Total	AGCA-YOLOv9 (Ours)	–	–	–	+30,268	+0.0105

Table 2. Details and partition of the datasets used in this study.

Attribute	Dataset 1	Dataset 2
Dataset Name	Safety Helmet and Reflective Jacket	Helmet-Vest-Belt
Total Images	10,500	9270
Classes	Safety Helmet, Reflective Jacket	Helmet, Vest, Belt
Train Set	7350 (70%)	7075 (76%)
Valid Set	1575 (15%)	1537 (17%)
Test Set	1575 (15%)	658 (7%)
Source (Ver.)	Kaggle (v1)	Roboflow (v1)
Dataset ID	niravnaik/safety-helmet-and-reflective-jacket	safety-detection-ftkxk/helmet-vest-belt
URL	https://www.kaggle.com/datasets/niravnaik/safety-helmet-and-reflective-jacket (accessed on 14 May 2026)	https://universe.roboflow.com/safety-detection-ftkxk/helmet-vest-belt/dataset/1 (accessed on 14 May 2026)

Table 3. Training parameter settings.

Parameter	Value
epochs	150
batch_size	16
imgsz	640
optimizer	SGD
initial_lr ( $l r_{0}$ )	0.01
momentum	0.937
weight_decay	0.00075
warmup_epochs	3.0
warmup_momentum	0.8
box	7.5
cls	0.5
dfl	1.2
close_mosaic	15
mixup	0.15
copy_paste	0.3
conf_thres	0.001
iou_thres	0.7

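Table 4. Results of ablation experiments (A: average pooling branch; B: max-pooling branch; C: adaptive gating unit; metrics are reported as fractions).

Experiment	A	B	C	P	R	mAP@50	mAP@50:95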
0	×	×	×	0.938	0.915	0.969	0.803
1	✓	×	✓	0.934	0.922	0.968	0.807
2	✓	✓	×	0.926	0.929	0.968	0.807
3	✓	✓	✓	0.937	0.919	0.967	0.809

Table 5. Comparison of evaluation indicators between improved model and classical models.

Algorithm Model	Params/ $10^{6}$	GFLOPs	Latency/ms	P/%	R/%	mAP@0.5/%	mAP@0.5:0.95/%
Dataset 1: Safety Helmet and Reflective Jacket
YOLOv5s	7.03	16.0	1.6	91.8	92.0	95.6	78.6
YOLOv8s	11.14	28.6	2.3	92.8	91.1	95.5	79.7
YOLOv9s	9.74	39.6	4.1	93.8	91.5	96.9	80.3
YOLOv10s	8.06	24.8	2.1	92.5	91.0	95.7	80.2
YOLOv11s	9.42	21.6	1.8	92.8	91.9	96.1	80.2
AGCA-YOLOv9	9.77	39.6	4.6	93.7	91.9	96.7	80.9
Dataset 2: Helmet-Vest-Belt
YOLOv5s	7.03	16.0	1.4	90.3	78.8	86.2	55.5
YOLOv8s	11.14	28.6	1.8	86.5	82.5	86.6	58.2
YOLOv9s	9.74	39.6	4.2	89.5	81.6	86.9	59.0
YOLOv10s	8.06	24.8	2.2	88.3	79.7	85.1	58.0
YOLOv11s	9.42	21.6	1.9	89.1	80.5	85.8	58.6
AGCA-YOLOv9	9.77	39.6	4.6	89.4	81.1	87.6	60.5

Table 6. Stability and reliability evaluation of the proposed model on different datasets.

Dataset	Model	mAP@0.5 (%)	mAP@0.5:0.95 (%)
Safety Helmet and Reflective Jacket	YOLOv9s (Baseline)	96.93 ± 0.06	80.37 ± 0.06
Safety Helmet and Reflective Jacket	AGCA-YOLOv9 (Ours)	96.77 ± 0.12	80.80 ± 0.10
Helmet-Vest-Belt	YOLOv9s (Baseline)	87.07 ± 0.29	59.13 ± 0.15
Helmet-Vest-Belt	AGCA-YOLOv9 (Ours)	87.53 ± 0.31	60.37 ± 0.12

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lv, C.; Zhou, W.; Li, Y.; Song, Y.; Zhang, X. Research on a Lightweight YOLOv9 Object Detection Algorithm Fused with Adaptive Gated Coordinate Attention. Mathematics 2026, 14, 1738. https://doi.org/10.3390/math14101738
