Abstract
The You Only Look Once (YOLO) series of models, particularly the recently introduced YOLOv12, has demonstrated significant potential for accurate and rapid recognition of electric power operation violations, owing to its combined strengths in detection accuracy and real-time inference. However, current YOLO models still have three limitations: (1) the absence of a dedicated feature extraction mechanism for multi-scale objects, resulting in suboptimal detection of objects with varying sizes; (2) naive integration of spatial and channel attention, which restricts the enhancement of feature discriminability and consequently impairs detection of challenging objects in complex backgrounds; and (3) weak representation capability in low-level features, leading to insufficient accuracy for small objects. To address these limitations, this paper proposes DFA-YOLO, a real-time object detection model that uses YOLOv12n as its baseline and makes three key contributions. First, a dynamic weighted multi-scale convolution (DWMConv) module is proposed to address the first limitation; it employs lightweight multi-scale convolution followed by learnable weighted fusion to enhance feature representation for multi-scale objects. Second, a full-dimensional attention (FDA) module is proposed to address the second limitation; it provides a unified attention computation scheme that effectively integrates attention across the height, width, and channel dimensions, thereby improving feature discriminability. Third, a set of auxiliary detection heads (Aux-Heads) is introduced to address the third limitation; inserted into the backbone network, these heads strengthen the supervisory effect of labels on the low-level feature extraction modules during training. Ablation studies on the EPOVR-v1.0 dataset demonstrate the validity of the proposed DWMConv module, FDA module, Aux-Heads, and their synergistic integration. Relative to the baseline, DFA-YOLO improves mAP@0.5 and mAP@0.5–0.95 by 3.15% and 4.13%, respectively, while reducing parameters by 0.06 M and GFLOPs by 0.06 and increasing FPS by 3.52. Comprehensive quantitative comparisons with nine official YOLO models, including YOLOv13n, confirm that DFA-YOLO achieves superior performance in both detection precision and real-time inference, further validating the effectiveness of the proposed model.
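To make the DWMConv idea concrete, the following is a minimal sketch of a multi-scale convolution block with learnable weighted fusion. The abstract does not specify the module's internals, so the depthwise branches, the kernel sizes (3, 5, 7), the softmax-normalized scalar fusion weights, and the name DWMConvSketch are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch only: kernel sizes, depthwise branches, and
# softmax-normalized fusion weights are assumptions, not the paper's design.
import torch
import torch.nn as nn


class DWMConvSketch(nn.Module):
    """Hypothetical dynamic weighted multi-scale convolution block."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Lightweight multi-scale branch: one depthwise conv per kernel size.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # Learnable fusion weights, one scalar per scale.
        self.scale_logits = nn.Parameter(torch.zeros(len(kernel_sizes)))
        # Pointwise conv to mix channels after fusion.
        self.project = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.scale_logits, dim=0)
        fused = sum(w * branch(x) for w, branch in zip(weights, self.branches))
        return self.project(fused)


if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    print(DWMConvSketch(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```

In such a design, the learnable weights allow training to emphasize whichever receptive field best matches the object scales present in the data, which is one plausible reading of "dynamic weighted" fusion.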