IDP-YOLOV9: Improvement of Object Detection Model in Severe Weather Scenarios from Drone Perspective

Abstract: Despite their proficiency with typical environmental datasets, deep learning-based object detection algorithms struggle when faced with diverse adverse weather conditions. Moreover, existing methods often address single adverse weather scenarios, neglecting situations involving multiple concurrent adverse conditions. To tackle these challenges, we propose an enhanced approach to object detection in power construction sites under various adverse weather conditions, dubbed IDP-YOLOV9. This model leverages a parallel architecture comprising the Image Dehazing and Enhancement Processing (IDP) module and an improved YOLOV9 object detection module. Specifically, for images captured in adverse weather, our approach employs a parallel architecture that includes the Three-Weather Removal Algorithm (TRA) module and the Deep Learning-based Image Enhancement (DLIE) module, which, together, filter multiple weather factors to enhance image quality. Subsequently, we introduce an improved YOLOV9 detection network module that incorporates a three-layer routing attention mechanism for object detection. Experiments demonstrate that the IDP module significantly improves image quality by mitigating the impact of various adverse weather conditions. Compared to traditional single-processing models, our method improves recognition accuracy on complex weather datasets by 6.8% in terms of mean average precision (mAP50).


Introduction
In the field of electric power, especially in complex environments or secluded areas, the utilization of unmanned aerial vehicles (UAVs) for surveillance, evidence collection, and external risk assessment is crucial for ensuring the safe operation of electrical facilities [1]. However, the quality of images acquired by UAVs is frequently compromised by adverse weather conditions, resulting in diminished clarity and subsequently reducing the accuracy and efficacy of detection [2]. This limitation significantly hinders the surveillance capacity of UAVs, potentially delaying the identification of risks [3] or problems and posing a threat to the safety of electrical infrastructure. Consequently, enhancing the object detection performance of UAVs under adverse weather conditions is imperative.
While mainstream detection algorithms such as Faster R-CNN [4], the YOLO series [5][6][7], and CenterNet [8] perform satisfactorily on images captured under normal weather conditions, adverse weather conditions often cause image blurring, insufficient illumination, or the overlapping of weather artifacts with objects [9,10]. Existing image dehazing methods, which utilize multi-scale feature aggregation networks, have shown inefficiency in processing images [11][12][13]. Similarly, rain removal techniques employ window self-attention networks and global residual convolution to produce rain-free images [14,15], while snow removal approaches use restoration algorithms to eliminate the influence of snowflakes [16,17]. These methods primarily address individual weather-related issues and are inadequate for dealing with the complexity of real-world adverse weather effects.
To enhance object detection, the proper supplementation of image data is required. Some studies have attempted to combine image enhancement with object detection using adaptive enhancement modules composed of context branches and edge branches to improve degraded images [18,19]. These methods dynamically adjust enhancement settings based on the illumination distribution in the input images [20][21][22]. However, these approaches often struggle to adapt to various image types, and the adjustment of enhancement parameters heavily relies on manual intervention, leading to suboptimal results. Building on these foundations, some research has integrated image enhancement with object detection [23,24], proposing the Adaptive Enhancement Model for Object Detection Network (ARODNet) [25] to improve detection performance under adverse conditions [26,27]. Nonetheless, these methods do not optimize the structure of the detection network itself, limiting their effectiveness during training.
Existing object detection methods struggle with degraded image quality and reduced effectiveness under adverse weather conditions such as rain, snow, and fog. These limitations stem from their focus on individual weather-related issues and their inability to adapt to the complexities of real-world scenarios, resulting in suboptimal detection performance. To address these challenges in the field of electric power and overcome the limitations of current object detection methods, this paper introduces an improved deep learning-based object detection model called IDP-YOLOV9. This algorithm integrates advanced image processing techniques and enhanced parallel architectures, utilizing the parameter estimation of convolutional neural networks, along with an improved YOLOV9 detection network model equipped with a three-layer routing attention mechanism. The objective is to enhance image processing capabilities under complex weather conditions, thereby improving object detection accuracy.
The primary contributions of this paper are as follows:
• We designed a parallel optimization architecture for the image processing module TRA (Three-Weather Removal Algorithm) and the image enhancement module DLIE (Deep Learning-based Image Enhancement). The TRA module employs a dynamically adjusted correlation graph construction strategy, allowing it to flexibly adapt to feature relationships in different scenes. The DLIE module introduces self-learning parameters, optimizing deep learning methods for image features and enabling adaptive modifications to the image enhancement procedure. This enhancement significantly boosts the model's detection capabilities.
• We propose an improved YOLOV9 detection network, incorporating a three-layer routing attention mechanism. This mechanism captures the features of the restored clear images from the TRA and DLIE modules through joint learning, enhancing the network's ability to detect objects.
• We introduce a comprehensive loss function where the parameters of the routing attention loss function and the IoU loss function are dynamically adjusted based on the weather factor features extracted by the TRA module. This approach allows for the dynamic adjustment of scale factor ratios and blur factors, refining bounding box generation and loss function computation. Consequently, the model better adapts to various weather conditions, achieving more accurate object detection under different weather scenarios.
This paper is organized as follows: Section 2 includes an overview of relevant work on image dehazing, deraining, desnowing, and image enhancement. Section 3 elaborates on the specific improvements proposed in this paper. The experimental results and analysis are presented in Section 4. Finally, in Section 5, we engage in a discussion and summary of the research findings.

Related Work
Deep learning has demonstrated remarkable performance in various tasks such as denoising [28], image inpainting, super-resolution [29,30], deblurring, and style transfer [31]. In the domain of adverse weather restoration, such as defogging, deraining, and desnowing, deep neural networks significantly outperform traditional methods. For instance, in deraining, CNNs capture features from rain-affected images, enabling the learning of the physical characteristics of raindrops and rain streaks [32,33]. Concurrently, CNNs learn paired images of rain-free and rain-degraded conditions [34,35]. However, this method may leave residual blurry regions in images. Other approaches employ GANs (Generative Adversarial Networks) to eliminate raindrops, but this requires obtaining effective attention maps [36,37]. Similarly, to address the issue of removing irregular snowflakes from images, GANs are used to focus on the features of snowflake patterns [38].
Before performing object detection tasks in adverse weather scenarios, preprocessing the images is imperative. One direct approach is to remove weather factors from the images and then apply image enhancement techniques before inputting them into the detection network [39][40][41][42]. However, solely employing this method may not achieve detection accuracy comparable to that under normal weather conditions. Another approach relies on unsupervised priors [43], combining image enhancement and detection while learning feature representations of weather to eliminate interference from weather-specific information [44,45]. For example, Ju et al. [46] proposed a single-image defogging detection framework based on region-line prior [47]. Although improving the quality of adverse weather images is beneficial, it does not necessarily translate to high-precision object detection models [48,49]. Therefore, some methods connect image processing modules and detection modules end-to-end to address this issue, while others utilize domain adaptation techniques [50,51].
To bolster the effectiveness of object detection networks in challenging weather environments, a new object detection layer known as SwinFocus has been introduced [52]. This layer enhances both the feature extraction and representation capabilities of YOLOv5, leading to improved detection accuracy, particularly for small and blurry objects in foggy conditions [53]. Additionally, methods have been proposed to optimize the structure of YOLOv5 by reducing the depth of the feature pyramid and limiting the maximum downsampling factor to better recognize small objects [54,55]. With the emergence of the YOLOv8 detection network, an occlusion-aware attention mechanism has been designed, and variable convolutions have been utilized to enhance the feature extraction capabilities of the YOLOv8 network [56,57].
Despite these significant improvements in object detection performance, notable deficiencies remain in parameter tuning due to the lack of focus on feature scales after image processing enhancement and insufficient modifications to object detection models for image processing and enhancement. Therefore, this study introduces an enhanced object detection methodology tailored to unmanned aerial vehicles (UAVs) navigating through inclement weather conditions, referred to as IDP-YOLOV9.
In this study, we primarily employed a parallel architecture based on image defogging, desnowing, and deraining modules (TRA) and image enhancement modules (DLIE) to process and enhance images captured under various adverse weather conditions at power construction sites by drones. Additionally, we introduced joint training of the improved YOLOV9 module with a three-layer routing attention mechanism to capture features of clear images restored by the TRA and DLIE modules. Finally, object detection was performed on images captured under various adverse weather conditions at power construction sites, thereby enhancing detection accuracy under complex adverse weather conditions.

Proposed Method
Under severe weather conditions, the visibility of images captured by drones is significantly reduced, seriously affecting the accuracy of object detection and posing technical challenges to the risk identification needs of power construction sites. To address this issue, this section details the adverse weather conditions in datasets of power scenes captured by drones and proposes the IDP-YOLOV9 object detection algorithm specifically designed for such conditions. The algorithm aims to reveal more latent information within images by eliminating the interference of weather factors.
The entire network framework consists of the Three-Weather Removal Algorithm (TRA) module, the adaptive Deep Learning-based Image Enhancement (DLIE) module, and the improved YOLOV9 detection module. The TRA module initially estimates atmospheric scattering models and parameters related to rain and snow to obtain preliminary dehazed and derained images. Subsequently, the model rescales images to dimensions of 256 × 256 before feeding them into the DLIE module, which optimizes parameters to enhance the quality of dehazed, derained, and desnowed images. These weakly supervised enhanced images are then employed for object detection. Finally, the images enhanced by the DLIE module are passed as inputs to the YOLOV9 detector, leading to improved detection accuracy under adverse weather conditions.

The Framework of IDP-YOLOV9
This study optimized the original network architecture to address the issues of error accumulation in the parameter estimation process, which can lead to incomplete image processing and distortion. To overcome these challenges, we employed a parallel architecture that combined the TRA image dehazing, deraining, and desnowing modules with the adaptive image enhancement module (DLIE). The resulting features were then integrated with the YOLOV9 detection network, creating an end-to-end object detection algorithm suitable for power scene images captured by drones under adverse weather conditions. The DLIE module consisted of five adaptable image enhancement parameters, essentially functioning as pixel-level filters. These parameters included the following: white balance (WB), which eliminated color deviations caused by atmospheric light; gamma correction, which restored details in darker areas; contrast enhancement, which improved overall visibility in regions affected by heavy fog, raindrops, or snowflakes; hue adjustment, which emphasized the overall atmosphere or produced specific effects; and image sharpening, which effectively enhanced the visual clarity of dehazed, derained, and desnowed images.
Figure 1 illustrates the IDP-YOLOV9 network architecture. This comprehensive approach ensures more accurate and reliable object detection in power scene images captured under adverse weather conditions.

TRA Module
Due to the difficulty in obtaining paired images under various adverse weather conditions and normal conditions at the same construction site, this study addressed this challenge by artificially synthesizing construction site images with fog, rain, and snow. The influence of rain and snow was also considered. Raindrops and snowflakes scatter and absorb light, thereby reducing image quality. These extensions enhanced the mathematical model, making it more inclusive when considering the image formation process under various weather conditions. Specifically, in the image restoration module, the parameters obtained from the estimation part were utilized to generate the final restored image through element-wise addition and multiplication layers. This process can be regarded as achieving high-quality image dehazing, deraining, and desnowing through joint learning based on the modeling parameters of fog, rain, and snow.
The specific design of the dehazing, deraining, and desnowing filters is described below. The formation of blurry images in foggy weather can be expressed as follows:
$$I_1(x, \lambda) = K(x, \lambda)\, e^{-\beta(\lambda) d(x)} + A \left(1 - e^{-\beta(\lambda) d(x)}\right) \tag{1}$$
where $I_1(x, \lambda)$ represents the foggy image, $x$ is the position of pixels in the image, $\lambda$ is the wavelength of light, $K(x, \lambda)$ denotes the scene radiance (clean image), $A$ is the global atmospheric light, and $e^{-\beta(\lambda) d(x)}$ is the medium transmission map, where $\beta(\lambda)$ represents the atmospheric scattering coefficient and $d(x)$ is the scene depth. To restore the clean image $K(x, \lambda)$, it was crucial to obtain the atmospheric light $A$ and the transmission map $e^{-\beta(\lambda) d(x)}$. To do this, we first computed the dark channel map of the foggy image $I_1(x, \lambda)$ and selected the brightest 1000 pixels. We estimated $A$ by averaging the corresponding positions of these 1000 pixels in the foggy image $I_1(x, \lambda)$. Furthermore, we introduced a parameter $\varepsilon_1$ to control the degree of defogging; the resulting defogging filter is given by Equation (2).
Considering the impact of raindrops on images, the blurred rainy image is modeled by Equation (3), where $I_2(x, \lambda)$ is the rainy image, $G(x, \lambda)$ denotes the scene radiance (clean image), and $M(x)$ is the binary mask for raindrops, based on the scattering and reflection effects of raindrops and used to mark the positions and intensities of raindrops. $Q(x, \lambda)$ represents the blurred image formed by the light reflected from raindrops. To restore the clear image $G(x, \lambda)$, we needed to estimate the effect of raindrops and remove their influence. First, we computed the dark channel map of the rainy image $I_2(x, \lambda)$ and selected the brightest pixel. Then, by averaging the corresponding pixels in the rainy image $I_2(x, \lambda)$, we could estimate the light reflected by raindrops $Q(x, \lambda)$.
We introduced a parameter $\varepsilon_2$ to control the degree of deraining, similar to the design of the defogging filter. The final deraining filter is given by Equation (4). By adjusting the parameter $\varepsilon_2$, we could effectively mitigate the impact of raindrops and restore a clearer image.
For the desnowing filter design, the snowy image model is given by Equation (5), where $I_3(x, \lambda)$ is the snowy image, $H(x, \lambda)$ represents the scene radiance (clean image), and $z$ is the binary mask for snowflakes, based on the reflection effects of snowflakes and used to form spots on the image. $S(x)$ denotes the blurred image formed by the light reflected from snowflakes. Similarly to the deraining filter, we first computed the dark channel map of the snowy image $I_3(x, \lambda)$ and selected the brightest pixel. Then, by averaging the corresponding pixels in the snowy image $I_3(x, \lambda)$, we could estimate the light reflected by snowflakes $S(x)$.
The final desnowing filter is given by Equation (6). By adjusting the parameter $\varepsilon_3$, we could control the degree of desnowing and restore a clearer image. Through the above design, we could utilize the principles of the dark channel prior and the atmospheric scattering model, combined with the learning parameter approach, to design defogging, deraining, and desnowing filters. These filters could mitigate the impact of weather on images to some extent, thereby improving image quality and clarity.
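The defogging steps described above (dark channel map, atmospheric light from the brightest dark-channel pixels, a strength parameter for the degree of defogging) can be sketched in NumPy. This is only an illustrative sketch, not the authors' filter: the transmission estimate t = 1 − ε1 · dark(I / A) is the standard dark-channel-prior form and is assumed here, and the function names, the patch size, and the lower bound `t_min` are hypothetical.

```python
import numpy as np

def dark_channel(img, patch=3):
    """Minimum over colour channels, then over a patch x patch neighbourhood."""
    h, w, _ = img.shape
    min_rgb = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    out = np.empty_like(min_rgb)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

def estimate_atmospheric_light(img, dark, top_k=1000):
    """Average the foggy-image pixels at the top_k brightest dark-channel positions."""
    flat_dark = dark.ravel()
    k = min(top_k, flat_dark.size)
    idx = np.argpartition(flat_dark, -k)[-k:]
    return img.reshape(-1, 3)[idx].mean(axis=0)

def defog(img, eps1=0.9, t_min=0.1, patch=3):
    """Invert the scattering model I = K*t + A*(1 - t) for K.

    Assumption: t = 1 - eps1 * dark_channel(I / A), clipped to [t_min, 1].
    img is a float array in [0, 1] with shape (H, W, 3).
    """
    dark = dark_channel(img, patch)
    A = np.maximum(estimate_atmospheric_light(img, dark), 1e-6)
    t = 1.0 - eps1 * dark_channel(img / A, patch)
    t = np.clip(t, t_min, 1.0)[..., None]
    K = (img - A) / t + A
    return np.clip(K, 0.0, 1.0), t[..., 0], A
```

With `eps1 = 0` the transmission is 1 everywhere and the input is returned unchanged; larger values remove more of the estimated haze. In the paper this strength parameter is learned rather than set by hand.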
The synthetic dataset consisted of power scene data captured by drones under normal weather conditions, upon which a dataset of images under complex weather conditions was constructed offline. Foggy images were obtained by setting the fog parameters to 0.6 and 0.1. Rainy images were generated by setting the droplet diameter to 2 pixels and simulating a random distribution of raindrops; Gaussian blur with a standard deviation of 1 was then applied to simulate ambient reflected light and replicate the lighting effect of raindrops. Snowy images were obtained by setting the snowflake diameter to 5 pixels, transparency to 0.8, and color to white.
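As a rough illustration of how such synthetic images can be produced, the sketch below adds fog with the atmospheric scattering model and overlays white snowflakes. Several choices are assumptions rather than details taken from the paper: the text does not say which fog parameters the values 0.6 and 0.1 refer to, so assigning them to the atmospheric light `A` and the scattering coefficient `beta` is a guess; the depth map, the number of flakes, and reading "transparency 0.8" as the blending weight of the white flake are likewise hypothetical.

```python
import numpy as np

def add_fog(clean, depth, A=0.6, beta=0.1):
    """Apply I = K * t + A * (1 - t) with t = exp(-beta * depth).

    clean: float image in [0, 1], shape (H, W, 3); depth: shape (H, W).
    A = 0.6 and beta = 0.1 are an assumed reading of the paper's two values.
    """
    t = np.exp(-beta * depth)[..., None]
    return clean * t + A * (1.0 - t)

def add_snow(clean, n_flakes=50, diameter=5, alpha=0.8, seed=0):
    """Blend white discs of the given diameter into the image.

    alpha is treated as the weight of the white flake (an assumption).
    """
    rng = np.random.default_rng(seed)
    h, w, _ = clean.shape
    out = clean.copy()
    yy, xx = np.mgrid[0:h, 0:w]
    radius = diameter / 2.0
    for _ in range(n_flakes):
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        out[mask] = alpha * 1.0 + (1.0 - alpha) * out[mask]
    return out
```

A zero depth map leaves the image untouched, while a very large depth drives every pixel towards the atmospheric light, which matches the behaviour of the scattering model.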
The TRA module was designed for image defogging, deraining, and desnowing, initially involving the physical modeling of fog, raindrops, and snowflakes in the images. This module primarily focused on parameter estimation and image degradation. Through joint learning, it utilized convolutional networks to estimate key operators for fog, rain, and snow, facilitating image restoration. In the parameter estimation module, five convolutional layers were employed to fuse multiscale information, effectively integrating coarse and fine-scale features through the concatenation of parallel convolutional layers (Concat layers). This process also involved estimating the parameters of the input image. The convolutional network learned the specific values of these parameters, which were integral components of the atmospheric scattering model and significantly impacted the results of image restoration. Notably, the design of the Concat layer incorporated hierarchical connections, with each Concat layer progressively connected to other convolutional layers. This design effectively compensated for any information loss during the convolution process, ensuring comprehensive feature detail acquisition. The output of the parameter estimation module included key parameters in the atmospheric scattering model, playing a critical role in the subsequent image restoration process.

DLIE Module
Typically, image correction and enhancement operations involve manual adjustment of filter parameters based on experience, which poses several challenges. Firstly, manual adjustment is prone to subjective factors and experience limitations, leading to subjective parameter choices and significant errors. Secondly, parameters adjusted manually are often optimized for specific scenes or image collections, lacking generality across different scenes and diverse images, thereby limiting the algorithm's applicability. Thirdly, manually tuned parameters may not adapt well to changes in different environments and image conditions, resulting in poor system adaptability when facing new data.
To address these challenges, this study proposes an automated method using a small CNN to estimate filter parameters. This approach improves the performance and applicability of image correction and enhancement operations by enabling a more comprehensive search of the parameter space. It reduces subjective errors, enhances the system's adaptability, and achieves better results across different scenes and image conditions.
The DLIE module comprised both pixel-level filters and sharpening filters. The pixel-level filter involved four adjustable parameters: white balance, gamma correction, hue adjustment, and contrast enhancement. Its primary purpose was to smooth the image after dehazing, deraining, and desnowing to improve visual quality. Table 1 presents descriptions of the four filters and their parameters. The white balance (WB) filter achieved white balance by adjusting the weights ($W_r$, $W_g$, $W_b$) of the red $r_i$, green $g_i$, and blue $b_i$ channels of the input image. The output $X_o$ was composed of the weighted input channels. The gamma filter adjusted the image by raising each pixel value $X_i$ to the power of the parameter $G$. The contrast filter adjusted the image contrast by varying the brightness values, where $\alpha$ is a parameter ranging between 0 and 1. The tone filter adjusted the image hue by applying different hue functions ($L_{t_r}$, $L_{t_g}$, $L_{t_b}$) to each color channel. The parameter $t_i$ represents the hue function.

Table 1. Descriptions of the four filters and their parameters (columns: Filter, Parameters, Mapping Function).
The parameters for the contrast filter mapping function were defined by Equations (7)-(9). Equation (7) computed the brightness value $L(X_i)$ of the input pixel by applying a weighted sum to the three channels, where $\omega_1$, $\omega_2$, $\omega_3$ were adjustable weight parameters for the corresponding channels. Equation (8) adjusted the brightness non-linearly using a cosine transformation of the brightness value, thus producing the enhanced brightness $\mathrm{MaL}(X_i)$. Equation (9) adjusted the entire pixel by multiplying the input pixel value $X_i$ by the ratio of the enhanced brightness value $\mathrm{MaL}(X_i)$ to the brightness value $L(X_i)$.
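The three steps described above can be written out as follows. This is a hedged reconstruction from the prose rather than the paper's exact formulas: the weighted sum and the final ratio follow directly from the description, whereas the half-cosine form in the second line is an assumption borrowed from the contrast filters commonly used in differentiable image processing pipelines, and the output symbol $\mathrm{Ma}(X_i)$ is a hypothetical name.

```latex
\begin{align}
L(X_i) &= \omega_1 r_i + \omega_2 g_i + \omega_3 b_i \tag{7} \\
\mathrm{MaL}(X_i) &= \tfrac{1}{2}\left(1 - \cos\left(\pi \, L(X_i)\right)\right) \tag{8} \\
\mathrm{Ma}(X_i) &= X_i \times \frac{\mathrm{MaL}(X_i)}{L(X_i)} \tag{9}
\end{align}
```

Under this assumed form, a brightness of 0.5 is left unchanged, darker pixels are pushed darker and brighter pixels brighter, which is the expected behaviour of a contrast-enhancing curve.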
By processing the contrast parameters as described above, the contrast filter mapping function became more flexible and suitable for various image scenes. This enabled effective enhancement of image contrast, thereby improving image quality and visual effects.
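A minimal NumPy sketch of three of the pixel-level filters may help make the mappings concrete. It is written under stated assumptions and is not the authors' implementation: gamma is implemented as the usual power law, the contrast curve uses an assumed half-cosine form, the blend controlled by `alpha` is an assumed use of the α parameter, and the default luminance weights are placeholder values standing in for the learned ω1, ω2, ω3. The tone filter is omitted because its mapping function is not recoverable from the text.

```python
import numpy as np

def white_balance(x, w_r, w_g, w_b):
    """Scale the red, green and blue channels by their weights."""
    return np.clip(x * np.array([w_r, w_g, w_b]), 0.0, 1.0)

def gamma_correct(x, G):
    """Power-law gamma mapping; G < 1 brightens dark regions."""
    return np.power(np.clip(x, 1e-6, 1.0), G)

def contrast(x, alpha, weights=(0.27, 0.67, 0.06)):
    """Contrast filter: luminance, cosine remapping, rescaled pixel, blend.

    weights are placeholders for the learnable omega_1..omega_3;
    the half-cosine curve and the alpha blend are assumptions.
    """
    lum = np.clip((x * np.array(weights)).sum(axis=2, keepdims=True), 1e-6, 1.0)
    enhanced_lum = 0.5 * (1.0 - np.cos(np.pi * lum))
    enhanced = x * enhanced_lum / lum
    return np.clip(alpha * enhanced + (1.0 - alpha) * x, 0.0, 1.0)
```

Each function takes a float image in [0, 1] with shape (H, W, 3), so the filters can be chained in any order, with the parameters supplied by the small parameter-estimation CNN.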
The primary function of the image sharpening filter was to enhance image clarity by accentuating contours and highlighting edge information, which is particularly beneficial after haze removal. The sharpening process is described by Equation (10).
Equation (10) represents a function for image enhancement, and EnhanceFunc($I(x)$) is a function used to enhance details. $\lambda$ is a weight parameter between the original image and the image with optimized details, balancing the differences between them. $\beta$ is a newly introduced parameter used to adjust the balance between enhancement and detail strengthening. Overall, this function achieved local contrast enhancement and detail strengthening of the image.
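Since Equation (10) itself is not reproduced here, the following sketch only illustrates the general idea of sharpening by adding back weighted detail. It assumes the common unsharp-masking form I + λ · (I − Gau(I)), where Gau is a Gaussian blur; this stands in for EnhanceFunc and is not the paper's exact function, and the additional balance parameter β is not modeled. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma=1.0, radius=2):
    """Normalised 1-D Gaussian kernel of length 2 * radius + 1."""
    ax = np.arange(-radius, radius + 1)
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur with edge padding; img has shape (H, W, C)."""
    k = gaussian_kernel(sigma, radius)
    h, w, _ = img.shape
    padded = np.pad(img, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Vertical pass, then horizontal pass.
    tmp = sum(k[i] * padded[i:i + h, :, :] for i in range(2 * radius + 1))
    return sum(k[j] * tmp[:, j:j + w, :] for j in range(2 * radius + 1))

def sharpen(img, lam=1.0):
    """Unsharp masking: add lam times the high-frequency detail back to the image."""
    detail = img - gaussian_blur(img)
    return np.clip(img + lam * detail, 0.0, 1.0)
```

Flat regions are unaffected because the detail term vanishes there, while the two sides of an edge are pushed apart, which increases the perceived clarity of contours.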
The DLIE module optimized parameters based on a CNN. Due to the high computational cost of extracting features from high-resolution images using CNNs, there is a risk of resource wastage. Therefore, we downsampled the high-resolution adverse weather synthetic images and extracted image filtering parameters based on this downsampled version. After the images were processed by the TRA module to remove fog, rain, and snow, the filters were applied to the downsampled processed images to enhance their quality.
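The downsampling step can be illustrated with a small bilinear resize written in NumPy. This is a sketch for intuition only; the function name and the pixel-centre alignment convention are choices made here, not details from the paper, and a practical pipeline would normally rely on the resize routine of its imaging or deep learning library.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize an (H, W, C) float image with bilinear interpolation."""
    in_h, in_w, _ = img.shape
    # Map output pixel centres back to input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    # Interpolate along x on the two neighbouring rows, then along y.
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy
```

For example, `bilinear_resize(image, 256, 256)` would produce the low-resolution input from which the filter parameters are estimated, keeping the parameter-estimation cost independent of the original image size.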
To minimize computational overhead and enhance network efficiency, we applied a compact CNN to the downsampled adverse-weather images, considering the relatively small number of parameters required for the filters. Before parameter estimation, the images, after fog, rain, and snow removal, underwent bilinear interpolation. This approach ensured that parameter estimation was both reasonable and effective, even with low-resolution images.

Improved YOLOV9 Detection Module
Uncontrollable factors such as the size, shape, position, and orientation of the objects, as well as occlusions under adverse weather conditions, make it difficult to achieve accurate detection using traditional convolutional operations on images that have already been processed and enhanced. Issues such as extensive false positives or negatives may arise. To address these challenges and improve the model's ability to recognize objects like power lines and trees in images after processing, this paper proposes the incorporation of the Three-Layer Routing Attention module at the last part of the backbone. The Three-Layer Routing Attention module was integrated into the entire network and underwent end-to-end training with appropriate loss functions, ensuring that the model could simultaneously learn image features, processing parameters, and attention weights. This adaptation enhanced the model's focus and generalization capabilities for different image processing and enhancement tasks.
The proposed Three-Layer Routing Attention module, shown in Figure 3, implements an improved three-layer routing attention mechanism comprising the following levels. The first layer was the Region Routing Attention Mechanism (RRAM), which operated at a macroscopic region level and introduced a method for constructing association graphs. Utilizing dynamic and adaptive association graph construction strategies, RRAM flexibly adapted to feature relationships in different scenes. This adaptability was achieved through an adaptive threshold mechanism.
Equation (11) defines the feature tensor of the input image:
$$X \in \mathbb{R}^{H \times W \times C} \tag{11}$$
where $X$ represents the feature tensor of the input image, $H$ is the height of the feature tensor, $W$ is the width of the feature tensor, $C$ is the number of channels in the feature tensor, and $\mathbb{R}$ denotes the set of real numbers. Equation (12) describes the process of region partitioning, transforming the input tensor $X$ into $X^r$:
$$X^r = \mathrm{Reshape}(X) \in \mathbb{R}^{S_1 S_2 \times \frac{HW}{S_1 S_2} \times C} \tag{12}$$
where $S_1 S_2$ represents the size of the region partition. The operation Reshape rearranges the elements of a tensor into a new shape while preserving the order of the elements. In this context, it reshapes the input tensor $X$ into $X^r$, which consists of $S_1 S_2$ regions, each of size $HW / (S_1 S_2)$ with $C$ channels.
Equation (13) then projects $X^r$ into queries, keys, and values:
$$Q = X^r W^q, \quad K = X^r W^k, \quad V = X^r W^v \tag{13}$$
where $W^q$, $W^k$, and $W^v$ are the projection weights for the queries, keys, and values, respectively. They were learned parameters of the model and were determined during the training process through backpropagation. The projection aimed to map the original feature space into a space where queries, keys, and values could be computed efficiently for the subsequent attention mechanism.
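The region partition and the linear projections can be sketched in NumPy as below. This is an illustrative sketch under assumptions: the function names are hypothetical, and the partition is implemented with a reshape followed by a transpose so that each region is a spatially contiguous window, which is one common way to realise such a partition and may differ in detail from the authors' Reshape operation.

```python
import numpy as np

def partition_regions(x, s1, s2):
    """Split an (H, W, C) tensor into S1*S2 windows of (H/S1)*(W/S2) tokens each."""
    h, w, c = x.shape
    assert h % s1 == 0 and w % s2 == 0, "H and W must be divisible by S1 and S2"
    rh, rw = h // s1, w // s2
    # (S1, rh, S2, rw, C) -> (S1, S2, rh, rw, C) -> (S1*S2, rh*rw, C)
    x = x.reshape(s1, rh, s2, rw, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(s1 * s2, rh * rw, c)

def project_qkv(x_r, w_q, w_k, w_v):
    """Linear projections of the region tokens into queries, keys and values."""
    return x_r @ w_q, x_r @ w_k, x_r @ w_v
```

In a trained network the three weight matrices would be learned by backpropagation; here they are plain arrays of shape (C, C) supplied by the caller.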
Building upon the RRAM, the Token Routing Attention Mechanism (TRAM) was introduced to further optimize the selection of attention regions. TRAM employs advanced graph pruning algorithms combined with deep learning techniques to enhance routing effectiveness by learning more complex relationships between each node.
Average Pooling is a pooling operation that calculates the average value of the input data (usually a tensor) within each region.Specifically, in Equations ( 14) and ( 15), Q r and K r represent the average pooling operations applied to queries Q and keys K, respectively.Through average pooling, the dimensionality of the input data could be reduced while retaining its important features.TopKIndex is a selection operation that retrieved the indices of the top K values from the input data.In Equation ( 16), A r is the attention matrix between queries and keys, where each element represents the attention from one region to another.Equation ( 17) applied the TopKIndex operation to select the indices of the regions with the highest attention I r values from the attention matrix A r .These indices represented the most important and noteworthy regions, which were used for further processing or analysis: In Equations ( 18) and ( 19), we used the Gather operation to retrieve region representations from the global keys K corresponding to the pre-determined Top-K indices I r .This was done to retain only the representations of keys associated with the most important regions, thereby reducing computational complexity.Following Equation ( 20), we performed token-to-token attention operation between the queries Q and the retrieved token keys K g and values V g .This operation assigned attention weights based on the similarity between queries and keys, and used these weights to compute a weighted sum of values, obtaining context-relevant representations for each query.We then added a local contextual enhancement (LCE) term to obtain O 1 , providing richer local context information and enhancing the model's understanding of the input image.
$$V^g = \mathrm{Gather}(V, I^r)$$
At the global level, the Global Routing Attention Mechanism (GRAM) was established, introducing a more extensive modeling of positional relationships. GRAM not only performed routing within each region but also introduced a more global routing mechanism spanning the entire input image.
Equation (21) computed global representations by performing average pooling on the queries $Q^r$ and keys $K^r$. Equation (22) calculated the global attention matrix $A^g$ by taking the dot product of the global queries and keys. Equation (23) selected global regions of interest by retrieving the global indices $I^g$ with the highest attention.
Equations (24) and (25) retrieved the corresponding global representations from the keys and values based on the global indices $I^g$. Equation (26) performed the global attention operation between the queries $Q$ and the global token keys $K^g$ and values $V^g$. The proposed three-layer routing attention mechanism comprehensively captured the correlations and feature relationships at different levels of the input data, thereby enhancing the model's performance.
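The region-to-token routing described for Equations (14) to (20) can be illustrated with a short NumPy sketch. It is not the authors' implementation: the tensor layout (S regions, n tokens per region, C channels), the function names, and the 1/sqrt(C) scaling are assumptions made for illustration, and the LCE term is omitted. The global level would follow the same pool, select, gather, and attend pattern with the indices $I^g$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def routed_attention(Q, K, V, top_k):
    """Region-routed token attention (illustrative sketch).

    Q, K, V: arrays of shape (S, n, C), i.e. S regions with n tokens each.
    top_k:   number of regions each query region attends to.
    """
    S, n, C = Q.shape
    # Region-level queries and keys via average pooling over tokens.
    Q_r = Q.mean(axis=1)                       # (S, C)
    K_r = K.mean(axis=1)                       # (S, C)
    # Region-to-region attention matrix.
    A_r = Q_r @ K_r.T                          # (S, S)
    # Indices of the top_k most relevant regions for each region.
    I_r = np.argsort(-A_r, axis=1)[:, :top_k]  # (S, top_k)
    # Gather the keys and values of the selected regions.
    K_g = K[I_r].reshape(S, top_k * n, C)
    V_g = V[I_r].reshape(S, top_k * n, C)
    # Token-to-token attention restricted to the gathered tokens
    # (the 1/sqrt(C) scaling is an assumption, as in standard attention).
    scores = Q @ K_g.transpose(0, 2, 1) / np.sqrt(C)
    return softmax(scores) @ V_g, I_r
```

With `top_k` equal to the number of regions, the routed result coincides with full attention over all tokens, which is a convenient sanity check; smaller values of `top_k` reduce the number of key-value pairs each query touches.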

Loss Function
The improved YOLOV9 included a three-layer routing attention mechanism, comprising region routing, token routing, and global routing, aiming to achieve optimal recognition and detection capability under fog, rain, snow, and normal weather conditions. To adapt to image features under different weather conditions, IDP-YOLOV9 adopted a comprehensive loss function during training, which included detection loss and routing attention loss, to enhance the model's adaptability to different weather conditions. The entire network was trained end-to-end under the improved YOLOV9 detection loss to ensure mutual adaptation between the internal modules of IDP-YOLOV9. To further address potential domain shifts introduced by synthetic data, IDP-YOLOV9 used mixed training with real datasets to bring the model closer to real-world environments, thus improving its robustness under adverse weather conditions.
The overall loss function consisted of the detection loss derived from YOLOV9 and the routing attention loss across different layers, forming a composite measure. In addition to the detection and routing attention losses, an IoU loss function was introduced to measure the overlap between predicted bounding boxes and ground truth bounding boxes. The incorporation of the IoU loss function enabled the model to predict the position and shape of objects more accurately, thereby improving the accuracy of object detection.
The parameters of the routing attention loss layer and the parameters in the IoU loss function were dynamically adjusted based on the processing results of different weather factors by the TRA module. By associating the scale factor ratio with the feature parameters, the ratio could be adjusted to a smaller value when the density of raindrops or snowflakes increased, ensuring that the bounding boxes more accurately captured the target object. Based on the characteristics of rainy and snowy days, the calculation of intersection and union was adjusted by a blur factor. Because weather can blur object boundaries, relaxing the definition of intersection enabled better adaptation to these situations. The importance of different weather conditions could be reflected by adjusting the weight parameters in the loss function. When it was rainy or snowy, $\lambda_{region}$ and $\lambda_{token}$ could be adjusted to larger values to pay more attention to the generation and adjustment of bounding boxes.
In the IoU loss, $b_r$, $b_l$, $b_b$, and $b_t$ denote the right, left, bottom, and top boundaries of the predicted bounding box. $w^{gt}$ and $h^{gt}$ denote the width and height of the ground truth bounding box, respectively, and $w$ and $h$ represent the width and height of the predicted bounding box. $r$ denotes the scale factor ratio and $fuzz$ represents the blur factor. In the final total loss function, $L_{det}$ represents the YOLOV9 detection loss, and $L_{region}$, $L_{token}$, and $L_{global}$ represent the losses of the region, token, and global routing layers, respectively. The hyperparameters $\lambda_{region}$, $\lambda_{token}$, $\lambda_{global}$, and $\lambda_{IoU}$ were used to adjust the weights of their respective losses and balance their effects.
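The scaled IoU loss of Equations (28) to (30) and the weighted combination of loss terms can be sketched as follows. Several points are assumptions rather than statements from the text: boxes are given as (cx, cy, w, h) with positive width and height, the boundaries are obtained by scaling each box by $r$ around its own centre, the total loss is taken to be a plain weighted sum, and the blur factor is left out because its exact role in the intersection is not specified here. The function and parameter names are hypothetical.

```python
def inner_iou_loss(pred, gt, r=1.0):
    """Scaled IoU loss for boxes given as (cx, cy, w, h); r is the scale factor ratio."""
    (cx, cy, w, h), (gcx, gcy, gw, gh) = pred, gt
    # Assumed boundary definition: each box is scaled by r around its centre.
    b_l, b_r = cx - w * r / 2, cx + w * r / 2
    b_t, b_b = cy - h * r / 2, cy + h * r / 2
    g_l, g_r = gcx - gw * r / 2, gcx + gw * r / 2
    g_t, g_b = gcy - gh * r / 2, gcy + gh * r / 2
    # Overlap of the two scaled boxes, clamped at zero when they are disjoint.
    inter_w = max(0.0, min(g_r, b_r) - max(g_l, b_l))
    inter_h = max(0.0, min(g_b, b_b) - max(g_t, b_t))
    inter = inter_w * inter_h
    union = gw * gh * r ** 2 + w * h * r ** 2 - inter   # Eq. (28)
    iou_inner = inter / union                            # Eq. (29)
    return 1.0 - iou_inner                               # Eq. (30)

def total_loss(l_det, l_region, l_token, l_global, l_iou,
               lam_region=1.0, lam_token=1.0, lam_global=1.0, lam_iou=1.0):
    # Assumed form: detection loss plus weighted routing and IoU losses.
    return (l_det + lam_region * l_region + lam_token * l_token
            + lam_global * l_global + lam_iou * l_iou)
```

For a predicted box (0, 0, 2, 2) and a ground truth box (1, 0, 2, 2), `r=1.0` gives a loss of 2/3, whereas `r=0.5` gives a loss of 1.0 because the shrunken boxes no longer overlap. Under these assumptions, a smaller ratio makes the loss stricter about localization, in line with the adjustment described for dense rain or snow.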
During the training process of the IDP-YOLOV9 network, various data augmentation techniques were utilized, including image flipping, cropping, and transformations, to extend the training dataset. Additionally, images were randomly resized to (128n × 128n), where n ∈ [9, 19], to enhance the model's adaptability to different input sizes. The RAdam (Rectified Adam) optimizer was employed for better convergence performance during training. Algorithm 1 summarizes the training process of the proposed method. During the training phase, a comprehensive loss function was employed, providing a holistic optimization approach to enhance detection performance under various weather conditions.
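The random resizing step can be sketched in a few lines. The helper name and its parameters are hypothetical; the defaults simply mirror the range stated above (a base of 128 and n between 9 and 19 inclusive), and the actual resizing of the image and its box labels is left to the training pipeline.

```python
import random

def sample_training_size(base=128, n_min=9, n_max=19, rng=random):
    """Draw a square training resolution of the form (base*n, base*n)."""
    n = rng.randint(n_min, n_max)  # inclusive on both ends
    side = base * n
    return side, side
```

A new size would typically be drawn once per batch so that all images in the batch share the same resolution.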

Experimental Results
This section provides a systematic analysis and evaluation of both the detection function and the image processing capability of the enhanced YOLOV9 across different conditions. The experimental outcomes of the proposed algorithm in varied adverse weather environments are consolidated and summarized. To validate the IDP-YOLOV9 structure in fog, rain, and snow scenarios, comparisons were made with existing defogging methods (AOD-NET [58], GridDehazeNet [41]), deraining methods (EfficientDeRain [59], ADMM-ResNet [60]), and desnowing methods (U-DenseNet [61], All In One [62]). Subsequently, comparisons were conducted with existing detection methods, including Faster R-CNN, SSD, RetinaNet, YOLOV8, and YOLOV9. Finally, a comprehensive comparison was made between the image processing algorithms, such as AOD-NET, GridDehazeNet, EfficientDeRain, and ADMM-ResNet. The experiments were performed on a system featuring an NVIDIA GeForce RTX 4090 GPU.

Implementation Details
This study employed a collaborative approach to train the IDP-YOLOV9 network architecture. In the initial phase, the YOLOV9 detection network was trained without prior knowledge and underwent transfer learning alongside the dataset proposed in this study. The DLIE parameters were reconfigured by utilizing the convolutional block of YOLOV9 up to the fifth layer, and joint training was conducted using a mixed data method. This collaborative training strategy aimed to facilitate maximum information exchange between the image processing module and the object detection network, thereby enhancing overall performance.
To further enhance the generalization capability for various adverse weather conditions, IDP-YOLOV9 dynamically adjusted the scale during training. Initially, a range of scales was established, and the selection of image size was dynamically adapted based on the content and image complexity. This enabled the model to accommodate various input sizes in each iteration, enhancing its robustness. The experiments were carried out using the PyTorch framework and run on GPUs.

Performance Evaluation
The accuracy of detection under different conditions largely depended on the quality of image defogging, deraining, desnowing, and enhancement. In this study, we compared our image processing and image enhancement modules (TRA + DLIE) with existing dehazing methods (AOD-NET, GridDehazeNet), deraining methods (EfficientDeRain, ADMM-ResNet), and desnowing methods (U-DenseNet, All In One). To ensure a fair comparison, we retrained and evaluated these methods on the same training and testing datasets (VOC [63], HAZE [64], FTOD). We used the MSE (mean squared error), PSNR (peak signal-to-noise ratio), and SSIM (structural similarity) metrics to measure image quality and similarity. Improvements in the PSNR and SSIM metrics implied that the processed images were closer to the original images, exhibiting higher quality and better preservation of image details and structures.
In the PSNR expression, $MAX_I$ represents the maximum achievable pixel value within the image. SSIM is another metric for comprehensive image quality assessment, evaluating image similarity. To strengthen the consideration of structural similarity in the image, we introduced a correction term, denoted as $\varphi$. In the enhanced SSIM formula, $\mu$, $\mu^2$, and $\sigma$ respectively represent the mean, variance, and covariance of the images. The addition of the correction term $\varphi$ to the structural similarity calculation made the evaluation metric more flexible, adapting it to different types and qualities of images under various adverse weather conditions. The objective evaluation results for weather-removed images on the different datasets are shown in Table 2.
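For reference, MSE and PSNR in their standard form, PSNR = 10 log10(MAX_I^2 / MSE), can be computed as in the sketch below. The function names are illustrative, `max_i` defaults to 255 on the assumption of 8-bit images, and the enhanced SSIM with the correction term is not reproduced because its formula is not available here.

```python
import numpy as np

def mse(reference, restored):
    # Convert to float first so that uint8 subtraction cannot wrap around.
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(restored, dtype=np.float64)
    return float(np.mean((ref - out) ** 2))

def psnr(reference, restored, max_i=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(reference, restored)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_i ** 2 / err))
```

A lower MSE and a higher PSNR both indicate that the restored image is closer to the clean reference.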
In this comparison, the proposed TRA + DLIE method outperformed the aforementioned algorithms on the objective evaluation metrics for dehazing, deraining, and desnowing.
Figure 4 illustrates the results of the aforementioned desnowing methods. While the other deep learning-based techniques exhibited varying levels of image artifacts and incomplete dehazing, our proposed approach to dehazing, deraining, and desnowing excelled at removing haze, raindrop textures, and snowflakes while simultaneously enhancing contrast and saturation. This comprehensive enhancement significantly improved image quality, laying a solid foundation for subsequent object detection tasks.

Evaluating the Detection Results of the Model
To assess the object detection performance of the proposed IDP-YOLOV9 model across diverse adverse weather conditions, comprehensive comparative experiments were conducted on the same test dataset, employing both cross-sectional and longitudinal analyses. Initially, the IDP-YOLOV9 algorithm was compared against leading CNN object detectors, including Faster R-CNN, SSD, RetinaNet, YOLOV8, and YOLOV9. Table 3 shows the detection results of these detectors across varying fog concentrations. Notably, across multiple adverse weather conditions, the object detection accuracy achieved by IDP-YOLOV9 exceeded that of the aforementioned algorithms.

To validate the impact of image restoration on subsequent object detection, comparative detection experiments were conducted using different restoration algorithms under various adverse weather conditions, including AOD-NET (defogging), GridDehazeNet (defogging), EfficientDeRain (deraining), ADMM-ResNet (deraining), U-DenseNet (desnowing), and All In One (desnowing). Figure 6 shows the results of the different snow removal models. The figure illustrates that, in snowy weather, the IDP-YOLOV9 algorithm not only effectively identifies and removes weather effects but also achieves notably higher accuracy and lower missed detection rates than the other algorithms. The integration of image processing and enhancement within a parallel architecture, along with the jointly optimized YOLOV9 algorithm proposed in this research, significantly enhances performance across diverse conditions. Furthermore, the jointly optimized model outperforms the methods mentioned previously, achieving higher average precision across diverse adverse weather conditions (fog, rain, and snow).
The method proposed in this paper exhibits certain advantages in the image processing and detection of power construction site images under various adverse weather conditions. Due to the difficulty in obtaining paired adverse weather images of the same scene, the proposed method was trained using conventional datasets, including synthetic datasets. Nevertheless, the proposed method is applicable to real-world power construction site environments under diverse adverse weather conditions.

Ablation Study
Comprehensive ablation experiments were conducted to verify the efficacy of the proposed image processing and enhancement modules in facilitating subsequent object detection, especially in various adverse weather conditions. The IDP-Improved YOLOV9 method was compared with several baseline methods, including Improved YOLOV9, Enhancement + Improved YOLOV9, and MultiRemoval + Improved YOLOV9, across three independent test datasets. Table 4 presents the mAP evaluation results for the three adverse weather conditions: fog, rain, and snow. In diverse adverse weather environments, both Enhancement + Improved YOLOV9 and MultiRemoval + Improved YOLOV9 demonstrated a significant advantage in detection performance over Improved YOLOV9 alone. The IDP-Improved YOLOV9 algorithm notably enhanced visibility in various adverse weather images, effectively improving the effects of defogging, deraining, desnowing, and enhancement while significantly increasing the accuracy of object detection. Specifically, in rainy conditions, IDP-Improved YOLOV9 achieved a 6.2% increase in detection accuracy (mAP) compared to Enhancement + Improved YOLOV9 and a 5.5% increase compared to MultiRemoval + Improved YOLOV9.
This series of ablation studies validated the synergistic effect of the parallel architecture, which combines the image defogging, deraining, and desnowing modules with the image enhancement module, on object detection. The experiments demonstrated that the IDP-Improved YOLOV9 method not only enhances visibility but also significantly improves the effectiveness of image defogging, deraining, desnowing, and other image processing tasks, leading to an enhancement in object detection performance.

Discussion
The proposed IDP-YOLOV9 technique demonstrates significant improvements in object detection under fog, rain, and snow conditions. However, this method still has some limitations. It may not generalize well to all adverse weather conditions, requiring more data and adjustments to ensure model performance. Moreover, given the intricate nature of deep learning models and the parallel processing architecture, the approach may require considerable computational resources and time. Future work should focus on further expanding and enriching object detection datasets under adverse weather to cover more scenarios and situations.

Conclusions
We propose that the effectiveness of the IDP-YOLOV9 method is attributable to its parallel optimization architecture, which allows for the flexible adjustment of feature relationships in different scenes, the dynamic tuning of image enhancement parameters, and the use of the advanced YOLOV9 network, tailored to power construction sites under various adverse weather conditions. Additionally, the joint learning approach employed by IDP-YOLOV9 enables better feature capturing from restored clear images, while the introduction of a three-layer routing attention mechanism effectively enhances the accuracy of object detection. Through comprehensive comparisons with other algorithms, we demonstrated its performance in addressing complex environments. Specifically, objective evaluation metrics together with subjective assessment methods were employed to thoroughly evaluate the performance of the Three-Weather Removal Algorithm (TRA) module and the image enhancement (DLIE) module on real image datasets. Compared to existing advanced detection algorithms and non-joint methods, the IDP-YOLOV9 algorithm exhibits superior performance. Overall, the results indicate that our method accurately identifies and effectively removes weather factors, while the improved detection module demonstrates outstanding performance on processed images, providing strong support for visual perception and object recognition by drones at power construction sites.

Figure 2 illustrates the proposed enhanced YOLOV9 detection network architecture. Conv extracts image features, RepNCSPELAN4 is the feature extraction and fusion module of YOLOV9, Concat concatenates feature maps from different layers, and SPPF enhances the network's detection capability for targets at different scales by pooling feature maps at different scales and then concatenating them to capture multi-scale information. Conv CLS (Convolutional Layer for Classification) is the convolutional layer for target classification, which classifies each bounding box. Detect transforms the output into the final object detection results, applying non-maximum suppression (NMS) to remove overlapping bounding boxes and filtering the final detections based on class confidence. Arrows indicate the direction of data flow, starting from the input image and passing through a series of convolutional layers, concatenation operations, and detection layers to produce the object detection results.
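The post-processing attributed to the Detect stage, confidence filtering followed by non-maximum suppression, can be illustrated with a minimal greedy NMS in plain Python. This is a generic sketch rather than the YOLOV9 implementation: boxes are assumed to be (x1, y1, x2, y2) tuples, the thresholds are placeholder values, and per-class handling is omitted.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5, score_thr=0.25):
    """Return indices of kept boxes, highest score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep
```

In practice the suppression is run separately for each class so that overlapping objects of different categories are not removed.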

Figure 2. Improved YOLOV9 structure diagram. The red part represents the proposed three-layer routing attention mechanism module.

Figure 3. Tri-Level Routing Attention Module. The mm and softmax modules represent matrix multiplication with a transposed operand and softmax normalization of the attention weights, respectively. O1 and O2 are obtained by applying token and global attention to the gathered key-value pairs and then adding context enhancement terms.

$$\mathrm{union} = w^{gt} h^{gt} r^{2} + w h r^{2} - \mathrm{inter} \quad (28)$$
$$IoU^{inner} = \frac{\mathrm{inter}}{\mathrm{union}} \quad (29)$$
$$L_{IoU} = 1 - IoU^{inner} \quad (30)$$
In the formulas, $b^{gt}_r$, $b^{gt}_l$, $b^{gt}_b$, and $b^{gt}_t$ represent the right, left, bottom, and top boundaries of the ground truth bounding box, while $b_r$, $b_l$, $b_b$, and $b_t$ represent the right, left, bottom, and top boundaries of the predicted bounding box.

Figure 5 illustrates how the small CNN network predicted the DLIE module's image enhancement parameters (WB, gamma, contrast, tone) in three examples. The small CNN network learned a set of parameters for each image based on specific information. Information about brightness, color, and hue, together with the weather feature parameters of each image with fog, rain, or snow, was considered. After the weather factors were filtered out, the images were processed through the DLIE module, improving the visual effects, enhancing image clarity, supplementing detailed information, and benefiting subsequent object detection.

Figure 5 visually demonstrates how our small CNN network predicted the image enhancement parameters (WB, gamma, contrast, tone) for the DLIE module in three distinct examples. Leveraging specific image information such as brightness, color, hue, and weather features, the small CNN network learned tailored parameter sets for each image. After the weather-related factors were filtered out, the images were processed through the DLIE module, resulting in enhanced visual effects, improved image clarity, and enriched details, ultimately facilitating more accurate detection.
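To make the role of the predicted parameters concrete, the sketch below applies three of the four filters to an image once their parameters are known. The filter definitions here are common textbook forms chosen for illustration and are not taken from Table 1: white balance as per-channel gains, gamma as a power law, and contrast as scaling around the image mean. The tone filter and the parameter-predicting CNN are omitted, and images are assumed to be float arrays in [0, 1] with shape (H, W, 3).

```python
import numpy as np

def white_balance(img, gains):
    # Multiply each colour channel by its gain (assumed WB form).
    return np.clip(img * np.asarray(gains, dtype=np.float64), 0.0, 1.0)

def gamma_correct(img, gamma):
    # Power-law mapping; gamma < 1 brightens, gamma > 1 darkens.
    return np.clip(img, 0.0, 1.0) ** gamma

def adjust_contrast(img, alpha):
    # Scale deviations from the image mean (assumed contrast form).
    mean = img.mean()
    return np.clip(mean + alpha * (img - mean), 0.0, 1.0)

def enhance(img, gains, gamma, alpha):
    """Apply the three filters in sequence with the given parameters."""
    out = white_balance(np.asarray(img, dtype=np.float64), gains)
    out = gamma_correct(out, gamma)
    return adjust_contrast(out, alpha)
```

In the described pipeline, `gains`, `gamma`, and `alpha` would be produced per image by the small CNN rather than set by hand, so that each weather-degraded image receives its own enhancement settings.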

Figure 6. The processing and detection effects of different algorithms on power scene images from a drone's perspective in snowy weather: (a) U-DenseNet algorithm; (b) All In One algorithm; (c) IDP-YOLOV9.

Table 1. The four filters and their parameters.

Table 2. Evaluation of defogging, deraining, and desnowing methods on different datasets against benchmarks by IDP-YOLOV9. ↑ indicates that our method achieves higher scores for this metric.

Table 3. Average mAP for different object detection methods under various adverse weather conditions.

Table 4. Comparison of ablation experiments. √ indicates that the corresponding model was used for the image enhancement, removal, or detection operation.