Article

AEA-YOLO: Adaptive Enhancement Algorithm for Challenging Environment Object Detection

by
Abdulrahman Kariri
and
Khaled Elleithy
*
Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA
*
Author to whom correspondence should be addressed.
AI 2025, 6(7), 132; https://doi.org/10.3390/ai6070132
Submission received: 14 May 2025 / Revised: 12 June 2025 / Accepted: 17 June 2025 / Published: 20 June 2025
(This article belongs to the Section AI Systems: Theory and Applications)

Abstract

Although deep learning-based object detection techniques show promising results, identifying objects in low-quality images captured under unfavorable weather remains challenging because existing methods must balance competing enhancement and detection demands and often overlook useful latent information. Meanwhile, YOLO continues to be developed for real-time object detection, addressing the limitations of current models, which struggle with low accuracy and high resource requirements. To address these issues, we propose an Adaptive Enhancement Algorithm YOLO (AEA-YOLO) framework that enhances each image to improve detection capability. A lightweight Parameter Prediction Network (PPN) containing only six thousand parameters predicts scene-adaptive coefficients for a differentiable Image Enhancement Module (IEM), and the enhanced image is then processed by a standard YOLO detector, called the Detection Network (DN). The proposed method adaptively processes images captured in both favorable and unfavorable weather conditions. Encouraging experimental results show that our approach achieves improvements of 7% and more than 12% in mean average precision (mAP) over existing models on the artificially degraded PASCAL VOC Foggy dataset and the Real-world Task-driven Testing Set (RTTS), respectively. Moreover, our approach compares favorably with other state-of-the-art and domain-adaptive object detection models in normal and challenging environments.

1. Introduction

One of the key areas of computer vision is object detection, an essential technique that enables computers to identify each object in an image, much as humans do. Traditional object detection methods, including Haar features [1], the Scale Invariant Feature Transform (SIFT) [2], and the Histogram of Oriented Gradients (HOG) [3], were regarded as representative approaches until the early 2000s, but their limited performance made them difficult to deploy widely across applications. To improve performance in object detection and related tasks, a variety of machine learning (ML) and deep learning (DL) models are used, and DL technology in particular has made significant advancements in image object detection. Deep learning-based object detection techniques, such as R-CNN [4], Faster R-CNN [5], and YOLO [6], improve on earlier technologies, enabling detection capabilities that match or even surpass human abilities. Two-stage object detectors were very common and efficient in the past, but single-stage object detection algorithms have improved dramatically in recent years and now compare favorably with most two-stage detectors. Furthermore, since their introduction, YOLO models have been used in a wide range of applications for object identification and recognition in diverse contexts and have demonstrated exceptional performance compared to their two-stage counterparts.
In a wide range of applications, including video surveillance, robotics, augmented reality, and autonomous vehicles, real-time object detection has become essential. Redmon et al. (2016) first presented the YOLO (You Only Look Once) family of object detection models, which has since undergone numerous revisions that balance accuracy and processing efficiency [7]. The YOLO family has gone through several modifications since its introduction, as shown in Figure 1, each improving on the one before it to solve issues and improve functionality. The most popular object detection technology in recent years is YOLO, which has been made more user-friendly by frequent version upgrades [6,7,8,9,10,11,12,13,14,15,16,17,18,19]. The YOLO architecture, which includes variants such as YOLOv4 and YOLOv5 and more recent adaptations, like YOLOv8 and the latest YOLO releases, is designed for real-time object recognition with a focus on accuracy and speed. Usually offered in four sizes, namely nano, small, medium, and large, these models are designed to accommodate a range of hardware limitations and application needs. Larger models, such as the medium and large variants, frequently perform better in terms of average precision (AP) and mean average precision (mAP), mostly because of their deeper feature extraction capabilities and higher parameter counts. On the other hand, the small and nano versions, which are intended for applications requiring minimal processing power, frequently show lower accuracy. Their smaller network complexity and lower parameter capacity are the likely cause of this performance disparity, since they may struggle to capture complicated spatial correlations and fine-grained characteristics [10]. As a result, smaller YOLO models may not meet the accuracy requirements of high-precision tasks, even though they are useful for deployment on edge devices or mobile platforms.
The existing YOLO models achieve good accuracy and speed in normal weather conditions. However, in adverse weather, issues like blurred images and limited visibility hinder target recognition. To address these issues, Liu et al. [20] developed a fully adjustable image processing module based on YOLO for object detection in foggy conditions. A lightweight dehazing network was introduced by Li et al. [21], who used it with the Faster R-CNN model to increase average detection accuracy. A multi-scale progressive fusion network (MSPFN) was developed for single images under adverse weather, which significantly increased detection accuracy [22]. Hnewa et al. [23] proposed a cross-domain target detection method that makes use of multi-scale characteristics and domain-adaptation methodologies. Oreski [24] introduced an algorithm called YOLO-C, which added the Multi ConTeXt (MCTX) context module and enhanced the loss function, improving object detection in complex situations. Several optimization algorithms used to enhance detection accuracy under adverse weather conditions, including the Gray Wolf Optimizer (GWO), Artificial Rabbit Optimizer (ARO), and Chimpanzee Leader Selection Optimization (CLEO), have been combined with the YOLO model [25]. The CF-YOLO model was designed by Ding et al. [26], who proposed a unique Cross-Fusion (CF) module capable of dealing with adverse detection issues, such as blurring, distortion, and target coverage under challenging situations. RFCS-YOLO [27] is an object detection model for challenging environments that uses receptive field enhancement and cross-scale fusion to address challenges like complex backgrounds and missed traffic targets in bad weather; it employs an efficient feature extraction module, a cross-scale fusion module, and a new Focaler-Minimum Point Distance Intersection over Union loss function.
CNN-based methods have become popular in object detection because of their promising performance on benchmark datasets and in real-world applications, like autonomous driving. However, they often fail to achieve satisfactory results under adverse weather conditions, such as fog. To address this, we introduce an object detection framework called AEA-YOLO, which uses a fully differentiable Image Enhancement Module (IEM) whose hyperparameters are adaptively learned by a CNN-based Parameter Prediction Network (PPN). The PPN adaptively predicts the IEM's hyperparameters based on the brightness, color, tone, and weather-specific information of the input image. The proposed AEA-YOLO approach can adaptively deal with images affected by different degrees of weather degradation. This work presents a joint optimization scheme to learn the IEM, PPN, and YOLO-based Detection Network in an end-to-end manner. The contributions of this work are as follows:
  • The proposed adaptive enhancement algorithm framework consolidates classical image processing filters for object detection into six differentiable filters.
  • The use of a PPN based on a CNN to predict data-specific filter combinations and parameter ranges.
  • The use of a DN detector based on YOLO.
  • The proposed AEA-YOLO approach achieves promising performance in both normal and adverse weather conditions, and encouraging experimental results are achieved on synthetic testbeds, including both the VOC Foggy and real-world RTTS datasets.

2. Related Work

2.1. Object Detection

Deep convolutional neural networks (CNNs) are widely used for object detection because of their multi-stage architecture, which automatically learns features from input images. These methods are considered black-box methods, yet they classify objects effectively [28]. As computer performance improves, CNNs become deeper, making them effective for classification, prediction, and object identification. Object detectors consist of a backbone, neck, and head that locate and categorize predefined items in an image. There are three categories of object detection algorithms: one-stage [29], two-stage [30], and anchor-free techniques [31]. One-stage detection algorithms are quicker but less accurate, while two-stage target detection algorithms perform two rounds of feature extraction and prediction [32]. Anchor-free algorithms locate key points instead of using anchor boxes, while single-stage detectors are useful in contexts with limited resources because of their lighter footprint and faster inference time. YOLO, a highly competitive single-stage detector, has demonstrated remarkable accuracy and real-time inference capabilities, highlighting its potential for use in manufacturing.
On benchmark datasets, many object detection techniques have demonstrated impressive performance. However, when applied to images captured in challenging settings, including fog, haze, or low-light conditions, their accuracy frequently suffers a significant decline. This is mainly because most traditional detectors are trained on datasets made up of images captured in standard environments.

2.2. Image Adaptation

Image adaptation remains a pivotal strategy in image enhancement, as it tailors transformation parameters to the specific characteristics of a given scene. Early works [33,34,35] demonstrated that parameters of image transformations can be adaptively computed based on key image features. For instance, ref. [35] introduced a brightness adjustment technique that adaptively modifies enhancement parameters according to the illumination distribution characteristics of the input image.
To further refine image enhancement in an adaptive manner, more recent studies [36,37] have employed compact convolutional neural networks (CNNs) to learn transformation hyperparameters. Specifically, ref. [36] proposed a post-processing framework composed of differentiable filters, where deep reinforcement learning (DRL) is utilized to select both operations and filter parameters based on real-time assessments of the retouched image quality. In a related vein, ref. [38] utilized a lightweight CNN to derive image-adaptive three-dimensional look-up tables (3D LUTs), guided by global context cues such as brightness, color, and tonal distribution.

2.3. Object Detection and Domain Adaptation Under Challenging Environments

Detecting objects in challenging conditions remains a significant challenge in computer vision. One conventional strategy involves reconstructing a clear image from a degraded input through approaches like dehazing or illumination correction [39]. For example, MSBDN [40] employs a U-Net-based architecture for dehazing, while GridDehazeNet [39] integrates attention mechanisms into both pre- and post-processing stages to mitigate artifacts in dehazed images. ZeroDCE [41] demonstrates light enhancement in a no-reference setting, further illustrating the spectrum of techniques aimed at image restoration. In parallel, domain adaptation strategies have emerged to bridge the gap between normal images and those captured under challenging weather conditions, focusing on improving object detection performance [42,43,44]. These methods generally fall into two categories: (i) training-based approaches that learn domain-invariant features and (ii) physics-based approaches that utilize environmental models. Within the training-based paradigm, DAYOLO [23] augments YOLO by jointly minimizing object detection and domain classification losses, while DSNet [45] uses a multi-task learning framework that simultaneously addresses image enhancement and object detection, yielding gains in foggy scenarios. In contrast, physics-based solutions [44] estimate condition-specific priors, such as transmittance for rain and haze, and then refine these estimates via adversarial training.
Recently, a new class of adaptive object detection techniques [46,47,48,49] has gained traction, aiming to seamlessly integrate image enhancement or compensation steps into the detection pipeline. These emerging methods represent an alternative to explicit restoration or strictly domain-focused adaptation, suggesting a more unified approach to handling adverse weather conditions. Object detection in challenging conditions, such as fog, rain, and low-light settings, remains a problem due to the domain shift between clear training images and degraded target images. Domain adaptation (DA) methods aim to mitigate this discrepancy by aligning the features so that a model trained in one domain maintains robust performance when deployed in another. Over the past few years, two overarching paradigms have emerged within the literature: (1) two-step pipelines that decouple image restoration/enhancement from detection and (2) end-to-end frameworks that integrate the adaptation or enhancement procedure directly into the detection model.
Two-Step Pipelines: In this traditional approach, image preprocessing is carried out as a discrete step before detection. For instance, physics-based dehazing or image restoration algorithms are first employed to remove visible artifacts and improve the clarity of degraded images, after which a standard detector, like the YOLO variant, is used [50]. Although this can be straightforward to implement, the lack of co-optimization between the enhancement and detection stages often limits performance. The enhancement module, optimized for metrics like the peak signal-to-noise ratio (PSNR), does not necessarily account for the downstream detection task [51]. As a result, two-step pipelines can produce visually appealing images yet fail to achieve the best possible detection accuracy. Nevertheless, two-step pipelines are attractive due to their modular nature, allowing for independent improvement and reuse of components, and their ability to enhance human readability and qualitative analysis in critical applications, like medical imaging [52] and aerial surveillance [53]. Preprocessing has even been demonstrated to boost detection accuracy when correctly aligned with the detection task [54].
End-to-End Frameworks: End-to-end solutions integrate the adaptation process, whether that involves dehazing, illumination correction, or other techniques, directly into the detection model. Recent work has introduced variants of YOLO and Faster R-CNN, wherein an image enhancement subnetwork or auxiliary domain adaptation module is trained jointly with the detection head [55,56]. This co-optimization ensures that image adaptation is driven by detection objectives, often yielding superior accuracy, particularly for small or partially obscured objects [57]. One-step pipelines are also more computationally efficient in terms of inference time. They eliminate the overhead of applying different algorithms for enhancement and detection sequentially.

3. Proposed Method

Challenging environments and weather conditions introduce significant challenges for object detection, such as reduced visibility and degraded image quality. To address this issue, we present a challenging-environment-resilient object detection framework, termed Adaptive Enhancement Algorithm YOLO (AEA-YOLO), designed to adaptively enhance input images for improved detection, as shown in Figure 2. The framework comprises three main modules: the Image Enhancement Module (IEM), the Parameter Prediction Network (PPN), and the Detection Network (DN). These modules work to improve object detection performance in difficult scenarios. The following subsections elaborate on each module and its integration within the framework.

3.1. Image Enhancement Module (IEM)

Computational imaging has been transformed by gradient-based optimization, which combines nonlinear optimization and deep learning. These tools address nonlinear inverse problems in computational imaging, define new neural network layers for image processing implementations, and significantly enhance the quality of conventional image processing algorithms. Many image processing operations, such as contrast, sharpening, gamma, white balance, and tone adjustment, can be implemented as differentiable programs. Recent advancements, such as Deep Image Prior (DIP) [58] and Plug-and-Play Priors [59], demonstrate the power of gradient-based methods in tackling inverse problems like denoising, deblurring, and super-resolution. Differentiable enhancement layers have been used successfully in end-to-end systems for visibility restoration in foggy scenarios [21] and tone correction in low-light image improvement [60]. These methods enable the simultaneous optimization of enhancement and recognition objectives, resulting in better performance in real-world vision systems.
The IEM uses differentiable image filters to enhance input images under different weather conditions. These filters, including defog, white balance, gamma correction, contrast adjustment, tone adjustment, and sharpening, are resolution-independent and differentiable. Because the filters are differentiable, they can be trained jointly with the CNN via gradient-based backpropagation, and they operate regardless of image resolution. The IEM is designed to dynamically enhance input images before passing them to the Detection Network (DN). It incorporates a set of image filters, allowing for end-to-end training. This module is crucial for pre-processing images affected by adverse weather conditions.
The selection and ordering of the IEM filters are grounded in both theoretical and empirical considerations. The sequence mirrors a progressive restoration pipeline, starting with global visibility correction (defog), followed by color fidelity adjustments (white balance) and photometric enhancement (gamma and tone correction), and concluding with detail emphasis (contrast and sharpening). This ordering aligns with standard practices in image enhancement pipelines, where early-stage global corrections are performed before local feature refinements. Defogging is prioritized as haze substantially reduces scene visibility, and removing it upfront improves the effectiveness of subsequent corrections [61]. White balancing follows to neutralize color casts introduced by atmospheric scattering. Gamma and tone correction are essential for adjusting dynamic range, particularly under non-uniform illumination [62]. Finally, contrast adjustment and sharpening enhance local details, which are critical for object boundary detection. Empirical evaluations in prior works, such as [33,36], show that this sequential application leads to superior perceptual quality and detection accuracy compared to randomized or reversed orders.
This section explains the image processing filters utilized in the image-adaptive approach. The degree of processing for each filter is determined by differentiable parameters. The suggested filters fall into three categories: pixel-wise filters for global intensity mapping [36], a sharpening filter for image enhancement [33], and a defog filter for fog removal [61].
Pixel-wise Filters: These filters map an input pixel value to an output pixel value [36], as illustrated below.
$$P_i = (R_i, G_i, B_i)$$
$$P_o = (R_o, G_o, B_o)$$
where $P_i$ denotes an input pixel value, $P_o$ denotes an output pixel value, and R, G, and B denote the three color channels: Red, Green, and Blue. Many filters operate pixel-wise, such as the white balance, gamma, contrast, and tone filters. The purpose of the white balance filter is to correct color imbalances caused by illumination differences; Figure 3 shows an example plot. We can express the filter as follows:
$$P_o = (W_R, W_G, W_B) \cdot P_i$$
where $W_R$, $W_G$, and $W_B$ are scaling factors for the Red, Green, and Blue channels.
In terms of gamma correction, the purpose is to adjust image brightness, as shown in Figure 4. The formula is shown below:
$$P_o = P_i^{\gamma}$$
where (γ) refers to the gamma value.
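As an illustration of how such pixel-wise filters can be made differentiable, the following sketch implements the white-balance and gamma operations in PyTorch. The tensor layout (B, 3, H, W), the [0, 1] value range, and the function names are assumptions made for illustration rather than the paper's exact implementation.

```python
# Minimal sketch of differentiable pixel-wise filters (white balance and gamma).
# Assumes images are (B, 3, H, W) tensors in [0, 1]; gradients flow through both
# the image and the PPN-predicted parameters.
import torch

def white_balance(img: torch.Tensor, gains: torch.Tensor) -> torch.Tensor:
    """gains: (B, 3) per-channel scaling factors (W_R, W_G, W_B)."""
    return (img * gains.view(-1, 3, 1, 1)).clamp(0.0, 1.0)

def gamma_correction(img: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
    """gamma: (B, 1) exponents; implements P_o = P_i ** gamma."""
    eps = 1e-6  # avoid undefined gradients at exactly zero
    return (img + eps) ** gamma.view(-1, 1, 1, 1)
```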
The purpose of the contrast filter is to improve the visibility of features by modifying the image contrast. Figure 5 illustrates how the contrast filter changes intensity values in an up-and-down curve fashion. The luminance of the image is used to express this filter, as expressed in the equation below:
$$\mathrm{Lum}(P_i) = 0.27\,R_i + 0.67\,G_i + 0.06\,B_i$$
The output of the contrast filter is a blend of the original image and an enhanced version of it, as shown in the equations below:
$$P_o = \alpha \cdot \mathrm{En}(P_i) + (1 - \alpha) \cdot P_i$$
where α is a weight controlling the blend of the original and enhanced pixel values. The enhanced image is calculated as:
$$\mathrm{En}(P_i) = P_i \cdot \frac{0.5\left(1 - \cos\left(\pi \cdot \mathrm{Lum}(P_i)\right)\right)}{\mathrm{Lum}(P_i)}$$
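A possible differentiable realization of this contrast filter is sketched below, using the luminance weights and blending equation given above; the tensor convention and the small epsilon added for numerical stability are assumptions.

```python
# Sketch of the contrast filter: compute luminance, build the enhanced image
# En(P_i), and blend it with the original using the learnable weight alpha.
import math
import torch

def contrast(img: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """img: (B, 3, H, W) in [0, 1]; alpha: (B, 1) blend weight from the PPN."""
    lum = 0.27 * img[:, 0:1] + 0.67 * img[:, 1:2] + 0.06 * img[:, 2:3]
    enhanced = img * 0.5 * (1.0 - torch.cos(math.pi * lum)) / (lum + 1e-6)
    a = alpha.view(-1, 1, 1, 1)
    return a * enhanced + (1.0 - a) * img
```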
The tone filter’s purpose is to modify the tone curve of the image for better color representation, as shown in Figure 6. The tone filter is a piecewise linear function that divides the range [0, 1] of the input pixel intensity $P_i$ into L tone levels and adjusts the slope in each interval. The tone levels are represented by a set of learnable tone parameters ($t_0$, $t_1$, $t_2$, …, $t_{L-1}$):
$$P_o = \frac{1}{L} \sum_{j=0}^{L-1} \mathrm{clip}\left(L \cdot P_i - j,\; 0,\; 1\right) \cdot t_j$$
where L is the number of tone levels and t j is the learnable tone parameter.
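The piecewise-linear tone mapping can be vectorized directly from the formula above, as in the following sketch; the clip-based formulation mirrors the equation, while the tensor shapes and batch handling are illustrative assumptions.

```python
# Sketch of the tone filter: each of the L tone levels contributes a clipped,
# linearly weighted piece, scaled by its learnable parameter t_j.
import torch

def tone_curve(img: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """img: (B, 3, H, W) in [0, 1]; t: (B, L) learnable tone parameters."""
    B, L = t.shape
    j = torch.arange(L, device=img.device, dtype=img.dtype).view(1, L, 1, 1, 1)
    pieces = torch.clamp(L * img.unsqueeze(1) - j, 0.0, 1.0)   # (B, L, 3, H, W)
    return (pieces * t.view(B, L, 1, 1, 1)).sum(dim=1) / L     # (B, 3, H, W)
```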
Sharpening Filter: The sharpening filter can bring out the details of an image. The sharpening method is similar to the unsharp masking approach [33] and can be expressed as follows:
$$F(x, \lambda) = I(x) + \lambda\left(I(x) - G(I(x))\right)$$
where I(x) is the input image, G(I(x)) denotes a Gaussian-filtered version of I(x), and λ is a positive scaling factor. The sharpening process is differentiable with respect to both x and λ, so the sharpening degree can be optimized to improve object detection performance, as shown in Figure 7.
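The unsharp-mask style sharpening can be written as a differentiable operation as sketched below; the 5×5 Gaussian kernel and its sigma are assumptions, since the paper does not specify the blur implementation.

```python
# Sketch of the sharpening filter F(x, lambda) = I(x) + lambda * (I(x) - G(I(x))),
# where G is approximated by a fixed depthwise Gaussian convolution.
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> torch.Tensor:
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

def sharpen(img: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """img: (B, 3, H, W); lam: (B, 1) positive sharpening strength."""
    k = gaussian_kernel().to(img.device, img.dtype).repeat(3, 1, 1, 1)
    blurred = F.conv2d(img, k, padding=2, groups=3)   # G(I(x)), depthwise blur
    return img + lam.view(-1, 1, 1, 1) * (img - blurred)
```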
Defog Filter: By estimating a transmission map and atmospheric light, the defog filter eliminates haze and fog from images. The defog filter is designed with a learnable parameter [61]. Based on the atmospheric scattering model [63], a hazy image is generated as in the equation below:
$$I(x) = J(x) \cdot t(x) + A\left(1 - t(x)\right)$$
where J(x) refers to the clean image (scene radiance), I(x) is the foggy image, A represents the global atmospheric light, and t(x) denotes the medium transmission map, which is defined as follows:
$$t(x) = e^{-\beta d(x)}$$
where the scene depth is denoted by d(x) and the atmosphere’s scattering coefficient is denoted by β. Obtaining the atmospheric light A and the transmission map t(x) is essential to recovering the clean image J(x). An approximate solution to t(x) can be obtained from I(x). We also add a parameter ω to regulate the level of defogging as follows:
$$t(x, \omega) = 1 - \omega \min_{C}\left(\min_{y \in \Omega(x)} \frac{I^{C}(y)}{A^{C}}\right)$$
where t(x) is the transmission map, ω is the learnable fog-removal parameter, Ω(x) is the neighborhood of pixel x, $I^{C}(y)$ is the intensity of channel C at location y, and $A^{C}$ is the atmospheric light for channel C. The enhanced pixel intensity is computed as:
$$J(x) = \frac{I(x) - A}{t(x)} + A$$
where J(x) is the enhanced image and A is the atmospheric light. Because the defog operation is differentiable, its parameter can be optimized via backpropagation to better handle foggy images, as shown in Figure 8.
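A compact differentiable version of this defog filter, following the dark-channel formulation above, is sketched next. The 15×15 neighborhood, the per-channel maximum used as the atmospheric-light estimate, and the 0.1 transmission floor are assumptions made for illustration.

```python
# Sketch of the learnable defog filter: estimate atmospheric light A, compute a
# dark-channel-based transmission map t(x, omega), and invert the scattering model.
import torch
import torch.nn.functional as F

def defog(img: torch.Tensor, omega: torch.Tensor, patch: int = 15) -> torch.Tensor:
    """img: (B, 3, H, W) in [0, 1]; omega: (B, 1) learnable defog strength."""
    A = img.flatten(2).max(dim=2).values.view(-1, 3, 1, 1)        # crude A estimate
    norm = img / A.clamp(min=1e-6)
    dark = -F.max_pool2d(-norm.min(dim=1, keepdim=True).values,   # local min = dark channel
                         kernel_size=patch, stride=1, padding=patch // 2)
    t = (1.0 - omega.view(-1, 1, 1, 1) * dark).clamp(min=0.1)     # t(x, omega)
    return ((img - A) / t + A).clamp(0.0, 1.0)                    # J(x) = (I - A)/t + A
```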
To sum up, the IEM contains six filters: gamma, white balance (WB), tone, sharpening, defog, and contrast, as shown in Figure 9.

3.2. Parameter Prediction Network (PPN)

The PPN, based on a CNN, is a lightweight convolutional regressor that maps a low-resolution RGB preview $x \in \mathbb{R}^{3 \times 256 \times 256}$ to a compact vector of filter coefficients $\theta \in \mathbb{R}^{N}$ used by the Image Enhancement Module (IEM). Table 1 summarizes the architecture, while this section elaborates on why each component is present and how it contributes to robust parameter estimation. The five convolutional blocks down-sample feature maps and increase channel capacity from 16 to 32. This allows the network to aggregate global statistics, like brightness, haze density, and color shift, without the memory footprint of max-pool layers. This “all-convolutional subsampling” technique preserves more contextual information than pooling for tiny parameter vectors rather than dense prediction maps [64]. Block 1 records low-frequency color and brightness cues, which are reliable indicators of white balance and gamma values. Blocks 2–3 increase the network’s receptive field, enabling it to detect mid-scale fog veiling and tone compression. Blocks 4–5 operate at 16×16 and 8×8 resolutions, highlighting coarse semantic patterns and image depth cues to estimate contrast and sharpening strength. Global Average Pooling (GAP) reduces spatial grids to channel means, leaving only scene-level statistics. The first fully connected layer (FC1) expands this into a 128-D latent embedding vector, while the final layer (FC2) outputs N scalar filter parameters, achieving the best accuracy–efficiency trade-off.
Batch normalization stabilizes statistics across weather conditions, allowing the lightweight model to converge within 15 epochs, and Leaky-ReLU avoids dead activations. As a result, the PPN uses GAP to condense the 32 × 8 × 8 feature map into a 32-dimensional scene descriptor; it then projects this descriptor through a 128-unit fully connected (FC) layer and finally regresses the N continuous filter coefficients via a second FC layer, whose N × 128 weight matrix linearly combines the 128 hidden activations to produce each parameter. Altogether, the network contains ≈6k learnable parameters and has minimal computational overhead while retaining sufficient capacity to adapt the IEM filters on a per-image basis.
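The following PyTorch sketch mirrors the PPN described above: five strided convolutional blocks with batch normalization and Leaky-ReLU, global average pooling, and two fully connected layers. The kernel sizes and exact channel widths are assumptions, so the parameter count will not reproduce the reported ≈6k figure exactly.

```python
# Sketch of the Parameter Prediction Network (PPN): an all-convolutional
# subsampling trunk followed by GAP and a two-layer regressor for N coefficients.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPN(nn.Module):
    def __init__(self, num_params: int):
        super().__init__()
        chans = [3, 16, 16, 32, 32, 32]              # 256 -> 128 -> 64 -> 32 -> 16 -> 8
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.LeakyReLU(0.1, inplace=True)]
        self.features = nn.Sequential(*layers)
        self.fc1 = nn.Linear(32, 128)                # 32-D scene descriptor -> 128-D embedding
        self.fc2 = nn.Linear(128, num_params)        # 128 -> N raw filter coefficients

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)                         # (B, 32, 8, 8) for a 256x256 preview
        f = f.mean(dim=(2, 3))                       # global average pooling -> (B, 32)
        return self.fc2(F.leaky_relu(self.fc1(f), 0.1))
```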
The enhancement parameters predicted by the PPN are bounded and linearly rescaled to fit photographic practice and differentiable pipelines. The defog transmittance weight is restricted to the closed interval [0.1, 1.0], while per-channel white-balance gains are confined to a ±10% chromatic correction. The gamma exponent can vary from one-third to three, and the tone-curve knots are bounded to maintain monotonicity. These constraints, as listed in Table 2, ensure that the lightweight PPN produces enhancement parameters that are expressive yet robust, and they add no extra learnable weights because the bounding functions are differentiable and applied at inference time only.
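One way to realize such differentiable bounds is a sigmoid rescaling of the raw PPN outputs, as sketched below. The slicing of the coefficient vector and the use of a softplus for the tone knots are assumptions; only the numeric ranges come from the description of Table 2 above.

```python
# Sketch of differentiable parameter bounding: squash unbounded PPN outputs into
# the Table 2 ranges without adding learnable weights.
import torch
import torch.nn.functional as F

def bound(raw: torch.Tensor, low: float, high: float) -> torch.Tensor:
    return low + (high - low) * torch.sigmoid(raw)

def split_params(raw: torch.Tensor) -> dict:
    """raw: (B, N) unbounded outputs; the slice layout is hypothetical."""
    return {
        "omega": bound(raw[:, 0:1], 0.1, 1.0),        # defog strength
        "wb":    bound(raw[:, 1:4], 0.9, 1.1),        # per-channel gains, +/- 10%
        "gamma": bound(raw[:, 4:5], 1.0 / 3.0, 3.0),  # gamma exponent
        "alpha": bound(raw[:, 5:6], 0.0, 1.0),        # contrast blend weight
        "lam":   bound(raw[:, 6:7], 0.0, 1.0),        # sharpening strength
        "tone":  F.softplus(raw[:, 7:]),              # positive knots keep the curve monotonic
    }
```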
Although the PPN outputs a single global filter vector, the features it learns are condition-aware rather than simple photometric averages. The first two convolutional blocks are sensitive to low-frequency luminance and color casts, cues that differ markedly among low-light scenes, dense fog, glare, or rainy scenes. Blocks three through five add larger receptive fields, capturing texture loss and depth-related veiling; these mid-level patterns let the network separate true fog density from mere under-exposure. During training, each degraded image is paired with its clean counterpart, so backpropagation steers different degradation types toward distinct regions of the coefficient space:
  • Dense fog: High defog strength (ω) and compressed tone curves.
  • Low-light (nighttime): Gamma values below one and moderate contrast boosts.
  • Glare: Blue-skewed white-balance gains and mild sharpening.
  • Rain: Moderate ω and mid-tone steepening.
This mapping enables the PPN to adapt its enhancement strategy automatically to the prevailing adverse condition.

3.3. Detection Network (DN)

In this work, we adopt YOLO as our one-stage Detection Network, given its extensive use in a range of real-world applications. As an evolution from earlier iterations, YOLO introduces the Darknet-53 backbone, which arranges sequential convolutional layers. The network leverages multi-scale training by producing predictions across multiple feature map resolutions, thereby enhancing detection accuracy for small objects [65]. We maintain the same network architecture and loss functions originally specified in YOLO. In addition, a recovery loss is computed as the L2 difference between the enhanced image and a “clean” image.
$$L_2 = \left[\,\mathrm{filtered\_image} - \mathrm{input\_data\_clean}\,\right]^2$$
The YOLO detector uses a learnable image pre-processing step before the usual YOLO pipeline. The code extracts filter-parameter vectors per image and applies each filter to produce an “enhanced” image. A recovery loss is calculated as the L2 difference between the improved and original images. The code then passes the enhanced image to the DN, as shown in Figure 10, which returns three feature “routes” for detecting large, medium, or small objects. The code then decodes the raw outputs into bounding boxes, maps the predictions into absolute coordinates, and applies Non-Maximum Suppression (NMS) to remove duplicates. Four major loss terms are used, i.e., Generalized Intersection over Union (GIoU) Loss, Confidence Loss, Classification Loss, and Recovery Loss. These losses are summed up to form the total training loss, with the IEM introducing an additional recovery term.
Although our method can compute a pixel-domain recovery loss, i.e., the squared distance between the enhanced image and its clean reference, this term is removed from the optimization target in the reported trials. In ablation experiments, inserting a tiny, non-zero weighting for the recovery term resulted in a mean average precision change of less than one percentage point, indicating that detection accuracy is essentially unaffected by moderate weighting values. To ensure reproducibility and focus on detection, all published results use a recovery coefficient of zero.
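For concreteness, the sketch below shows how the four loss terms can be combined; the YOLO detection losses are assumed to come from an existing implementation, and the recovery weight defaults to zero as in the reported experiments.

```python
# Sketch of the total training loss: GIoU + confidence + classification losses
# from the detector, plus an optional L2 recovery term weighted by lambda_rec.
import torch

def recovery_loss(enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    return torch.mean((enhanced - clean) ** 2)        # pixel-domain L2 difference

def total_loss(l_giou, l_conf, l_cls, enhanced, clean, lambda_rec: float = 0.0):
    return l_giou + l_conf + l_cls + lambda_rec * recovery_loss(enhanced, clean)
```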
This study proposes a hybrid data training strategy for AEA-YOLO to improve detection performance in both normal and challenging weather conditions, as summarized in Algorithm 1. The approach randomly transforms each image by introducing fog before it is passed to the network during training. This approach ensures strong detection accuracy across conditions.
Algorithm 1: Adaptive Enhancement Algorithm AEA-YOLO Training Methodology.
Input:
  • Training dataset D containing images {x} with bounding box annotations.
  • Number of epochs E, batch size B.
  • Probability p of applying challenging environments simulation.
  • Initialized weights:
         – θ for the PPN network based on CNN.
         – β for YOLO detector.
Output:
  • Trained weights (θ*, β*) for AEA-YOLO framework.
1: for epoch = 1 to E do
2:       Shuffle D and partition into batches of size B
3:       for each batch Db in D do
4:                            # --- STEP A: DATA AUGMENTATION AND CHALLENGING ENVIRONMENTS SIMULATION ---
5:             for each image x in Db do
6:                   Generate a random number r ∈[0,1]
7:                   if r < p then
8:                     x ← Simulate Challenging Environment (x)
                    # e.g., fog synthesis
9:                  end if
10:             end for
11:             # --- STEP B: PREDICT FILTER PARAMETERS ---
12:             X-low ← Down sample Each (Db, target-size)
13:             P ← PPN (X-low; θ)
                   # P stores the IEM filter parameters (defog strength, gamma, etc.)
14:             # --- STEP C: IMAGE ENHANCEMENT MODULE---
15:             X-enhancement ← IEM (Db, P)
                   # Apply the IEM filters with parameters P to the full-resolution images
16:             # --- STEP D: OBJECT DETECTION AND LOSS COMPUTATION ---
17:             Y-pred ← YOLO(X-enhancement; β)
18:             L-det ← Detection-Loss (Y-pred, Ygt)
                   # Ygt denotes ground-truth bounding boxes for Db
19:             # --- STEP E: BACKPROPAGATION AND WEIGHT UPDATE ---
20:             (∂L-det/∂θ, ∂L-det/∂β) ← Backprop(L-det)
21:             θ ← θ − η · ∂L-det/∂θ
22:             β ← β − η · ∂L-det/∂β
23:          end for
24: end for
25: return (θ*, β*) # Final trained parameters for AEA-YOLO
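A compact PyTorch rendering of one iteration of Algorithm 1 is given below. The fog-synthesis routine, the `iem`, `ppn`, and `yolo` modules, and the `detection_loss` helper are assumed to exist with interfaces consistent with the sketches above; this is illustrative rather than the paper's exact training code.

```python
# Sketch of one AEA-YOLO training step (Steps A-E of Algorithm 1).
# synthesize_fog() and detection_loss() are assumed helpers, not library calls.
import random
import torch
import torch.nn.functional as F

def train_step(batch_imgs, batch_targets, ppn, iem, yolo, optimizer, p_fog=0.5):
    # STEP A: randomly simulate a challenging environment (e.g., fog synthesis).
    imgs = torch.stack([synthesize_fog(x) if random.random() < p_fog else x
                        for x in batch_imgs])
    # STEP B: predict IEM filter parameters from a low-resolution preview.
    preview = F.interpolate(imgs, size=(256, 256), mode="bilinear", align_corners=False)
    params = ppn(preview)
    # STEP C: apply the IEM filters to the full-resolution images.
    enhanced = iem(imgs, params)
    # STEP D: run the detector and compute the detection loss.
    preds = yolo(enhanced)
    loss = detection_loss(preds, batch_targets)
    # STEP E: joint backpropagation through DN, IEM, and PPN.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```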

4. Experimental Results

To ensure robust performance, we adopted a hybrid training strategy using normal and synthetically degraded images. Adverse conditions were simulated with synthetic fog, and the foggy images were generated with the atmospheric scattering model, as shown in Figure 11. During training, it is important to augment the training data with fog conditions, and using the YOLO detection loss is critical when training the IEM, PPN, and DN jointly. We validated the model using a mix of real-world and synthetic datasets. The AEA-YOLO pipeline involves input image processing, image enhancement, and object detection, with the IEM applying filters and the DN producing bounding boxes and class predictions.
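As a concrete example of the fog synthesis used for augmentation, the sketch below applies the atmospheric scattering model I(x) = J(x)t(x) + A(1 − t(x)) to a clean image. The radial depth proxy, the scattering coefficient, and the atmospheric light value are illustrative assumptions; the actual VOC Fog generation may use a different depth estimate.

```python
# Sketch of synthetic fog generation with the atmospheric scattering model.
import torch

def synthesize_fog(clean: torch.Tensor, beta: float = 1.0, A: float = 0.9) -> torch.Tensor:
    """clean: (3, H, W) in [0, 1]; uses a simple distance-from-center depth proxy."""
    _, H, W = clean.shape
    ys = torch.linspace(-1.0, 1.0, H).view(H, 1).expand(H, W)
    xs = torch.linspace(-1.0, 1.0, W).view(1, W).expand(H, W)
    depth = torch.sqrt(xs ** 2 + ys ** 2)          # hypothetical depth map d(x)
    t = torch.exp(-beta * depth)                   # transmission t(x) = exp(-beta * d(x))
    return clean * t + A * (1.0 - t)               # I(x) = J(x) t(x) + A (1 - t(x))
```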
In this experiment, the suggested AEA-YOLO algorithm’s performance was evaluated using the metrics average precision (AP), mean average precision (mAP), precision (P), and recall (R). We assessed the performance of our method in both regular and challenging circumstances. The filter settings were used to enhance the images and, in some cases, generate a fog dataset for training and validation purposes. We used YOLO’s training methodology in our proposed AEA-YOLO approach. Furthermore, augmentations, such as HorizontalFlip and ColorJitter, were used to increase the heterogeneity in the training data, which improved model generalization. Validation transformations emphasize standardization while balancing training robustness with fair, predictable validation.

4.1. Experiment Dataset, Environment, and Parameters

This experiment used the Real-world Task-driven Testing Set (RTTS) dataset, which contains 4322 natural fog images covering five object classes. However, the small number of images makes it difficult to meet the large data requirements for training deep learning models. To address this, a foggy object detection dataset, VOC Fog, was generated from the PASCAL VOC dataset using the atmospheric scattering model. This dataset contains 15,797 images, with training, validation, and test sets comprising 8111, 2734, and 4952 images, respectively.
Our AEA-YOLO model was trained over 250 epochs using the Adam optimizer, with checkpoints saved to evaluate experiments and compare results. The code was run on an NVIDIA GeForce RTX 4090 24 GB GPU. The experimental parameters were identical for the two datasets to guarantee consistency and fairness across all experiments. Table 3 below displays the specific training parameters.

4.2. Ablation Experiment

To validate the effectiveness of our proposed AEA-YOLO framework, we conducted experiments on a subset of the PASCAL VOC 2007 and VOC 2012 datasets containing five of the 20 object classes: bicycle, bus, car, motorbike, and person. Our AEA-YOLO model was trained with checkpoints saved to evaluate both normal images and artificially degraded images. AEA-YOLO predicts bounding boxes at three different scales, with three anchors per scale. The artificially degraded set was generated by introducing controlled degradations that mimic adverse weather factors. After training the model, we evaluated it in two modes: the first used the Detection Network (DN) only, and the second used IEM + DN (AEA-YOLO), as shown in Table 4 and Figure 12, to compare the results of both.
We tested both modes on 4952 PASCAL VOC 2007 images, restricting the evaluation to the five categories. Nonetheless, these results form a strong baseline for assessing how well our system adapts when encountering adverse conditions. Interestingly, the mean average precision at IoU 0.5 when using DN + IEM (AEA-YOLO) exceeds that of the DN alone (97.84% vs. 89.72%). We attribute this outcome to two factors: (a) the IEM is particularly beneficial on moderately hazy inputs, helping the detector see object boundaries more clearly, and (b) the artificially degraded images may match the model’s training augmentations closely, thus yielding slightly better alignment of features. This convergence in mAP indicates that our model is suitable for real-time object detection applications regardless of weather conditions, whether in normal or challenging environments, whereas existing models struggle in the latter.

4.3. Comparison of Existing Integrated Methods

Several single-framework approaches have emerged that jointly learn image enhancement and detection under adverse conditions. In extremely challenging environments, our suggested AEA-YOLO attempts to improve object detection accuracy by adapting to the issue of poor visual image quality. To demonstrate the usefulness and superiority of our technology, we prepared different types of normal images by generating challenging images. For challenging environments, we used the proposed AEA-YOLO to detect objects and compared the results using visualizations and metrics such as average precision (AP) and mean average precision (mAP). As demonstrated in Table 5, AEA-YOLO outperforms the existing models in artificially degraded images, as an experiment of generalization, indicating its superior performance in challenging environments.
Compared to existing methods, our preliminary results, i.e., 97.84% mAP@50, demonstrate that our integrated pipeline maintains robust detection accuracy under both conditions without manual retuning and without sacrificing normal-image performance. Although direct one-to-one numerical comparisons are complicated by our hybrid training on both normal and artificially degraded images, the following points distinguish our AEA-YOLO framework:
  • Weakly Supervised Parameter Prediction: We use bounding-box supervision only, allowing the Image Enhancement Module (IEM) to adapt its filter parameters automatically per image.
  • Lightweight Enhancement: A small PPN based on CNN for parameter prediction adds minimal overhead, supporting near-real-time inference speeds, which is an aspect some other joint methods do not emphasize.
  • Unified Hybrid Training: We combine normal and synthetic adverse images during training, enabling a single model to handle multiple conditions rather than specialized domain-adaptive modules.

4.4. Testing on the Real-World Scenario Dataset

We verified the performance of AEA-YOLO for detection in challenging environments on realistic datasets, such as the RTTS dataset. The RTTS dataset includes 4322 naturally hazy images that are annotated for five distinct object classes: person, car, bicycle, motorbike, and bus. Table 6 displays the ideal size of each class and statistics of the RTTS dataset [72].
We compared it to other state-of-the-art methods, including generic object detection, dehazing, domain-adaptive, and multi-task algorithms [73,74,75,76,77,78,79]. The results demonstrate that our network has considerable benefits over the other state-of-the-art networks, as shown in Table 7. Figure 13 shows qualitative comparisons of several approaches on images from RTTS. As shown, our technique detects most objects with high accuracy, both with and without the IEM, and produces no erroneous detection results.
To assess the practical robustness of our proposed AEA-YOLO framework, we tested it on real-world imagery under varying visibility conditions. Specifically, we selected two images of the same scene: one unmodified, representing a foggy image in real-world conditions, and one artificially degraded to emulate severe fog. These two images, illustrated in Figure 14, were used in our AEA-YOLO model and for evaluation in the existing nano YOLO versions.
Our AEA-YOLO successfully identifies the objects with high confidence, above 0.9 in most cases, whether using the foggy image from real-world conditions or the artificially degraded image. While standard YOLO variants, such as YOLOv5n, YOLOv8n, YOLOv10n, and YOLO11n, also detect most targets, several bounding boxes appear partially misplaced or present lower confidence scores, indicating minor difficulty with small objects, such as bicycles, under standard settings. In contrast, AEA-YOLO’s adaptive enhancement module appears to emphasize the salient features, yielding more stable detection.
In terms of the artificially degraded scenario, visibility is drastically reduced to replicate dense fog conditions. Here, the conventional YOLO models exhibit more pronounced detection gaps, often missing or being underconfident about distant or partially obscured objects. As a result, some bounding-box scores drop below 0.5. In contrast, AEA-YOLO consistently boosts the local contrast and sharpens edge cues for objects, sustaining detection confidence at or near normal image levels. While the extreme haze does marginally lower recall, our model preserves sufficiently high precision to maintain reliable results under these challenging conditions.
Overall, these real-scene tests affirm that AEA-YOLO can robustly handle both normal and challenging environments, thereby reducing the domain gap that commonly degrades object detection performance. By adaptively adjusting per-image enhancement parameters, the framework avoids over-enhancement on clear images and effectively restores pertinent details in heavily obscured scenes. Such adaptability is crucial for safety-critical applications where visibility can fluctuate widely in short time spans.

5. Limitations and Future Work

Despite the gains reported, AEA-YOLO is not a universal remedy for every adverse weather scenario. Objects whose bounding-box diagonal spans fewer than eight pixels often disappear during the five strided down-sampling steps of the PPN, so the IEM is tuned by context rather than by the objects themselves, leading to missed detections. Another failure mode arises when the enhancement stage inserts ringing or over-sharpened halos, most often in scenes containing specular glare, because the handcrafted unsharp-mask kernel does not adapt to spatial frequency [33]; the detector’s confidence score then drops because of features the backbone has not seen during training. These cases highlight two intrinsic limits: (a) global, image-level filters cannot recover information that is physically unrecoverable, such as dense haze and extreme under-exposure [61,63], and (b) the current one-stage detection head remains sensitive to small artifacts near object boundaries [8]. Future work could explore local adaptive filtering with spatial attention, multi-scale feature fusion for sub-ten-pixel objects, and perceptual regularization that penalizes halo formation, thereby alleviating the above shortcomings.
All experiments reported in this paper use visible-spectrum RGB inputs because the existing adverse-weather benchmarks, PASCAL VOC Foggy and RTTS, and the public YOLO baselines are RGB-only. At present, we therefore make no empirical claims for infrared, depth, or other sensor modalities. AEA-YOLO improves detection in moderate weather conditions, but it has limits stemming from its global filters, down-sampled preview, sharpening and contrast filters, and focus on visible-spectrum RGB data. Training the recovery-aware variant requires paired clean and degraded images, which can be time-consuming to obtain. Addressing these challenges may necessitate local adaptive filters, multi-scale preview branches, perceptual regularization, and modality-specific pre-training, all of which are reserved for future research. In terms of integration with transformer detectors, recent designs, ranging from Deformable DETR [80] and Swin-DETR [81] to lightweight, task-specific models for railway-catenary inspection [82] and UAV first-person viewing [83], offer stronger global reasoning and may handle complex weather artifacts more gracefully. Adapting AEA-YOLO’s enhancement stage to such transformer backbones, therefore, constitutes a promising direction for future work.
While a formal FPS benchmarking was not included in the initial experiments, the AEA-YOLO framework was implemented with real-time feasibility in mind. Each stage, including the lightweight CNN-based PPN and the differentiable IEM, was designed for GPU-accelerated inference. Informal runtime tests on RTX-class hardware suggest that the processing times per image are well below one minute. Future work will include formal timing benchmarks and direct comparisons with the latest nano YOLO versions in terms of both speed and accuracy.

6. Conclusions

In this paper, the Adaptive Enhancement Algorithm YOLO (AEA-YOLO) framework, which enhances each image to improve detection capability, is proposed. The approach depends on three components: an Image Enhancement Module (IEM), a Parameter Prediction Network (PPN), and a Detection Network (DN) based on YOLO. Foggy images were synthesized for training, the PPN predicts the parameters of the IEM, and the IEM enhances each image before detection. The PPN adds approximately six thousand parameters, which is negligible computational overhead. It captures scene-specific degradations, such as variable haze density or color cast, producing condition-aware filter settings on a per-image basis. The model was tested on traditional datasets for evaluating object detection metrics. The experimental results reveal that our proposed technique outperforms existing models by 7% and more than 12% in mean average precision (mAP) on the artificially degraded PASCAL VOC Foggy dataset and the Real-world Task-driven Testing Set (RTTS), respectively. This design proves particularly effective for capturing scene-specific degradations, such as fog density, without requiring pixel-level supervision. On the domain adaptation front, multi-granularity discriminators foster robust feature alignment across weather domains, mitigating the typical domain gap that arises when training and testing conditions differ dramatically.
Overall, the findings of this paper suggest that adaptive, jointly learned image processing and domain-adaptive detection constitute a powerful direction for reliable object recognition in challenging weather conditions. Despite these advances, limitations remain: scenes fully obscured by dense fog, objects smaller than one-eighth of the input resolution, and ringing artifacts introduced by fixed unsharp masks are still challenging, and the present work is limited to RGB imagery. Future research directions include local adaptive filtering, multi-scale previews, perceptual regularization, modality-aware extensions, and synthetic data generation to reduce halo artifacts, recover lost sub-ten-pixel objects, and improve image quality. We believe that extending this line of research through broader weather simulation, deeper domain-alignment strategies, and hardware-optimized deployment will help realize more robust and resilient vision systems in real-world environments.

Author Contributions

Conceptualization, A.K. and K.E.; methodology, A.K. and K.E.; software, A.K. and K.E.; validation, A.K. and K.E.; formal analysis, A.K. and K.E.; investigation, A.K. and K.E.; resources, A.K. and K.E.; data curation, A.K. and K.E.; writing—original draft preparation, A.K. and K.E.; writing—review and editing, A.K. and K.E.; visualization, A.K. and K.E.; supervision, A.K. and K.E.; project administration, A.K. and K.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Public datasets were used; references are available in this manuscript. All materials related to our study are publicly available: https://drive.google.com/drive/folders/1MK0VZnhgwrOysMBWppqZwnFEHp20Gcm8?usp=sharing (accessed on 12 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
YOLO: You Only Look Once
CNN: Convolutional Neural Network
AI: Artificial Intelligence
AP: Average Precision
mAP: Mean Average Precision

References

  1. Whitehill, J.; Omlin, C.W. Haar features for FACS AU recognition. In Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Southampton, UK, 10–12 April 2006; p. 5. [Google Scholar]
  2. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Corfu, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
  3. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  4. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  7. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  8. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6. [Google Scholar]
  9. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. Technical Report. 2020. Available online: https://arxiv.org/abs/2004.10934 (accessed on 12 June 2025).
  10. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-yolov4: Scaling cross stage partial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038. [Google Scholar]
  11. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al. Zenodo. 2020. Available online: https://zenodo.org/records/3983579 (accessed on 21 April 2025).
  12. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. Technical Report. 2022. Available online: https://arxiv.org/abs/2209.02976 (accessed on 12 June 2025).
  13. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A full-scale reloading. (preprint). 2023. [Google Scholar]
  14. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  15. Jocher, G.; Chaurasia, A.; Qiu, J. YOLOv8: Explore Ultralytics YOLOv8; Ultralytics: Frederick, MD, USA, 2023; Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 12 June 2025).
  16. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 1–21. [Google Scholar]
  17. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  18. Jocher, G.; Qiu, J. YOLO11: Ultralytics YOLO11; Ultralytics: Frederick, MD, USA, 2024; Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 12 June 2025).
  19. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors; Ultralytics: Frederick, MD, USA, 2025; Available online: https://docs.ultralytics.com/zh/models/yolo12/ (accessed on 12 June 2025).
  20. Liu, W.; Hou, X.; Duan, J.; Qiu, G. End-to-end single image fog removal using enhanced cycle consistent adversarial networks. IEEE Trans. Image Process. 2020, 29, 7819–7833. [Google Scholar] [CrossRef]
  21. Li, C.; Guo, C.; Guo, J.; Han, P.; Fu, H.; Cong, R. PDR-Net: Perception-inspired single image dehazing network with refinement. IEEE Trans. Multimed. 2019, 22, 704–716. [Google Scholar] [CrossRef]
  22. Wang, Q.; Jiang, K.; Wang, Z.; Ren, W.; Zhang, J.; Lin, C.-W. Multi-scale progressive fusion network for single image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8346–8355. [Google Scholar]
  23. Hnewa, M.; Radha, H. Multiscale domain adaptive yolo for cross-domain object detection. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Online, 19–22 September 2021; pp. 3323–3327. [Google Scholar]
  24. Oreski, G. YOLO* C—Adding context improves YOLO performance. Neurocomputing 2023, 555, 126655. [Google Scholar] [CrossRef]
  25. Özcan, İ.; Altun, Y.; Parlak, C. Improving YOLO detection performance of autonomous vehicles in adverse weather conditions using metaheuristic algorithms. Appl. Sci. 2024, 14, 5841. [Google Scholar] [CrossRef]
  26. Ding, Q.; Li, P.; Yan, X.; Shi, D.; Liang, L.; Wang, W.; Xie, H.; Li, J.; Wei, M. CF-YOLO: Cross fusion YOLO for object detection in adverse weather with a high-quality real snow dataset. IEEE Trans. Intell. Transp. Syst. 2023, 24, 10749–10759. [Google Scholar] [CrossRef]
  27. Liu, G.; Huang, Y.; Yan, S.; Hou, E. RFCS-YOLO: Target Detection Algorithm in Adverse Weather Conditions via Receptive Field Enhancement and Cross-Scale Fusion. Sensors 2025, 25, 912. [Google Scholar] [CrossRef]
  28. Ho, T.T.; Kim, T.; Kim, W.J.; Lee, C.H.; Chae, K.J.; Bak, S.H.; Kwon, S.O.; Jin, G.Y.; Park, E.-K.; Choi, S. A 3D-CNN model with CT-based parametric response mapping for classifying COPD subjects. Sci. Rep. 2021, 11, 34. [Google Scholar] [CrossRef]
  29. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  30. Gavrilescu, R.; Zet, C.; Foșalău, C.; Skoczylas, M.; Cotovanu, D. Faster R-CNN: An approach to real-time object detection. In Proceedings of the 2018 International Conference and Exposition on Electrical and Power Engineering (EPE), Iasi, Romania, 18–19 October 2018; pp. 165–168. [Google Scholar]
  31. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  32. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  33. Polesel, A.; Ramponi, G.; Mathews, V.J. Image enhancement via adaptive unsharp masking. IEEE Trans. Image Process. 2000, 9, 505–510. [Google Scholar] [CrossRef] [PubMed]
  34. Yu, Z.; Bajaj, C. A fast and adaptive method for image contrast enhancement. In Proceedings of the 2004 International Conference on Image Processing, ICIP’04, Singapore, 24–27 October 2004; Volume 2, pp. 1001–1004. [Google Scholar]
  35. Wang, W.; Chen, Z.; Yuan, X.; Guan, F. An adaptive weak light image enhancement method. In Proceedings of the Twelfth International Conference on Signal Processing Systems, Shanghai, China, 6–9 November 2020; Volume 11719, p. 1171902. [Google Scholar]
  36. Hu, Y.; He, H.; Xu, C.; Wang, B.; Lin, S. Exposure: A white-box photo post-processing framework. ACM Trans. Graph. (TOG) 2018, 37, 1–17. [Google Scholar] [CrossRef]
  37. Yu, R.; Liu, W.; Zhang, Y.; Qu, Z.; Zhao, D.; Zhang, B. DeepExposure: Learning to Expose Photos with Asynchronously Reinforced Adversarial Learning. Neural Inf. Process. Syst. 2018. Available online: https://proceedings.neurips.cc/paper/2018/file/a5e0ff62be0b08456fc7f1e88812af3d-Paper.pdf (accessed on 12 June 2025).
  38. Zeng, H.; Cai, J.; Li, L.; Cao, Z.; Zhang, L. Learning image-adaptive 3d lookup tables for high performance photo enhancement in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2058–2073. [Google Scholar] [CrossRef]
  39. Liu, X.; Ma, Y.; Shi, Z.; Chen, J. Griddehazenet: Attention-based multi-scale network for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7314–7323. [Google Scholar]
  40. Dong, H.; Pan, J.; Xiang, L.; Hu, Z.; Zhang, X.; Wang, F.; Yang, M.H. Multi-scale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2157–2167. [Google Scholar]
  41. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1780–1789. [Google Scholar]
  42. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3339–3348. [Google Scholar]
  43. Huang, S.C.; Le, T.H.; Jaw, D.W. DSNet: Joint semantic learning for object detection in inclement weather conditions. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2623–2633. [Google Scholar] [CrossRef]
  44. Sindagi, V.A.; Oza, P.; Yasarla, R.; Patel, V.M. Prior-based domain adaptive object detection for hazy and rainy conditions. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 763–780. [Google Scholar]
  45. Chen, K.; Franko, K.; Sang, R. Structured model pruning of convolutional networks on tensor processing units. In Proceedings of the 2021 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Washington, DC, USA, 6–9 June 2021; pp. 1–5. [Google Scholar]
  46. Kalwar, S.; Patel, D.; Aanegola, A.; Konda, K.R.; Garg, S.; Krishna, K.M. Gdip: Gated differentiable image processing for object detection in adverse conditions. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 7083–7089. [Google Scholar]
  47. Fu, Z.; Chang, K.; Ling, M.; Zhang, Q.; Qi, E. Auxiliary Domain-Guided Adaptive Object Detection in Adverse Weather Conditions. In Asian Conference on Computer Vision; Springer: Singapore, 2025; pp. 312–329. [Google Scholar]
  48. Wang, Y.; Xu, T.; Fan, Z.; Xue, T.; Gu, J. Adaptiveisp: Learning an adaptive image signal processor for object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 112598–112623. [Google Scholar]
  49. Hu, J.; Wei, Y.; Chen, W.; Zhi, X.; Zhang, W. CM-YOLO: Typical Object Detection Method in Remote Sensing Cloud and Mist Scene Images. Remote Sens. 2025, 17, 125. [Google Scholar] [CrossRef]
  50. Khanum, A.; Lee, C.Y.; Yang, C.S. Involvement of deep learning for vision sensor-based autonomous driving control: A review. IEEE Sens. J. 2023, 23, 15321–15341. [Google Scholar] [CrossRef]
  51. Wang, W.; Zhang, J.; Zhai, W.; Cao, Y.; Tao, D. Robust object detection via adversarial novel style exploration. IEEE Trans. Image Process. 2022, 31, 1949–1962. [Google Scholar] [CrossRef]
  52. Shen, D.; Wu, G.; Suk, H.I. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 2017, 19, 221–248. [Google Scholar] [CrossRef]
  53. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. An all-in-one network for dehazing and beyond. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4770–4778. [Google Scholar]
  54. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14821–14831. [Google Scholar]
  55. Peng, D.; Ding, W.; Zhen, T. A novel low light object detection method based on the YOLOv5 fusion feature enhancement. Sci. Rep. 2024, 14, 4486. [Google Scholar] [CrossRef]
56. Varailhon, S.; Aminbeidokhti, M.; Pedersoli, M.; Granger, E. Source-Free Domain Adaptation for YOLO Object Detection. In Proceedings of the Computer Vision–ECCV 2024 Workshops, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 218–235. [Google Scholar] [CrossRef]
  57. Li, Y.; Zhang, S.; Liu, Y.; Yang, J. Loose to compact feature alignment for domain adaptive object detection. Pattern Recognit. Lett. 2024, 181, 92–98. [Google Scholar] [CrossRef]
  58. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9446–9454. [Google Scholar]
  59. Venkatakrishnan, S.V.; Bouman, C.A.; Wohlberg, B. Plug-and-play priors for model based reconstruction. In Proceedings of the 2013 IEEE Global Conference on Signal and Information Processing, Austin, TX, USA, 3–5 December 2013; pp. 945–948. [Google Scholar]
  60. Zhang, Y.; Zhang, J.; Guo, X. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1632–1640. [Google Scholar]
  61. He, K.; Sun, J.; Tang, X. Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 2341–2353. [Google Scholar]
  62. Lore, K.G.; Akintayo, A.; Sarkar, S. LLNet: A deep autoencoder approach to natural low-light image enhancement. Pattern Recognit. 2017, 61, 650–662. [Google Scholar] [CrossRef]
63. Narasimhan, S.G.; Nayar, S.K. Vision and the atmosphere. Int. J. Comput. Vis. 2002, 48, 233–254. [Google Scholar] [CrossRef]
  64. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  65. Gongguo, Z.; Junhao, W. An improved small target detection method based on Yolo V3. In Proceedings of the 2021 International Conference on Electronics, Circuits and Information Engineering (ECIE), Zhengzhou, China, 22–24 January 2021; pp. 220–223. [Google Scholar]
  66. Abbasi, H.; Amini, M.; Yu, F.R. Fog-aware adaptive yolo for object detection in adverse weather. In Proceedings of the 2023 IEEE Sensors Applications Symposium (SAS), Ottawa, ON, Canada, 18–20 July 2023; pp. 1–6. [Google Scholar]
  67. Chu, Z. D-YOLO: A Robust Framework for Object Detection in Adverse Weather Conditions. Technical Report. 2024. Available online: https://arxiv.org/abs/2403.09233 (accessed on 12 June 2025).
  68. Zhang, L.; Zhou, W.; Fan, H.; Luo, T.; Ling, H. Robust domain adaptive object detection with unified multi-granularity alignment. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9161–9178. [Google Scholar] [CrossRef]
  69. Chen, Y.; Wang, Y.; Zou, Z.; Dan, W. GMS-YOLO: A Lightweight Real-Time Object Detection Algorithm for Pedestrians and Vehicles Under Foggy Conditions. IEEE Internet Things J. 2025. [Google Scholar] [CrossRef]
  70. Wang, H.; Shi, Z.; Zhu, C. Enhanced Multi-Scale Object Detection Algorithm for Foggy Traffic Scenarios. Comput. Mater. Contin. 2025, 82, 2451–2474. [Google Scholar] [CrossRef]
  71. Gharatappeh, S.; Sekeh, S.; Dhiman, V. Weather-Aware Object Detection Transformer for Domain Adaptation. Technical Report. 2025. Available online: https://arxiv.org/abs/2504.10877 (accessed on 12 June 2025).
  72. Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; Wang, Z. Benchmarking single-image dehazing and beyond. IEEE Trans. Image Process. 2018, 28, 492–505. [Google Scholar] [CrossRef]
  73. Wang, Y.; Yan, X.; Zhang, K.; Gong, L.; Xie, H.; Wang, F.L.; Wei, M. Togethernet: Bridging image restoration and object detection together via dynamic enhancement learning. Comput. Graph. Forum 2022, 41, 465–476. [Google Scholar] [CrossRef]
  74. Wang, H.; Xu, Y.; He, Y.; Cai, Y.; Chen, L.; Li, Y.; Sotelo, M.A.; Li, Z. YOLOv5-Fog: A multiobjective visual detection algorithm for fog driving scenes based on improved YOLOv5. IEEE Trans. Instrum. Meas. 2022, 71, 1–12. [Google Scholar] [CrossRef]
  75. Hnewa, M.; Radha, H. Integrated multiscale domain adaptive yolo. IEEE Trans. Image Process. 2023, 32, 1857–1867. [Google Scholar] [CrossRef] [PubMed]
  76. Ma, J.; Lin, M.; Zhou, G.; Jia, Z. Joint Image Restoration for Domain Adaptive Object Detection in Foggy Weather Condition. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 542–548. [Google Scholar]
  77. Ogino, Y.; Shoji, Y.; Toizumi, T.; Ito, A. ERUP-YOLO: Enhancing Object Detection Robustness for Adverse Weather Condition by Unified Image-Adaptive Processing. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 8597–8605. [Google Scholar]
  78. Agarwal, S.; Birman, R.; Hadar, O. WARLearn: Weather-Adaptive Representation Learning. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 4978–4987. [Google Scholar]
  79. Liu, Z.; Fang, T.; Lu, H.; Zhang, W.; Lan, R. MASFNet: Multi-scale Adaptive Sampling Fusion Network for Object Detection in Adverse Weather. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15. [Google Scholar] [CrossRef]
  80. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations (ICLR) 2021, Virtual Conference, 25–29 May 2021. [Google Scholar]
81. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  82. Chen, Z.; Yang, J.; Li, F.; Feng, Z.; Chen, L.; Jia, L.; Li, P. Foreign Object Detection Method for Railway Catenary Based on a Scarce Image Generation Model and Lightweight Perception Architecture. IEEE Trans. Circuits Syst. Video Technol. 2025. [Google Scholar] [CrossRef]
  83. Yan, L.; Wang, Q.; Zhao, J.; Guan, Q.; Tang, Z.; Zhang, J.; Liu, D. Radiance Field Learners as UAV First-Person Viewers. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 88–107. [Google Scholar]
Figure 1. YOLO version timeline [6,7,8,9,10,11,12,13,14,15,16,17,18,19].
Figure 2. Adaptive Enhancement Algorithm AEA-YOLO model.
Figure 3. White balance filter.
Figure 4. Gamma filter.
Figure 5. Contrast filter.
Figure 6. Tone filter.
Figure 7. Image example of the sharpening filter.
Figure 8. Defog filter.
Figure 9. Image filters.
Figure 10. Detection Network (DN) diagram.
Figure 11. Atmospheric scattering model for generating foggy images.
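The haze for the PASCAL VOC Foggy experiments is synthesized with the atmospheric scattering model illustrated in Figure 11 [63], I(x) = J(x)t(x) + A(1 − t(x)) with transmission t(x) = exp(−βd(x)). The snippet below is a minimal NumPy sketch of that model; the fog density `beta`, the atmospheric light `airlight`, and the example depth ramp are illustrative assumptions, not the exact settings used in the paper.

```python
import numpy as np

def add_synthetic_fog(image, depth, beta=1.0, airlight=0.9):
    """Apply the atmospheric scattering model I = J*t + A*(1 - t),
    with transmission t = exp(-beta * depth).

    image    : HxWx3 float array in [0, 1] (clean image J)
    depth    : HxW float array of (relative) scene depth d(x)
    beta     : scattering coefficient / fog density (assumed value)
    airlight : global atmospheric light A (assumed value)
    """
    t = np.exp(-beta * depth)[..., None]      # HxWx1 transmission map
    foggy = image * t + airlight * (1.0 - t)  # scattering model
    return np.clip(foggy, 0.0, 1.0)

# Example with a synthetic depth ramp standing in for a real depth map.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.random((480, 640, 3))
    depth = np.tile(np.linspace(0.2, 2.0, 640), (480, 1))
    foggy = add_synthetic_fog(clean, depth, beta=1.2, airlight=0.85)
```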
Figure 12. Confusion matrix from left to right: DN and IEM + DN (AEA-YOLO).
Figure 13. Real image detection accuracy from left to right: YOLOv5-Fog, YOLOv9 + CLEO, YOLOv12x, our proposed DN without IEM, and our proposed IEM + DN (AEA-YOLO).
Figure 14. AEA-YOLO model testing in a real-world scenario and evaluation with YOLO nano versions.
Table 1. PPN architecture.
Layer | Output Dimensions | Details
Input image (i) | 3 × i_size × i_size | Resized input image (256 × 256)
Conv Block 1 | 16 × 128 × 128 | 3 × 3, stride 2
Conv Block 2 | 32 × 64 × 64 | 3 × 3, stride 2
Conv Block 3 | 32 × 32 × 32 | 3 × 3, stride 2
Conv Block 4 | 32 × 16 × 16 | 3 × 3, stride 2
Conv Block 5 | 32 × 8 × 8 | 3 × 3, stride 2
GAP | 32 | 32 × 8 × 8 feature tensor pooled to a 32-D vector
FC1 | 128 | Linear layer
FC2 | N | 15 output parameters
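For reference, the architecture in Table 1 can be expressed as a compact PyTorch module. This is a sketch based only on the table: the activation function and the absence of normalization inside each conv block are assumptions, so the parameter count of this sketch need not match the published PPN exactly.

```python
import torch
import torch.nn as nn

class ParameterPredictionNetwork(nn.Module):
    """Sketch of the PPN in Table 1: five stride-2 3x3 conv blocks,
    global average pooling, and two fully connected layers producing
    the 15 raw filter parameters. LeakyReLU is an assumed activation."""

    def __init__(self, num_params=15):
        super().__init__()
        channels = [3, 16, 32, 32, 32, 32]
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.1, inplace=True)]
        self.features = nn.Sequential(*blocks)   # 256 -> 128 -> 64 -> 32 -> 16 -> 8
        self.gap = nn.AdaptiveAvgPool2d(1)       # 32 x 8 x 8 -> 32 x 1 x 1
        self.fc1 = nn.Linear(32, 128)
        self.fc2 = nn.Linear(128, num_params)    # 15 raw filter parameters

    def forward(self, x):                        # x: (B, 3, 256, 256)
        f = self.gap(self.features(x)).flatten(1)
        return self.fc2(torch.relu(self.fc1(f)))

# params = ParameterPredictionNetwork()(torch.randn(1, 3, 256, 256))  # -> (1, 15)
```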
Table 2. The constraints of the PPN.
Filter | Parameter (θ) | Count | Symbol(s) | Range
Defog | 0 | 1 | ω | [0.1, 1.0]
White balance | 1–3 | 3 | Wr, Wg, Wb | [0.91, 1.10] (that is 1 ± 10%)
Gamma | 4 | 1 | γ | [1/3, 3]
Tone | 5–12 | 8 | t0, …, t7 | [0.5, 2.0]
Contrast | 13 | 1 | λ | [1/3.5, 3.5]
Unsharp-mask | 14 | 1 | κ | [0, 5]
Total | – | 15 | – | –
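The raw FC2 outputs are unbounded, so they must be mapped into the ranges of Table 2 before the IEM filters consume them. The sketch below assumes a scaled sigmoid for this mapping (the paper's exact squashing function is not reproduced here) and illustrates two of the simpler differentiable filters, white balance and gamma, applied with the constrained parameters.

```python
import torch

# Per-parameter ranges from Table 2 (index -> (low, high)).
PARAM_RANGES = (
    [(0.1, 1.0)] +            # defog omega
    [(0.91, 1.10)] * 3 +      # white balance Wr, Wg, Wb
    [(1.0 / 3.0, 3.0)] +      # gamma
    [(0.5, 2.0)] * 8 +        # tone t0..t7
    [(1.0 / 3.5, 3.5)] +      # contrast lambda
    [(0.0, 5.0)]              # unsharp-mask kappa
)

def constrain_parameters(raw):
    """Map the 15 raw PPN outputs (B, 15) into the Table 2 ranges with a
    scaled sigmoid; the choice of sigmoid is an illustrative assumption."""
    lows = torch.tensor([lo for lo, _ in PARAM_RANGES])
    highs = torch.tensor([hi for _, hi in PARAM_RANGES])
    return lows + (highs - lows) * torch.sigmoid(raw)

def apply_gamma_and_white_balance(img, params):
    """img: (B, 3, H, W) in [0, 1]; params: (B, 15) already constrained."""
    wb = params[:, 1:4].view(-1, 3, 1, 1)       # per-channel gains
    gamma = params[:, 4].view(-1, 1, 1, 1)      # gamma exponent
    return (img * wb).clamp(1e-6, 1.0) ** gamma
```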
Table 3. Experimental parameters.
Parameter | Value
Input image size | 640 × 640
Training batch size | 6
Initial learning rate | 1 × 10⁻⁴
Momentum | 0.937
Optimizer | Adam
Epochs | 250
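A minimal configuration sketch corresponding to Table 3 is shown below; interpreting the reported momentum (0.937) as Adam's β1 is an assumption, and all other optimizer settings are left at PyTorch defaults.

```python
import torch

TRAIN_CFG = {
    "img_size": 640,    # input resolution (640 x 640)
    "batch_size": 6,
    "lr": 1e-4,         # initial learning rate
    "momentum": 0.937,  # interpreted here as Adam's beta1 (assumption)
    "epochs": 250,
}

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Adam optimizer mirroring Table 3; beta2 and weight decay use defaults."""
    return torch.optim.Adam(model.parameters(),
                            lr=TRAIN_CFG["lr"],
                            betas=(TRAIN_CFG["momentum"], 0.999))
```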
Table 4. Evaluation of AEA-YOLO.
Mode | Bicycle AP | Bus AP | Car AP | Motorbike AP | Person AP | mAP@50
DN | 85.76 | 94.75 | 91.52 | 91.92 | 84.65 | 89.72%
IEM + DN | 97.36 | 98.81 | 98.22 | 98.33 | 96.47 | 97.84%
Table 5. Comparison of the proposed AEA-YOLO model with existing models using PASCAL VOC Foggy images.
Model | mAP@50 | Features/Domain
Fog-Aware YOLO [66] (2023) | 72.00% | An adaptive object detection block determines the fogginess level of an image before detection, skipping pre-processing when the fog level is below a threshold, and then applies the YOLOv3 algorithm.
D-YOLO [67] (2024) | 43.00% | A double-route network with an attention feature fusion module that incorporates hazy and dehazed features, plus a subnetwork for haze-free features.
MG-ADA [68] (2024) | 62.10% | Multi-granularity alignment (pixel, instance, category) for domain adaptation (synthetic-to-real adaptation detection experiment).
GMS-YOLO [69] (2025) | 64.70% | Based on YOLOv10; uses a Ghost Multi-Scale Convolution module, a Shape-Consistent Intersection over Union localization loss, and a Compensatory Consistency Matching Metric for sensitivity reduction.
Multi-Scale [70] (2025) | 81.07% | Enhances feature extraction with Triplet Attention, integrates the Diverse Branch Block for semantic information fusion, introduces a decoupled detection head, and uses Minimum Point Distance for faster training convergence.
PL-RT-DETR [71] (2025) | 90.90% | A domain adaptation strategy with a weather-adaptive attention mechanism and a weather fusion encoder that enforces feature-level consistency across domains and adapts to fog contexts.
AEA-YOLO (Ours) | 97.84% | Adaptable, learning-based filters that unify domain alignment with detection.
Table 6. Statistics of the RTTS dataset.
RTTS Dataset (4322 Images)
Class | Person | Car | Bicycle | Motorbike | Bus | All
Count | 7950 | 18,413 | 534 | 862 | 1838 | 29,597
Table 7. Detection results in domain adaptation networks on the RTTS dataset.
Model | Person | Car | Bicycle | Motorbike | Bus | mAP@50
MS-DAYOLO [23] 2021 | 81.30 | 68.00 | 61.40 | 54.60 | 35.20 | 60.10%
TogetherNet [73] 2022 | 82.70 | 75.32 | 57.27 | 55.40 | 37.04 | 61.55%
YOLOv5-Fog [74] 2022 | – | – | – | – | – | 77.80%
IMS-DAYOLO [75] 2023 | 80.50 | 65.10 | 61.20 | 51.90 | 33.00 | 58.30%
YOLOv9 + CLEO [25] 2024 | – | – | – | – | – | 79.30%
DA-YOLOX [76] 2024 | 78.33 | 75.88 | 59.10 | 54.98 | 44.42 | 62.60%
ERUP-YOLO [77] 2025 | – | – | – | – | – | 49.81%
WARLearn [78] 2025 | – | – | – | – | – | 52.60%
AD-DAYOLO [47] 2025 | 82.40 | 71.10 | 65.10 | 59.10 | 38.60 | 63.20%
MASFNet [79] 2025 | 85.15 | 80.49 | 72.55 | 66.11 | 64.10 | 73.68%
YOLOv12n 2025 | 84.30 | 86.30 | 71.10 | 74.10 | 66.20 | 76.40%
YOLOv12x 2025 | 89.00 | 91.10 | 79.10 | 80.30 | 74.60 | 82.80%
AEA-YOLO (ours) | 93.79 | 94.49 | 95.92 | 96.69 | 97.16 | 95.61%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
