Performance Degradation of Object Detection Neural Networks Under Natural Visual Contamination in Autonomous Driving

Csikor, Dániel; Hollósi, János

doi:10.3390/computers15040254

by Dániel Csikor and János Hollósi *

Vehicle Industry Research Center, Széchenyi István University, 1. Egyetem tér, H-9026 Győr, Hungary

* Author to whom correspondence should be addressed.

Computers 2026, 15(4), 254; https://doi.org/10.3390/computers15040254

Submission received: 24 March 2026 / Revised: 14 April 2026 / Accepted: 16 April 2026 / Published: 17 April 2026

Abstract

The operation of driver assistance systems and autonomous vehicles requires a sensor system and a control algorithm. Sensors provide information to detect people, vehicles and objects in the vehicle’s environment; however, their performance can be degraded by adverse environmental conditions and contamination. A literature review identified factors that reduce sensor visibility, such as weather conditions and external contamination. In this study, the detection efficiency of state-of-the-art neural network-based object detectors was examined in a simulation environment using a synthetic dataset. A custom dataset comprising six urban and suburban traffic scenarios was created, including clean images and ten contaminated variants per scene with increasing mud coverage. The results show that contamination leads to a measurable reduction in detection performance across all models. Smaller variants are more sensitive to degradation, while medium-complexity models provide a favorable balance between robustness and computational cost. Increasing model size yields limited additional robustness, and performance differences between architectures highlight the importance of model design. Furthermore, the spatial distribution of contamination, particularly near the image center, has a significant impact on performance in addition to its overall extent.

Keywords: autonomous vehicle; object detection; weather conditions; robustness analysis

1. Introduction

One of the most important and perhaps most challenging tasks facing the automotive industry today is the research and development of autonomous vehicles. In recent years, a wide variety of driver assistance systems have appeared, some of which are standard equipment, while further technologies that greatly increase traffic safety are available as options for almost all models. Automatic emergency braking, blind spot monitoring, lane keeping assist, adaptive cruise control, and traffic sign recognition are all driver assistance systems that serve as an important basis for the emergence of self-driving vehicles. Driver assistance systems and, in the future, autonomous vehicles require a coordinated sensor system and control algorithm to function effectively. Devices that must be installed in a car to enable autonomous operation on public roads include LIDAR, RADAR, a stereo video camera, high-precision GPS positioning, a gyroscope, ultrasonic sensors, and a continuously updated map database with the most accurate road information possible, including data on traffic congestion. The efficiency of autonomous vehicles is influenced by a number of factors, including the detection accuracy of the built-in sensors, the efficiency of the localization algorithm, the range of the sensors, and their angle of view. Weather and environmental conditions have a particular impact on the performance and reliability of autonomous vehicles [1,2].

Variable weather conditions, including rain, snow, fog, ice formation, and varying light conditions, pose challenges for the sensors responsible for the operation of autonomous vehicles. Rain and snow have a significant impact on detection accuracy because water droplets and snowflakes can distort light under certain conditions, so that the sensors transmit less data, and less accurate data, to the data processing system. Fog and smog also pose a challenge for these systems, as they limit visibility, which affects the detection efficiency of camera-based systems and LIDAR sensors. Strong sunlight can blind the camera, much as it does the human eye, so this condition is also a critical factor in the proper functioning of driver assistance systems and autonomous vehicles. All of the above factors affect the ability of autonomous systems to detect obstacles and other road users in time, reducing the time available for the system to respond and slowing the decision-making of the algorithms [3]. Within the scope of this study, the experimental analysis was limited exclusively to mud contamination, and other potentially influential environmental factors, such as rain, fog, or lighting variations, were not considered.

In one simulation study, the radar detection range was examined under different rainfall intensities. The simulation took into account multi-path effects and different signal-to-interference-plus-noise ratios (SINR) of 10 dB, 13 dB, and 20 dB. For objects with a small radar cross section, such as pedestrians, the detection range decreased by 55% in extreme precipitation (400 mm/h) [4,5].

Another measurement campaign also examined the impact of weather conditions. Statistical analyses showed a significant correlation between weather conditions and the number of detected points. The greatest performance degradation occurred in foggy conditions, especially when visibility fell below 50 m. Among the test targets, the smallest decline was observed for the retroreflective foil, while the black plate showed a significant decline even in light rain [6]. In this paper, the evaluation is limited to the effects of mud contamination, whereas other challenging environmental conditions were not included in the experiments.

Numerous challenges and unresolved issues remain before autonomous vehicles become widespread: in addition to technological problems, social division, uncertainty about liability, fear of immature technology, and hacker attacks all need to be addressed. Once the technological problems are resolved and the legal and ethical issues clarified, autonomous vehicles may become commercially available in certain countries and under certain conditions within a few years [7].

The aim of this research is to investigate how the performance of state-of-the-art neural network-based object detection algorithms used for autonomous vehicle environmental perception deteriorates under different weather conditions and “natural” visual contamination. To this end, we generate a synthetic dataset in the aiMotive aiSim simulation environment comprising six traffic scenarios, each including a clean baseline and 10 progressively increasing contamination levels. We then evaluate multiple modern detector families (YOLOv8, YOLO-World, YOLO-NAS, YOLOv11, YOLOv12, and RT-DETR) using F1-score, mAP50, mAP50–95, and recall, and quantify the extent of performance degradation across scenarios and contamination levels. The study is conducted in a repeatable simulation setting designed to approximate real-world autonomous driving conditions, enabling controlled assessment of camera-based perception limitations and the development of a comparable benchmarking framework. This framework is intended to support future extensions of the methodology toward validation with real-world driving data. An important goal of the study is to explore the extent to which different levels of contamination affect the reliability of object detection based on camera images, and to determine which of the detector families examined demonstrate greater robustness in each scenario.
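
To make the evaluation metrics named above concrete, the following minimal sketch computes precision, recall, and F1-score for one image at a single IoU threshold. It is an illustration only and not the evaluation code used in the study: the box format, the greedy one-to-one matching, and the function names `iou` and `precision_recall_f1` are assumptions, and predictions are assumed to arrive already sorted by confidence. mAP is not computed here, since it additionally requires averaging precision over confidence thresholds and classes (and, for mAP50–95, over IoU thresholds from 0.50 to 0.95).

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(preds: List[Box], truths: List[Box], thr: float = 0.5):
    """Greedy one-to-one matching of predictions to ground truth at one IoU threshold."""
    unmatched = list(truths)
    tp = 0
    for p in preds:
        # Pick the still-unmatched ground-truth box that overlaps this prediction most.
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)
            tp += 1
    fp = len(preds) - tp   # predictions with no matching ground truth
    fn = len(truths) - tp  # ground-truth objects that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In this sketch, an object hidden by contamination shows up as a false negative and therefore lowers recall, whereas a spurious detection triggered by a contamination patch shows up as a false positive and lowers precision.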

Recent literature has clearly established that adverse environmental conditions such as rain, fog, snow, low light, and sensor contamination can substantially degrade the perception performance of autonomous driving systems. Existing studies and surveys have shown that these effects reduce sensing reliability, impair object detection, and ultimately influence downstream decision-making and safety-critical vehicle functions. However, the available literature is dominated either by general reviews of adverse-weather perception, by studies focusing on a single sensing modality or algorithm, or by works addressing contamination primarily from the perspective of soiling detection or image restoration rather than comparative detector robustness evaluation [8,9,10]. The scope of the present analysis is restricted to mud contamination, while other adverse environmental effects that may influence perception performance were not examined.

Augustine et al. (2024) [11] investigate the impact of camera module defects, particularly blemish artifacts, on object detection in autonomous driving systems, with a particular focus on how hardware-induced image degradation affects the reliability of AI-based perception. The paper places object detection within the broader context of autonomous driving safety and emphasizes that dependable visual sensing is a prerequisite for the robust operation of downstream perception functions. A central contribution of the study is that it shifts attention from model-side robustness improvement to the manufacturing quality of camera modules, highlighting that sensor- and optics-level imperfections can systematically propagate through the perception pipeline and degrade detection performance. At the same time, the authors argue that robustness should be assessed not only under external environmental challenges, but also in relation to production-related sensor defects, thereby underlining the need for stricter hardware quality requirements and evaluation frameworks that account for real imaging imperfections [11].

Thottempudi et al. (2025) [12] provide a comprehensive review of object detection for autonomous vehicles under adverse weather conditions, with a particular focus on how deep learning frameworks, multi-sensor fusion strategies, and specialized datasets contribute to perception robustness when visibility is degraded by rain, fog, snow, and related environmental challenges. The paper places object detection within the broader context of autonomous vehicle safety and emphasizes that reliable obstacle perception is a prerequisite for dependable real-world navigation. A central contribution of the review is its comparative analysis of 100 studies, structured around methodologies, toolsets, datasets, and evaluation metrics, which highlights the growing dominance of deep learning and sensor fusion approaches in challenging weather scenarios. At the same time, the authors stress that important limitations remain, including scalability issues, high computational demands, restricted real-world generalizability of existing datasets, and the continuing need for more weather-resilient perception systems and stronger evaluation frameworks for autonomous driving applications [12].

Kumar et al. (2026) [13] provide a comprehensive review of sensor systems for autonomous vehicles, with a particular focus on the functionality, limitations, and reliability challenges of the main sensing modalities under adverse environmental conditions. The paper places autonomous vehicle perception within a broader system-level context and emphasizes that safe and dependable autonomous operation depends on the coordinated use of cameras, LiDAR, radar, ultrasonic sensors, GPS/GNSS, IMU/INS, odometry sensors, and acoustic systems rather than on any single sensing technology alone. A central contribution of the review is its structured discussion of the working principles, advantages, and weaknesses of these sensors, together with the role of sensor fusion in improving redundancy, perception accuracy, and robustness. At the same time, the authors stress that adverse climatic conditions, integration complexity, cybersecurity concerns, and sensor reliability issues remain major barriers to full autonomy, highlighting the need for more resilient sensing architectures and more robust perception frameworks for real-world deployment [13].

At the same time, although adverse condition datasets and corruption benchmarks have become increasingly important, there remains a lack of controlled benchmarks for camera-based object detection that systematically evaluate multiple state-of-the-art detector families under progressively increasing, naturalistic lens contamination in autonomous-driving scenarios. This is an important research gap, because camera lens contamination represents a practically relevant failure mode in real traffic, while simulation-based evaluation offers the repeatability and controllability required for isolating its effect. Such a benchmark is also valuable from a validation perspective, since synthetic environments make it possible to reproduce rare or difficult-to-control conditions, even though the simulation-to-reality gap must still be acknowledged [14].

The novelty of the publication is that it examines the performance degradation of camera-based object detection on a synthetic dataset created in the aiMotive aiSim environment, containing 6 traffic scenarios and 10 gradually increasing contamination levels per scenario. Specifically, it evaluates several state-of-the-art detector families (YOLOv8, YOLO-World, YOLO-NAS, YOLOv11, YOLOv12, and RT-DETR) using a unified set of metrics (F1-score, mAP50, mAP50–95, and recall), allowing the robustness of the models to be quantified directly and comparably under the same conditions. Another novelty of the study is that it does not examine the performance of a single model in a single contamination condition, but compares the effectiveness of several object detection models in gradually deteriorating visibility conditions, thus allowing for a systematic exploration of differences in robustness. It should also be noted that the research does not treat the effect of contamination as a general deterioration in quality, but as a perception problem relevant to autonomous driving, which can directly affect the reliability of detection and, through this, higher-level decision-making processes. Thus, the study provides a reproducible benchmark framework for analyzing the impact of “natural” visual contamination (e.g., mud deposits), which directly informs subsequent real-world validation and robustness-oriented model selection.
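
One simple way to express the degradation quantified in such a benchmark is the fractional loss of a metric relative to the clean baseline, computed per contamination level. The sketch below assumes this definition for illustration; the paper may normalize differently. The function names and the example scores are hypothetical placeholders, not results from the study.

```python
from typing import Dict, List


def relative_drop(clean: float, contaminated: float) -> float:
    """Fractional loss relative to the clean baseline (0.0 = no loss, 1.0 = total loss)."""
    if clean <= 0:
        raise ValueError("clean baseline score must be positive")
    return (clean - contaminated) / clean


def degradation_curve(scores: List[float]) -> List[float]:
    """scores[0] is the clean baseline; scores[1:] are increasing contamination levels."""
    return [relative_drop(scores[0], s) for s in scores[1:]]


# Hypothetical placeholder scores for two made-up models; NOT results from the study.
example: Dict[str, List[float]] = {
    "model_a": [0.80, 0.76, 0.60, 0.40],
    "model_b": [0.70, 0.69, 0.63, 0.56],
}
for name, scores in example.items():
    print(name, [round(d, 3) for d in degradation_curve(scores)])
```

Normalizing by the clean score makes models with different baseline accuracy comparable: a detector that starts lower but loses a smaller fraction of its performance can be considered more robust in this sense.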

The remainder of this paper is organized as follows: Section 2 presents the materials and methods, including the image processing mechanism of the stereo video camera, the background of adversarial attack methods relevant to camera-based perception, the investigated neural networks, and the aiMotive aiSim simulation software framework. Section 3 describes the experimental results obtained on clean and contaminated images, including the comparative performance analysis and robustness assessment of the evaluated object detection neural networks. Section 4 discusses the perception, reliability and operational limitations of autonomous driving systems under degraded visual conditions. Finally, Section 5 summarizes the main conclusions of the study, outlines its limitations, and highlights possible directions for future research.

2. Materials and Methods

2.1. Image Processing Mechanism for Stereo Video Camera

Autonomous vehicles must be able to travel on the same infrastructure as human-driven vehicles, but they must meet slightly different requirements to ensure safe travel. Horizontal traffic control elements, i.e., road markings, play a significant role in the safe and accurate positioning of self-driving vehicles on the road. The average driver can cope with worn or missing road markings more easily than an autonomous system. Road markings and lanes can be included on high-resolution digital maps, but in this case, the vehicle’s position is typically determined by satellite positioning, which cannot be guaranteed to be accurate to the centimeter in all circumstances. Some of the traffic signs that are important for drivers, such as direction signs, warning signs, and information signs, contain information that is either already included in the map database or, due to the nature of the autonomous system, is not relevant to safe driving. Driving based solely on non-physical signals would require a real-time, high-precision database, the creation and ongoing operation of which would be a huge task given the size of the transport network and the number of relevant signs on it [15,16,17,18,19].

Machine vision is provided by a lens system and two CMOS (complementary metal-oxide-semiconductor) color image sensors, which typically have a resolution of 1280 × 960 pixels. The color image sensors allow the system to distinguish between colors, which helps it recognize traffic signs and road markings of different colors. The software essentially identifies transitions between known colors, thereby determining the location of lane markings. Video cameras are generally capable of identifying objects at a distance of 50 to 100 m. For successful detection, the electronics look for characteristics typical of motor vehicles: axis of symmetry, license plate, number of lights, glare on the car window. Traffic signs are compared with a database stored in the computer [20,21,22]. Stereo video cameras are well suited for detecting cars, pedestrians, cyclists, and other obstacles, but radars are important for determining more accurate distance and speed values in autonomous vehicles. The camera’s performance is greatly affected by weather conditions, as its operating principle is similar to that of the human eye [23].

Turay and Vladimirova (2022) [24] present a broad survey that connects the evolution of convolutional neural networks for image classification and object detection with their practical role in autonomous driving systems. The paper reviews CNN development in a structured, historical manner, covering both heavyweight and lightweight classification networks as well as one-stage and two-stage detection architectures, while comparing them in terms of accuracy, model size, computational cost, and inference speed. A central contribution of the survey is that it does not treat object detection in isolation, but frames it within the strict real-time, safety, and embedded-computing constraints of autonomous driving, where the trade-off between detection performance and deployability is especially critical. The authors also discuss convolution types, backbone designs, implementation considerations, and industry-oriented design constraints, arguing that progress in autonomous driving perception depends not only on improving detection accuracy, but also on balancing latency, hardware efficiency, and system-level feasibility in real-world vehicle platforms [24].

2.2. Adversarial Attack Methods

The performance of the environmental sensors required for the operation of autonomous vehicles, and the effectiveness of detection at any given moment, are influenced by a number of external factors, including weather conditions such as rain, snow, and fog [25]. This study focuses exclusively on mud-induced visual contamination, without considering other environmental degradation factors such as rain, fog, or illumination changes.

Ruan et al. (2023) [26] provide a review of occluded object detection in real, complex autonomous-driving scenarios, with a particular focus on how detection methods handle partially visible vehicles, pedestrians, and traffic signs under practical sensing and computational constraints. The paper places object detection within the broader context of autonomous driving and emphasizes that real-time and accurate recognition of occluded targets is a prerequisite for reliable downstream decision-making and vehicle control. A central contribution of the review is its structured comparison of recent methods from the previous five years, organized by target category and detection strategy, highlighting both the similarities and differences among approaches developed for the same sensing context. At the same time, the authors stress that occlusion remains a major challenge for autonomous-driving perception, and they identify persistent limitations, open challenges, and future research directions for improving robust target detection in complex real-world environments [26].

Pao et al. (2024) [27] investigate the influence of rain-induced lens obstruction on vehicle camera performance, with a particular focus on how droplet dynamics and lens surface wettability affect image quality and object detection accuracy in camera-based ADAS perception. The paper places camera sensing within the broader context of adverse-weather operation and emphasizes that reliable visual perception is a prerequisite for dependable driver assistance and autonomous driving functions. A central contribution of the study is its wind-tunnel-based experimental framework, which simulates realistic rain conditions perceived by a moving vehicle and enables a structured comparison between hydrophilic and hydrophobic lens surfaces. At the same time, the authors show that droplet characteristics such as size, velocity, shape, and motion can substantially influence both visual quality and downstream detection performance, and they conclude that hydrophobic lens surfaces generally provide better performance than hydrophilic ones under realistic rain-exposure conditions [27].

The quality of camera recordings captured by autonomous vehicles can therefore be influenced and modified by a number of environmental factors, and since the vehicle control system uses this data for object detection, lane recognition, and determining the direction and speed of the vehicle, it is particularly important to identify the factors that affect the efficiency and accuracy of the system. Two groups of such factors can be distinguished. The first comprises weather effects and style transfer phenomena, in which the camera image is modified by software to take on the characteristics of another time of day; the second, adversarial perturbations, is discussed later in this section. The aim of the first approach is to provide the system with difficult visual input, which can be used to examine the extent to which the performance of the sensor changes and, as a result, how the accuracy of recognition is affected. In one study, camera-based recognition algorithms were evaluated in a virtual test environment with varying degrees and colors of lens obstruction. According to the results, the degree of field-of-view obstruction was the most decisive factor in performance degradation [28].
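
The degree of obstruction discussed above can be expressed as the fraction of obstructed pixels in a binary mask. Because this paper also reports that contamination near the image center matters beyond its overall extent, the sketch below adds a center-weighted variant. Both the mask representation and the weighting function (1 at the center, falling to 0.5 at the corners) are assumptions chosen for illustration; they are not taken from the cited study or from the experiments reported here.

```python
from typing import List

Mask = List[List[int]]  # assumed format: 1 = obstructed pixel, 0 = clear pixel


def coverage(mask: Mask) -> float:
    """Fraction of pixels that are obstructed (mask must be non-empty)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def center_weighted_coverage(mask: Mask) -> float:
    """Coverage in which pixels closer to the image center count more.

    The weight 1 / (1 + d), with d the distance to the center normalized to
    [0, 1], is an illustrative assumption.
    """
    h, w = len(mask), len(mask[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    max_d = (cy ** 2 + cx ** 2) ** 0.5 or 1.0  # avoid dividing by zero for a 1x1 mask
    num = den = 0.0
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            d = ((y - cy) ** 2 + (x - cx) ** 2) ** 0.5 / max_d
            weight = 1.0 / (1.0 + d)
            num += weight * v
            den += weight
    return num / den
```

Two masks with identical plain coverage then receive different center-weighted scores depending on where the obstruction lies, which is one way to separate the extent of contamination from its spatial distribution.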

Another study aimed to automatically detect and classify contaminants appearing on the lenses of autonomous vehicle cameras, including raindrops, mud, dust, and snow. The research used a tile-based examination method, in which the camera image was divided into tiles of 64 × 64 pixels. This method was used to examine the extent of contamination and its effect on the performance of the sensors. The research found that contamination can be detected through contrast and sharpness tests, as well as texture analysis. Water droplets are more difficult to detect than dust, for example, because they distort what the camera sees rather than obscuring it [10,29]. Among the possible environmental factors affecting perception, the present study investigates the impact of mud contamination exclusively.
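
The tile-based examination described above can be sketched as follows. This is a simplified illustration rather than the cited method: it uses intensity variance as a crude stand-in for the contrast test only, and the grayscale image format, the threshold value, and the function names are assumptions.

```python
from typing import List, Tuple

Image = List[List[float]]  # assumed format: grayscale intensities, row-major


def tile_variance(img: Image, top: int, left: int, size: int) -> float:
    """Intensity variance inside one tile, used here as a crude contrast measure."""
    values = [v for row in img[top:top + size] for v in row[left:left + size]]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def flag_low_contrast_tiles(
    img: Image, size: int = 64, thr: float = 25.0
) -> List[Tuple[int, int]]:
    """Return (tile_row, tile_col) indices whose variance is below the threshold.

    The threshold is an illustrative assumption; partial tiles at the right
    and bottom borders are ignored.
    """
    h, w = len(img), len(img[0])
    flagged = []
    for ty in range(h // size):
        for tx in range(w // size):
            if tile_variance(img, ty * size, tx * size, size) < thr:
                flagged.append((ty, tx))
    return flagged
```

A variance test of this kind responds to opaque deposits such as mud, which flatten the image locally. It would be less suitable for water droplets, which, as noted above, distort the scene rather than obscure it.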

With the neural style transfer method, the textures and colors of a captured camera image can be transformed into another style or environment by defining parameters. The method can be used to create environments with different color schemes, such as sunset or twilight, which can confuse the object recognition algorithm, making it more difficult to identify other road users. For this reason, it is particularly important to identify the weak points of the sensors so that the system can be prepared for these weather and environmental conditions, either in terms of hardware or software [30].

The style transfer method can also be used for data augmentation, to increase the robustness of recognition models. In their research, Jianing et al. generated night-style images from daytime images using the CycleGAN methodology and used them to train the detector to recognize objects more effectively in nighttime conditions [30].

Unlike the methods described above, adversarial perturbations introduce pixel differences into the input data that are barely noticeable, almost invisible to the naked eye. Studies have shown that even minimal noise can have a significant impact on the accuracy of the system, causing the neural network to draw incorrect conclusions from the data [25,31,32].

Iterative refinement algorithms such as Projected Gradient Descent (PGD) build up the perturbation over several steps. Optimization-based methods such as the Carlini-Wagner attack can produce specific, predetermined outputs: for example, a traffic sign can be made not merely unrecognizable to the neural network but classified into a different category (Speed Limit instead of STOP), while the image seen by the camera still shows a regular STOP sign. The effectiveness of adversarial patterns in application and testing has been documented in several studies. Based on current research, even a model operating with over 90% accuracy on a test database can be disrupted by adding a few pixels of targeted noise [25].
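
The iterative scheme behind PGD can be illustrated on a toy model. The sketch below attacks a two-feature logistic-regression classifier rather than an object detector, so that the input gradient can be written analytically; the model, the step size, the perturbation budget, and all function names are assumptions made for illustration. Each step moves the input in the direction that increases the loss and then projects it back into an L-infinity ball of radius `eps` around the original input.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad(x: List[float], w: List[float], b: float, y: int) -> List[float]:
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def pgd_attack(
    x0: List[float], w: List[float], b: float, y: int,
    eps: float = 0.1, alpha: float = 0.02, steps: int = 10,
) -> List[float]:
    """Untargeted L-infinity PGD: ascend the loss, then project into the eps-ball."""
    x = list(x0)
    for _ in range(steps):
        g = loss_grad(x, w, b, y)
        # Signed gradient ascent step on the loss.
        x = [xi + alpha * sign(gi) for xi, gi in zip(x, g)]
        # Projection: keep every coordinate within eps of the original input.
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
    return x
```

The projection step is what keeps the perturbation small enough to remain barely visible. A targeted variant would instead descend the loss toward a chosen wrong class.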

There are two typical attack methods for object recognition algorithms: (a) removing an existing object (e.g., the neural network does not detect a pedestrian or a stop sign), and (b) placing a non-existent object in space (generating false detections). The former is referred to in the literature as a vanishing attack, while the latter is referred to as a fabrication attack. In addition to testing object detection systems, this method can also be used to deceive lane detection algorithms. In this case, the two scenarios above also apply: (a) the lane detection model detects the software-generated object as a real lane, or (b) existing lanes are removed from the image, thereby deceiving the model [25,33,34].
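
The distinction between the two attack outcomes can be made operational by comparing detections on the clean and the manipulated image. The following sketch assumes a simple detection format and a label-plus-IoU matching rule; both are illustrative choices, and the function names are hypothetical.

```python
from typing import List, Tuple

Det = Tuple[str, float, float, float, float]  # assumed: (label, x_min, y_min, x_max, y_max)


def box_iou(a: Tuple[float, ...], b: Tuple[float, ...]) -> float:
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def same_object(a: Det, b: Det, thr: float) -> bool:
    """Two detections match if they share a label and overlap sufficiently."""
    return a[0] == b[0] and box_iou(a[1:], b[1:]) >= thr


def diff_detections(clean: List[Det], attacked: List[Det], thr: float = 0.5):
    """Split the changes into vanished and fabricated detections."""
    vanished = [c for c in clean if not any(same_object(c, a, thr) for a in attacked)]
    fabricated = [a for a in attacked if not any(same_object(c, a, thr) for c in clean)]
    return vanished, fabricated
```

Under this matching rule, a misclassification such as a STOP sign reported as a speed limit sign appears in both lists, once as a vanished STOP sign and once as a fabricated speed limit sign.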

In autonomous vehicles, the persistent deterioration in the effectiveness of camera-based detection is not an isolated problem, but a source of uncertainty that propagates through the entire vehicle control process, meaning that faulty detection also affects route planning. According to Yang Li, not all perception errors are equally serious, because detecting an obstacle in front of the vehicle too late is significantly more critical to route planning than an error concerning an object behind the vehicle. This effect is exacerbated in adverse weather and nighttime conditions, as the effectiveness of individual object detection algorithms is significantly reduced in fog, rain, snow, and at night [34]. Only mud-related sensor contamination was considered in this work; additional factors such as precipitation, fog, or lighting variability were outside the scope of the study.

Shin et al. [35] present a study on robust object detection in harsh autonomous-driving environments, motivated by the fact that perception systems in real traffic are exposed to multiple sources of degradation, including camera-related effects, vehicle motion, adverse weather, and system noise. The paper argues that deep learning-based vision models are highly vulnerable to perturbations and therefore require dedicated robustness-oriented design beyond standard object detection settings. In this context, the authors focus on improving detection reliability under noisy and visually degraded conditions typical of autonomous driving, framing robustness not as a secondary property but as a central requirement for practical deployment. Overall, the study contributes to the broader literature by emphasizing that object detection performance in autonomous driving must be assessed not only under ideal benchmark conditions but also under challenging real-world perturbations that directly affect perception quality and operational safety [35].

Xie and Cheng (2026) [36] proposed DFA-YOLO, an enhanced YOLOv11-based object detector designed to improve robustness under adverse weather and other visually degraded conditions. Their method introduces three main modifications: a Deformable Spatial Attention (DSA) module embedded in the backbone to adaptively focus on informative spatial regions and better handle geometric variation and partial occlusion; a Hierarchical Multi-Scale Fusion Module (HMFM) to strengthen feature integration across object scales; and an improved Wasserstein loss formulation to better address boundary ambiguity and scale sensitivity, especially for small or degraded targets. Overall, the study approaches adverse condition perception primarily from an architectural optimization perspective, arguing that robustness can be improved by making the detector more adaptive to spatial distortion, multi-scale variation, and feature degradation rather than relying solely on preprocessing or dataset-side compensation [36].

Sahingoz et al. (2026) [37] provide a PRISMA-compliant systematic literature review of YOLO architectures from v1 to v12 in healthcare applications, with the aim of synthesizing how the YOLO family has evolved and how these models have been applied across medical diagnostic tasks. The review shows that YOLO-based methods have been widely adopted in areas such as lesion detection, tumor identification, organ and anatomical structure analysis, and image-guided clinical support, primarily because they offer a favorable balance between detection accuracy and computational efficiency. At the same time, the authors emphasize that the progression from earlier to more recent YOLO variants reflects a broader shift toward improved multi-scale feature extraction, better small-object sensitivity, stronger efficiency, and greater adaptability to domain-specific requirements. Overall, the study highlights that YOLO has become an important methodological framework in healthcare imaging, while also pointing out persistent challenges related to dataset quality, annotation consistency, explainability, and clinical generalizability, which remain critical barriers to reliable real-world deployment [37].

Poor input quality caused by different weather conditions and camera contamination significantly reduces the detection efficiency and increases the uncertainty of autonomous vehicle sensors, which can lead to delayed or even incorrect decisions. The literature emphasizes that adverse environmental effects such as rain, fog, snow, nighttime conditions, or lens contamination significantly impair the stability and reliability of visual perception. Consequently, the impact of these factors cannot be considered merely a perception problem, but can be interpreted as a factor that directly affects the functional safety of autonomous vehicles. Based on the above, the investigation of sensor-level degradation is essential to understanding the extent to which the decline in perception performance affects the stability, predictability, and operational safety of autonomous systems [38]. Among the possible environmental factors affecting perception, this study investigates the impact of mud contamination exclusively.

It is particularly important to examine the robustness of different object detection models against weather conditions and camera contamination, as a deterioration in detection performance has a direct impact on the predictability and operational safety of vehicle behavior. The increased detection uncertainty resulting from sensor-level degradation not only reduces the accuracy of object recognition, but also adversely affects the functioning of higher-level autonomous functions, in particular, environmental perception, route planning, decision-making, and vehicle control. As a result, lane tracking accuracy, obstacle avoidance capabilities, and the reliability of acceleration and braking interventions may deteriorate. The research is therefore not only relevant from an image processing perspective, but also contributes to a deeper understanding of the extent to which sensor performance degradation affects the reliability, stability, and functional safety of the entire operational chain of autonomous systems [39].

Farhani et al. (2025) [40] propose an enhanced YOLO-based traffic sign detector for operation in complex visual environments, with the stated goal of improving robustness through the combined use of multi-scale attention mechanisms and Transformer-based feature modeling. The work is framed around the challenge that traffic sign detection in real driving scenes must remain reliable despite scale variation, cluttered backgrounds, and visually demanding conditions, and it addresses this by strengthening feature extraction and fusion within the YOLO framework rather than relying on a standard baseline detector alone. Overall, the study can be interpreted as an architecture-oriented contribution to robust traffic sign detection, emphasizing that improved attention to salient regions and stronger multi-scale contextual representation are key to maintaining detection performance in complex traffic scenes [40].

Tahir et al. (2024) [41] provide a comprehensive review of object detection in autonomous vehicles under adverse weather, with a particular focus on how both traditional computer vision methods and deep learning-based approaches perform when perception is degraded by rain, fog, snow, haze, low illumination, and related environmental challenges. The paper places object detection within the broader context of autonomous vehicle architecture and emphasizes that reliable perception of vehicles, pedestrians, and road lanes is a prerequisite for safe real-world deployment. A central contribution of the review is its structured comparison of traditional feature-engineering-based pipelines and modern deep learning detectors, highlighting the growing importance of data-driven methods due to their stronger adaptability and feature-learning capability in dynamic scenes. At the same time, the authors stress that no single approach fully resolves the difficulties posed by adverse weather, and they identify the need for more robust detection systems, better datasets, and improved evaluation practices to support dependable autonomous driving in challenging environmental conditions [41].

Specifically, the effectiveness of detection in vehicle control primarily affects the quality of lateral control, trajectory tracking, and obstacle avoidance. According to the literature on vision-based lateral control, perception delay and visual uncertainty directly impair lane-keeping accuracy and can even pose a safety risk because the vehicle control algorithm responds to a delayed image of the environment rather than the actual current state. The reliability of detection therefore has a direct impact on whether the planned trajectory is dynamically feasible, safe, and within the road boundaries. MPC-based approaches address this relationship by considering trajectory planning and vehicle control in a unified manner, taking into account static and dynamic obstacles as well as vehicle parameters [42].

Overall, it can be concluded that detection efficiency is a key factor in vehicle control, as it directly affects lane keeping, steering interventions, speed control, braking, obstacle avoidance, vehicle stability, and the precise operation of control algorithms. However, perception errors are not only a problem at the perception level, but can also cause information loss and various system errors during the operation of the entire autonomous system. As a result, inaccurate or uncertain perception results can also have a negative impact on higher-level decision-making processes. The propagation of such errors can reduce the predictability of vehicle behavior, impair the accuracy of the vehicle control system, and overall reduce the reliability of the system’s operation. Based on the above, detection efficiency is not only an indicator of the performance of the perception subsystem, but also one of the fundamental conditions for the full functional operation of an autonomous vehicle.

2.3. Neural Networks

2.3.1. YOLOv8

Muhammad Y. (2024) [43] presents a detailed description of the YOLOv8 neural network, which uses a CSPNet-based backbone to extract features more efficiently. The FPN + PAN neck improves multi-scale object recognition. YOLOv8 has shifted from anchor-based detection to an anchor-free prediction model that directly predicts the center and size of the object. This simplifies and accelerates prediction, simplifies training, provides flexibility, and helps generalize to different object types and proportions. YOLOv8 also introduces technical innovations such as the C2f module (a new type of convolutional block that optimizes information flow between layers) and the “decoupled head”, that is, separate branches for classification and bounding-box regression, which helps improve prediction accuracy. Data augmentation methods used during training, such as mosaic or mixup, make the model more robust to different environmental conditions and thus contribute to good generalization ability. The model shows outstanding accuracy and real-time performance on various benchmarks, such as Microsoft COCO and Roboflow 100. YOLOv8 is not limited to object detection: the framework also supports tasks such as instance segmentation and keypoint detection, making it a versatile model. Usability was also taken into account during development: the model is available as a unified Python package with a command line interface (CLI), which simplifies training and fine-tuning. The models are available in different sizes (nano/small/medium/large/extra-large), enabling scalable performance: smaller variants are easier to deploy on resource-constrained devices, while larger variants provide maximum accuracy on more powerful systems. YOLOv8 provides higher accuracy (mAP) while maintaining real-time inference speed [43].

2.3.2. YOLO World

The goal of YOLO-World is to overcome a limitation of traditional YOLOv8-based object detectors, namely that recognition is tied to fixed, predefined categories, by enabling detection and recognition based on open-vocabulary principles. The model uses vision-language integration: it combines visual features and natural language prompts, enabling it to detect objects that it has not explicitly seen during training. One of the key components of the architecture is the so-called Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN), which enables text and image features to be merged efficiently and compared semantically. The architecture also includes a deep learning model that identifies the most relevant images for a given query. YOLO-World follows a “prompt-then-detect” paradigm: the user enters natural language prompts (e.g., “red car,” “tree”), and the model searches for objects accordingly. YOLO-World is designed for real-time operation: high-resolution open-vocabulary detection can run at up to approximately 52 FPS on a V100 GPU, which is a significant advantage over other models. The system is not only suitable for detection, but can also be fine-tuned for downstream tasks, such as instance segmentation. Overall, YOLO-World offers a favorable compromise: it retains the speed and efficiency of classic YOLO networks while also recognizing arbitrary object categories that are not predefined, thus providing a much more flexible and scalable solution for a variety of applications [44].

2.3.3. YOLO NAS

The goal of YOLO-NAS is to find the optimal detection network using automated Neural Architecture Search (NAS): rather than relying only on a pre-designed structure, the model is adapted through search to the best trade-off between speed, accuracy, and hardware constraints. The search method, AutoNAC in the developers' implementation, examines a wide range of possible architectures (varying, for example, the number of layers, block type, and number of channels) within a predefined “search space” and removes potential design constraints using an intelligent search strategy. The model is available in different size categories (small/medium/large: YOLO-NAS-s, YOLO-NAS-m, YOLO-NAS-l), so users can choose according to their needs and hardware capabilities, whether the target is a mobile/edge device or a high-performance GPU server. Pre-trained YOLO-NAS models (e.g., those published by Deci AI) have been trained on large, well-known datasets such as COCO, Objects365, and Roboflow 100. YOLO-NAS runs with relatively high detection accuracy and low latency, which is critical for real-time applications (embedded systems, edge devices, autonomous vehicles). As a result of architecture search and quantization-friendly design, YOLO-NAS typically offers a more favorable accuracy vs. latency trade-off than classic YOLO models. YOLO-NAS is a milestone in the evolution of object detection: it uses automatic architecture search instead of human design, takes advantage of quantization-friendly structures and modern training methods, and thus offers better scalability, efficiency, and real-world deployability compared to classic YOLO models [45].

2.3.4. YOLOv11

The main methodological innovation of the YOLOv11 neural network is the application of the C3k2 block, the SPPF module, and the C2PSA mechanism to further improve the efficiency of the network. Thanks to the integration of these modules, YOLOv11 supports instance segmentation and oriented object detection (OBB) in addition to object detection. The C2PSA block (Convolutional block with Parallel Spatial Attention) applies spatial attention to feature maps, allowing the network to focus better on important areas, which is particularly useful in complex images that contain many different features and objects. YOLOv11 achieves better average detection accuracy (mAP) and better computational efficiency than its predecessors, thus offering a more effective compromise between accuracy and speed. The model is scalable: versions range from nano (for edge/embedded systems) to extra-large (for high-performance systems), making it applicable from resource-constrained devices to servers. The developers paid special attention to ensuring that the number of parameters and the complexity of the model did not increase significantly compared to previous models. YOLOv11 is therefore well suited for real-time applications (e.g., automatic video analysis, live camera image processing, UAV monitoring) where fast and accurate detection is important, even if the image is complex, with many objects or many different types of objects [46].

2.3.5. YOLOv12

The YOLOv12 neural network is built around attention mechanisms while maintaining the real-time speed of previous CNN-based YOLO models. An attention mechanism determines which parts of the input the network should focus on to generate the correct output. Feature extraction is performed by an improved backbone: the Residual Efficient Layer Aggregation Network (R-ELAN) structure optimizes aggregation and gradient flow between layers, enabling more stable learning and more efficient recognition. YOLOv12 also uses 7 × 7 separable convolutions, which provide a large receptive field with a relatively low parameter count and computational cost. The developers removed traditional positional encoding and fine-tuned the MLP (feed-forward) ratio, which helps keep the combination of attention and convolution efficient and fast. According to benchmark results, YOLOv12 achieves higher detection accuracy (mAP) than previous models while retaining real-time inference speed: for example, the smallest “N” variant achieves approximately 40.6% mAP with a latency of 1.64 ms on a T4 GPU, which is 2.1% better than YOLOv10-N at similar speed. YOLOv12 is also more efficient than comparable detectors: for example, YOLOv12-S is 42% faster and uses only 36% of the computation of similar RT-DETR models. Overall, YOLOv12 combines the advantages of the classic YOLO philosophy (single-step detection, fast inference, scalability) with attention-based, higher-level object recognition [47].
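To make the notion of attention more concrete, the following is a minimal NumPy sketch of generic scaled dot-product attention. It is a textbook illustration only and does not reproduce YOLOv12's own attention module; the function name and the toy inputs are hypothetical.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Generic attention: weight the rows of v by the similarity of q to k.

    q: (n, d) queries, k: (m, d) keys, v: (m, d_v) values.
    Returns the attended output (n, d_v) and the attention weights (n, m).
    """
    q = np.asarray(q, dtype=np.float64)
    k = np.asarray(k, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                  # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row maximum for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v, weights


# Toy example: 2 queries attending over 3 key/value pairs.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
k = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
v = np.arange(12, dtype=np.float64).reshape(3, 4)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)
```

Each row of the weight matrix sums to one, so the output for a query is a weighted average of the value rows, with larger weights on the positions the query is most similar to.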

2.3.6. RT-DETR

The goal of RT-DETR is to be an end-to-end transformer-based system: unlike classic two- or multi-stage object detection systems, a single trained model directly outputs bounding boxes and classes at inference. RT-DETR eliminates the traditional Non-Maximum Suppression (NMS) post-processing step, which slows down many real-time object detectors, enabling true, continuous real-time detection. The network architecture is hybrid: it combines a convolutional backbone with a hybrid encoder, which extracts and processes multi-scale features from the input image. In addition to the encoder–decoder architecture, RT-DETR uses an IoU-aware query-selection module that selects the object queries used for detection, thereby increasing inference efficiency and accuracy. RT-DETR is designed to be scalable, and it works not only with “natural” images but also in workshop and industrial environments. In the reported industrial application (weld seam inspection), detection performance is excellent: mAP⁵⁰ is very high (~0.996) and mAP^50–95 reaches 0.801, indicating that the model accurately localizes and classifies defects at different IoU thresholds. Inference speed is also outstanding: the tested framework ran at 67 FPS, which is sufficient for real-time applications such as production line quality control. Various data augmentation algorithms and training strategies were used to ensure that the model generalizes well to different circumstances and defect types. RT-DETR thus represents an important step forward in modern object detection: it combines the transformer-based, end-to-end paradigm with real-time processing and can be used in industrial environments where it was previously often necessary to choose between speed and accuracy. The transformer-based architecture of RT-DETR is capable of capturing long-range contextual relationships, giving it an advantage over classical methods in applications where many similar objects (e.g., vehicles, pedestrians, traffic signs) appear in distorted perspectives or are partially obscured. This is particularly advantageous in urban traffic or at intersections, where accurate localization and type identification are critical for decision-making. Fast, accurate, and reliable inference also contributes to the stability of predictive decision-making, even at high speeds [48].
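For context, the NMS post-processing step that RT-DETR avoids can be sketched as follows. This is a generic greedy NMS in plain Python, given only to show what the step does in conventional detectors; the box format (x1, y1, x2, y2), the function names, and the 0.5 overlap threshold are assumptions made for the illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(detections, iou_thr=0.5):
    """Keep the most confident box and drop lower-scoring boxes that overlap it.

    detections: list of (confidence, box) tuples for a single class.
    Returns the kept detections, sorted by decreasing confidence.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[0], reverse=True):
        # A detection survives only if it does not overlap any kept box too much.
        if all(iou(det[1], k[1]) <= iou_thr for k in kept):
            kept.append(det)
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # near-duplicate of the first box (IoU about 0.68)
    (0.7, (50, 50, 60, 60)),  # separate object
]
print(greedy_nms(detections))
```

Because every candidate is compared against the boxes already kept, the cost of this step grows with the number of raw detections, which is one reason an end-to-end detector that emits a final set of boxes directly can be attractive for real-time use.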

2.4. aiMotive aiSim Software

The aiMotive aiSim (v5.10.1) software is a simulation program developed specifically for testing ADAS and autonomous vehicle technology solutions in a virtual environment. The system also allows the efficiency of sensors used in autonomous technologies to be tested. As part of this research, the object recognition performance of camera-based detectors was analyzed in this environment. aiSim is capable of high-fidelity sensor simulation: various vehicle sensor configurations can be modeled, including cameras, LiDARs, ultrasonic sensors, and radars. Environmental conditions such as rain, fog, snow, and sunlight intensity can be changed dynamically, and road surface characteristics (asphalt quality, lane markings, road defects) can also be simulated, which helps to test the effectiveness of sensor systems in near-real conditions. This makes it possible to test when and how the range or reliability of a given sensor type deteriorates under different conditions. Within the scope of this study, however, the experimental analysis was limited exclusively to mud contamination; other potentially influential environmental factors, such as rain, fog, or lighting variations, were not considered. The aiSim software allows multiple sensors to be simulated simultaneously in the same scenario (e.g., camera + LiDAR + radar), so that the robustness of sensor combinations (fusion) can be tested in conditions such as rain, strong sunlight, or fog, which is crucial for testing and developing autonomous vehicles. aiSim supports both Software-in-the-Loop (SiL) and Hardware-in-the-Loop (HiL) testing, meaning that virtual sensor data can be generated to examine how the software and sensors would respond to specific weather conditions in real traffic.

The software includes a large 3D toolbox and map/vehicle asset library with multiple road surfaces, urban and highway environments, vehicle types, pedestrians, traffic signs, and other traffic elements that contribute to creating realistic and varied simulations. Third-party sensor models can be integrated using the aiSim “Sensor API,” which provides a high degree of flexibility for testing special hardware configurations. aiSim runs on both Windows and Linux systems and supports industry standards and interoperable formats/interfaces such as OpenDRIVE, OpenSCENARIO, OpenCRG, OpenMATERIAL, ROS2, and FMU. This compatibility is particularly important for integrating the simulation environment with various ADAS components, devices, and software. aiSim is modular software that provides C++/Python APIs for managing the sensors, vehicles, and scenarios built into the simulation, as well as an open SDK that allows faster and easier integration into existing development and testing tool chains.

Simulations created in the software are repeatable: a test situation can be run multiple times using the same scenario and will always yield the same result, which is critical when testing and validating autonomous vehicles. The software can be used to examine which sensors perform more reliably in conditions such as heavy rain, snowfall, or dense fog. The scope of the present analysis, however, is restricted to mud contamination; other adverse environmental effects that may influence perception performance were not examined. The aiSim software is also well suited to systematic and comparative testing of various neural network-based object detection solutions, such as YOLOv8, YOLO World, YOLO NAS, YOLOv11, YOLOv12, and RT-DETR, because it allows different object detection neural networks to be tested in a uniform and controlled virtual environment. The annotated images and sensor data generated by the aiSim software can also be used directly to train or fine-tune neural networks, for example during domain adaptation, when individual models are adapted to a real traffic system.

For these reasons, the aiSim software is well suited to both research tasks and industrial use, owing to its comprehensive feature set, its compliance with standards, and its realistic simulation capabilities. The software helps to ensure that individual sensors and neural networks can be effectively compared and their performance evaluated in a simulation environment.

3. Results

3.1. Proposed Dataset

A custom synthetic dataset was generated using the aiSim simulation platform developed by aiMotive to evaluate the performance and robustness of neural network-based object detectors under both clean and contaminated visual conditions. All images were rendered at a resolution of 1280 × 720 pixels with RGB color channels. The dataset comprises six distinct simulated scenes, each representing a different type of urban or suburban traffic environment. Illustrative examples from each scene are shown in Figure 1.

A summary of the number of images contained in each scene is presented in Figure 2, providing an overview of the dataset size distribution across the six simulated environments. For every scene, annotations were created for two object categories: pedestrian and vehicle. The six scenes are summarized below, following the same order as the example images (Figure 1) provided:

Scene 1—Commercial Parking Area (686 images):
A commercial parking-lot environment with adjacent retail buildings, landscaped vegetation, moderate vehicle presence, and occasional pedestrians.

Scene 2—Suburban Roundabout (683 images):
A suburban multilane roundabout with traffic signs, continuous vehicle flow, and surrounding residential and commercial structures.

Scene 3—Urban Downtown Corridor (812 images):
A dense downtown street canyon lined with multi-story buildings, narrow sidewalks, crosswalks, and typical inner-city traffic patterns.

Scene 4—Snowy Urban Artery (707 images):
A major urban road under heavy snowfall, with reduced visibility, snow-covered surfaces, multiple vehicles, and public-transport elements such as bus stops.

Scene 5—Residential Hillside Street (2500 images):
A sloped suburban residential neighborhood with light traffic, overhead utility lines, and houses lining both sides of the roadway.

Scene 6—Residential Street in Rain (1900 images):
The same residential area as Scene 5, but captured under rainy conditions, resulting in wet asphalt, strong reflections, and increased vehicle density.

To systematically examine neural network robustness, each scene was generated in both clean and contaminated variants. For every clean image set, 10 additional contaminated versions were produced using mud deposition effects at incrementally increasing severity levels. The contamination was generated in a stochastic manner, resulting in varying spatial distributions and intensities across the images. The resulting dirtiness appears as irregular, non-uniform patches that partially or fully occlude the underlying scene, mimicking realistic mud splashes on a vehicle-mounted camera. Depending on the severity level, the contamination ranges from semi-transparent smearing, which primarily reduces contrast, to fully opaque regions that completely block visual information.

The level of contamination was quantified by comparing each contaminated image with its corresponding clean reference. Since the contaminated sequences were generated from the same underlying scene configurations, a one-to-one correspondence exists between clean and contaminated frames. The dirtiness mask was therefore computed individually for each image pair, ensuring per-image estimation of the contamination level. Since the internal operation of the aiSim contamination generation process is not explicitly known, a data-driven approach was applied. Consequently, the dirtiness percentage reflects the proportion of image regions significantly affected by visual degradation, capturing both the spatial extent and the effective occlusion caused by the contamination.

Let I_{c}(x, y) and I_{d}(x, y) denote the clean and contaminated RGB images, respectively, defined over the image domain Ω = \{(x, y) \mid 1 \leq x \leq W, 1 \leq y \leq H\}, where W is the width of the image and H is the height of the image. Both images were first converted to grayscale using a standard luminance transformation:

Y(x, y) = 0.299 \cdot R(x, y) + 0.587 \cdot G(x, y) + 0.114 \cdot B(x, y)    (1)

This transformation was applied to both images, yielding grayscale representations Y_{c}(x, y) and Y_{d}(x, y). Next, the absolute difference between the two grayscale images was computed:

∆(x, y) = |Y_{d}(x, y) - Y_{c}(x, y)|    (2)

Based on this difference image, a binary contamination mask D(x, y) was defined:

D(x, y) = \begin{cases} 1, & ∆(x, y) > τ \\ 0, & \text{otherwise} \end{cases}    (3)

Finally, the overall contamination level (referred to as dirtiness) was computed as the ratio of contaminated pixels to the total number of pixels:

Dirtiness = \frac{\sum_{(x, y) \in Ω} D(x, y)}{|Ω|}    (4)
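Equations (1)–(4) can be sketched in a few lines of NumPy. This is an illustrative reading of the procedure rather than the authors' code: the function name is hypothetical, and because the value of the threshold τ is not stated in the text, the default `tau=25.0` (on a 0–255 luminance scale) is purely an assumption.

```python
import numpy as np


def dirtiness(clean_rgb, dirty_rgb, tau=25.0):
    """Fraction of pixels whose luminance differs by more than tau.

    clean_rgb, dirty_rgb: H x W x 3 RGB arrays of the same shape.
    tau: difference threshold; the default is an assumed value, not from the paper.
    """
    # Convert to float first so that uint8 subtraction cannot wrap around.
    clean = np.asarray(clean_rgb, dtype=np.float64)
    dirty = np.asarray(dirty_rgb, dtype=np.float64)
    if clean.shape != dirty.shape or clean.shape[-1] != 3:
        raise ValueError("expected two H x W x 3 RGB arrays of equal shape")

    weights = np.array([0.299, 0.587, 0.114])  # luminance weights, Eq. (1)
    y_clean = clean @ weights
    y_dirty = dirty @ weights

    delta = np.abs(y_dirty - y_clean)          # absolute difference, Eq. (2)
    mask = delta > tau                         # binary contamination mask, Eq. (3)
    return float(mask.mean())                  # contaminated pixels / all pixels, Eq. (4)


# Toy example: a uniform 4 x 4 image with a darkened 2 x 2 corner.
clean = np.full((4, 4, 3), 200, dtype=np.uint8)
dirty = clean.copy()
dirty[:2, :2] = 40                             # 4 of 16 pixels are "covered"
print(dirtiness(clean, dirty))                 # 0.25
```

Because the mask is thresholded on luminance change, a semi-transparent smear counts as contaminated only where it alters brightness by more than τ, which matches the description of dirtiness as the proportion of significantly affected image regions.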

Examples of the contamination levels are presented in Figure 3, illustrating how increasing mud deposition progressively reduces visibility and contrast, while also affecting the spatial distribution of the contamination, as later quantified by the Center Concentration Index (CCI) defined in Section 3.1 (Equation (10)). The distribution of contamination severity across the individual contaminated subsets for each scene is shown in Figure 4. The figure provides a scene-wise overview of the percentage of image area covered by mud, highlighting both the gradual increase in contamination levels and the variability between scenes. This results in 11 subsets per scene (1 clean + 10 contaminated). The clean subsets contain a total of 7288 images, while the contaminated subsets collectively comprise approximately 72,880 images, providing a comprehensive basis for controlled robustness analysis. Table 1 provides a detailed overview of the number of images, the number of images containing objects, and the distribution of annotated pedestrian and vehicle instances across the six scenes.
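As a quick consistency check on the dataset size, the per-scene image counts listed above can be summed. The symbols below are introduced here only for the calculation, and the second and third lines assume that every clean frame has exactly ten contaminated counterparts, as the one-to-one frame correspondence suggests.

```latex
\begin{align*}
N_{\text{clean}} &= 686 + 683 + 812 + 707 + 2500 + 1900 = 7288, \\
N_{\text{contaminated}} &= 10 \times N_{\text{clean}} = 72{,}880, \\
N_{\text{total}} &= 11 \times N_{\text{clean}} = 80{,}168.
\end{align*}
```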

3.2. Performance Analysis

This section presents a comprehensive evaluation of six state-of-the-art object detection model families under clean and environmentally contaminated imaging conditions. The investigated architectures include YOLOv8 [43], YOLOv8-Worldv2 [44], YOLO NAS [45], YOLOv11 [46], YOLOv12 [47] and RT-DETR [48], each represented by multiple model variants with different parameter counts.

All models were evaluated using their official Ultralytics [49] implementations and publicly available pretrained weights, originally trained on the COCO dataset [50,51]. As a result, the examined networks were primarily optimized for detecting real-world objects, including vehicles and pedestrians captured in natural imaging conditions. In contrast, the dataset generated using the aiSim simulation environment does not contain real vehicles or human subjects, but instead relies on computer graphics–based synthetic representations. While these objects aim to achieve a high level of visual realism, they do not necessarily match the appearance of real-world vehicles and pedestrians in all visual details.

Consequently, beyond measuring robustness to environmental contamination, the evaluation also provides insight into the generalization capability of detection models trained on real-world data when applied to synthetic domains. This aspect is particularly relevant for simulation-based validation pipelines, where pretrained perception models are often deployed without additional domain-specific fine-tuning.

During the evaluation, only pedestrian and vehicle classes were considered. All detections belonging to other object categories were excluded from the analysis to ensure a focused and consistent comparison across models and datasets. No fine-tuning or additional training was performed in this study; all models were used in a purely inference-based evaluation to ensure a fair comparison. While the exact hyperparameter configurations used during the original COCO training are not fully disclosed for all architectures, Ultralytics models generally follow standardized training pipelines involving multi-scale training, extensive data augmentation, and architecture-specific optimization strategies. Each model variant was evaluated on both the clean and contaminated versions of the proposed dataset using four performance metrics: F1-score, mAP⁵⁰, mAP^50–95, and recall. The mAP⁵⁰ and mAP^50–95 metrics correspond to IoU thresholds of 0.5 and 0.5–0.95 (step 0.05), respectively, following a custom implementation based on the COCO protocol. Predictions were sorted by confidence and matched to ground truth using the highest IoU, ensuring that each ground truth instance was assigned at most once. Precision–recall curves were implicitly constructed from the ranked predictions, and Average Precision (AP) was computed as the area under the curve.
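The matching and AP computation described above can be sketched as follows for a single class and a single image. This is a simplified illustration, not the authors' implementation: the function names and box format (x1, y1, x2, y2) are hypothetical, predictions from a whole dataset would normally be pooled before ranking, and the all-point integration of the precision–recall curve is an assumption, since the exact interpolation scheme is not specified in the text.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thr=0.5):
    """AP for one class. preds: list of (confidence, box); gts: list of boxes."""
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    hits = []  # 1 for a true positive, 0 for a false positive, in ranked order
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        # Pick the not-yet-matched ground truth with the highest IoU.
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            overlap = iou(box, gt)
            if not matched[j] and overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= iou_thr:
            matched[best_j] = True  # each ground truth is assigned at most once
            hits.append(1)
        else:
            hits.append(0)

    # Precision and recall after each ranked prediction.
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / len(gts))

    # Make precision non-increasing, then integrate over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def ap_50_95(preds, gts):
    """Mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [
    (0.9, (0, 0, 10, 10)),    # exact match
    (0.8, (50, 50, 60, 60)),  # false positive
    (0.7, (20, 20, 30, 28)),  # IoU 0.8 with the second ground truth
]
print(round(average_precision(preds, gts, 0.5), 3))  # 0.833
print(average_precision(preds, gts, 0.9))            # 0.5
```

In the toy example the third prediction counts as a true positive at the 0.5 threshold but not at 0.9, which is why the stricter mAP^50–95 metric penalizes loosely localized boxes more than mAP⁵⁰ does. Averaging the per-class AP over the pedestrian and vehicle classes would then give the mAP values.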

Figure 5 shows the mAP^50–95 performance of the evaluated models on the COCO dataset as a function of the number of model parameters. Each data point corresponds to one model variant from the investigated architectures.

Figure 6 presents the detection results obtained on the clean version of the dataset for all evaluated model families and variants. The performance is shown as a function of the number of model parameters using four metrics: F1-score, mAP⁵⁰, mAP^50–95, and recall.

As shown in Figure 6a, the F1-score generally increases with model size across most architectures. YOLOv11 and YOLOv12 achieve the highest F1-scores in the mid- to large-scale model range, while YOLOv8 exhibits a more gradual increase. YOLO NAS shows lower F1-scores compared to the other YOLO-based models, and RT-DETR yields consistently lower values within the evaluated parameter range.

Figure 6b summarizes the mAP⁵⁰ results. All model families demonstrate a clear improvement with increasing parameter count. Larger variants of YOLOv11, YOLOv12, and YOLO NAS achieve comparable mAP⁵⁰ values, while YOLOv8-Worldv2 shows slightly lower performance across the examined sizes. RT-DETR achieves relatively high mAP⁵⁰ values despite its limited set of model variants.

The mAP^50–95 results are shown in Figure 6c. A consistent upward trend can be observed for all architectures as model size increases. YOLOv11, YOLOv12, and YOLOv8 reach similar performance levels for larger models, while YOLO NAS exhibits a slightly lower but steadily improving trend. RT-DETR achieves competitive mAP^50–95 values compared to similarly sized models.

Recall values for the clean dataset are presented in Figure 6d. Recall increases with model size for all evaluated architectures. YOLO NAS and RT-DETR achieve the highest recall values among the larger models, while YOLOv8-Worldv2 shows lower recall across the examined parameter range. The remaining YOLO-based architectures exhibit comparable recall performance for medium and large model variants.

Figure 7 presents the detection results obtained on the contaminated version of the dataset for all evaluated model families and variants. The performance is shown as a function of the number of model parameters using the same four metrics as in the clean-data evaluation.

As shown in Figure 7a, the F1-score generally increases with model size across most architectures, similarly to the clean dataset. YOLOv11 and YOLOv12 achieve the highest F1-scores for medium and large models, while YOLOv8 exhibits a more moderate increase. YOLO NAS shows lower F1-scores across the evaluated sizes, and RT-DETR achieves the lowest values among the compared architectures.

Figure 7b summarizes the mAP⁵⁰ results under contaminated conditions. All model families demonstrate improved performance with increasing parameter count. YOLOv11, YOLOv12, and YOLO NAS reach comparable mAP⁵⁰ values for larger models, while YOLOv8-Worldv2 remains consistently lower across the examined parameter range. RT-DETR achieves relatively high mAP⁵⁰ values compared to other models with similar parameter counts.

The mAP^50–95 results are shown in Figure 7c. A steady increase in performance can be observed for all architectures as model size increases. Larger variants of YOLOv8, YOLO11, and YOLO12 reach similar mAP^50–95 values, while YOLO NAS exhibits a slightly lower but gradually increasing trend. YOLOv8-Worldv2 shows lower mAP^50–95 values compared to the other evaluated architectures.

Recall values for the contaminated dataset are presented in Figure 7d. Recall increases with model size for all model families. RT-DETR and YOLO NAS achieve the highest recall values among the larger models, while YOLOv8-Worldv2 shows lower recall and a less consistent trend across the examined parameter range. The remaining YOLO-based models exhibit comparable recall performance for medium and large variants.

Figure 8 illustrates the performance difference between the clean and contaminated datasets for all evaluated model families and variants. The results are presented as absolute differences for F1-score, mAP⁵⁰, mAP^50–95, and recall as a function of the number of model parameters.

As shown in Figure 8a, the largest decrease in F1-score occurs for the smallest model variants across all architectures. The performance gap generally decreases with increasing model size. YOLOv8-Worldv2 exhibits a relatively higher F1-score difference for medium and large models compared to the other architectures, while YOLO NAS and RT-DETR show smaller differences in the higher parameter range.

Figure 8b presents the difference in mAP⁵⁰. The performance gap is highest for small models and decreases as model size increases for most architectures. YOLOv8-Worldv2 and RT-DETR maintain a relatively stable mAP⁵⁰ difference across the evaluated parameter range, while YOLO11 and YOLO12 show a more pronounced reduction in the gap for larger models.

The mAP^50–95 differences are shown in Figure 8c. All architectures exhibit the largest performance drop for small model variants, followed by a gradual reduction as model size increases. YOLOv8, YOLO11, and YOLO12 converge to similar mAP^50–95 differences for larger models, while YOLO NAS shows a slightly lower gap in the mid- to large-scale range. RT-DETR exhibits a nearly constant difference across its evaluated variants.

Recall differences are presented in Figure 8d. A substantial recall decrease is observed for the smallest models, while larger variants show a reduced performance gap. YOLOv8-Worldv2 exhibits an increasing recall difference for the largest model variant, whereas the remaining architectures display more consistent trends with relatively smaller differences in the higher parameter range.

The relative robustness of the evaluated models was quantified using the Robustness Index (RI), which aggregates the normalized performance degradation across multiple evaluation metrics.

For the metric set

$$M = \{\mathrm{F1},\ \mathrm{mAP}^{50},\ \mathrm{mAP}^{50\text{–}95},\ \mathrm{Recall}\} \tag{5}$$

a normalized metric value was computed for every $m \in M$ as

$$\Delta m^{\mathrm{norm}} = \frac{m_{\mathrm{clean}} - m_{\mathrm{dirty}}}{m_{\mathrm{clean}}} \tag{6}$$

where $m_{\mathrm{clean}}$ denotes the performance measured on the clean dataset and $m_{\mathrm{dirty}}$ denotes the corresponding performance on the contaminated dataset.

Based on these normalized degradation values, the Robustness Index (RI) was defined as

$$\mathrm{RI} = 1 - \frac{1}{|M|} \sum_{m \in M} \Delta m^{\mathrm{norm}} \tag{7}$$

A Robustness Index value of $\mathrm{RI} = 1$ indicates unchanged performance under contamination across all evaluated metrics, while lower values correspond to increasing performance degradation. Values slightly above $1$ may occur if certain metrics improve under contaminated conditions.
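The Robustness Index defined in Equations (5) to (7) can be illustrated with a minimal Python sketch. The function names, the dictionary keys, and the numeric scores are assumptions made for illustration only; they are not taken from the authors' implementation or from the reported results.

```python
from typing import Mapping

# Metric set M from Equation (5); the key names are illustrative.
METRICS = ("f1", "map50", "map50_95", "recall")


def normalized_degradation(clean: float, dirty: float) -> float:
    """Equation (6): relative drop of one metric from clean to contaminated data."""
    if clean <= 0:
        raise ValueError("the clean-data metric must be positive")
    return (clean - dirty) / clean


def robustness_index(clean: Mapping[str, float], dirty: Mapping[str, float]) -> float:
    """Equation (7): one minus the mean normalized degradation over all metrics."""
    drops = [normalized_degradation(clean[m], dirty[m]) for m in METRICS]
    return 1.0 - sum(drops) / len(drops)


# Placeholder scores chosen for easy arithmetic, not measurements from the study.
clean_scores = {"f1": 0.80, "map50": 0.80, "map50_95": 0.50, "recall": 0.80}
dirty_scores = {"f1": 0.60, "map50": 0.60, "map50_95": 0.25, "recall": 0.60}

print(round(robustness_index(clean_scores, dirty_scores), 4))  # 0.6875
```

With these placeholder values the per-metric drops are 0.25, 0.25, 0.50, and 0.25, so the mean degradation is 0.3125 and RI = 0.6875. Identical clean and contaminated scores give RI = 1, and a metric that improves under contamination pushes RI above 1, as noted in the text.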

Based on the RI values shown in Figure 9, both model size and architectural design have a significant influence on detection robustness under contaminated conditions. A consistent robustness–complexity trend can be observed across most evaluated model families. The smallest variants of YOLOv8, YOLO11, and YOLO12 exhibit the lowest robustness indices, indicating pronounced sensitivity to visual contamination despite their competitive performance under clean conditions. Increasing model capacity within these families leads to substantial robustness gains, with medium-sized variants already achieving robustness levels close to those of the largest models, while further scaling yields diminishing returns. RT-DETR demonstrates inherently high robustness across its evaluated configurations but at the cost of substantially higher parameter counts, limiting its applicability in resource-constrained settings. YOLOv8-Worldv2 shows moderate robustness that does not scale proportionally with model size, suggesting that open-vocabulary capabilities may introduce additional vulnerability under environmental degradation. In contrast, YOLO NAS consistently achieves high robustness even at moderate model sizes, indicating a favorable architectural balance. Overall, the results indicate that medium-complexity models provide the most effective trade-off between robustness and model size.

Figure 10 illustrates the average Robustness Index (RI) of the evaluated model families as a function of contamination level. A clear global decreasing trend can be observed across all architectures: as the level of contamination increases, detection performance gradually deteriorates. Although the curves exhibit local fluctuations, the overall tendency is consistent, indicating that increasing visual degradation reduces the robustness of all investigated models.

In terms of relative performance, the YOLO11 model family consistently achieves the highest RI values across nearly all contamination levels. This indicates that YOLO11 provides the most robust overall behavior among the evaluated detector families, maintaining superior performance even under increasingly adverse visual conditions. While the other model families follow similar global trends, their RI values remain systematically below those of YOLO11, especially at higher contamination levels.

At the same time, the relationship between contamination level and robustness is not strictly monotonic. A particularly noticeable drop in RI can be observed around the 12% contamination level across multiple model families. This suggests that the degradation cannot be explained solely by the overall amount of contamination, and that another factor must also contribute to the observed performance loss.

To capture the spatial distribution of contamination, a Gaussian-based Center Concentration Index (CCI) was introduced. Let $D(x, y)$ denote the binary contamination mask, where

$$D(x, y) = \begin{cases} 1, & \text{if pixel } (x, y) \text{ is contaminated} \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

Let $(x_c, y_c)$ denote the image center. A Gaussian weighting function is then defined as

$$w(x, y) = \exp\left(-\frac{(x - x_c)^2 + (y - y_c)^2}{2\sigma^2}\right) \tag{9}$$

where $\sigma$ controls the spatial extent of the central emphasis. Based on this weighting, the Gaussian-based CCI is computed as

$$\mathrm{CCI} = \frac{\sum_{(x, y) \in \Omega} D(x, y)\, w(x, y)}{\sum_{(x, y) \in \Omega} w(x, y)} \tag{10}$$

where $\Omega = \{(x, y) \mid 1 \leq x \leq W,\ 1 \leq y \leq H\}$ represents the discrete image domain of image width $W$ and image height $H$.

This formulation assigns larger weights to contaminated pixels closer to the image center and smaller weights to those located toward the periphery. As a result, the metric not only quantifies how much contamination is present, but also how strongly it is concentrated around the central region of the image.
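A minimal Python sketch of Equations (8) to (10) is given here as an illustration. It assumes that the mask is a list of rows holding 0 or 1, that pixel coordinates are 1-based as in the definition of the image domain, and that the image center lies at ((W + 1) / 2, (H + 1) / 2). The function names and the example values of σ are hypothetical; the text does not state the σ used in the study.

```python
import math
from typing import Sequence


def gaussian_weight(x: float, y: float, xc: float, yc: float, sigma: float) -> float:
    """Equation (9): weight that decays with distance from the image center."""
    return math.exp(-((x - xc) ** 2 + (y - yc) ** 2) / (2.0 * sigma ** 2))


def center_concentration_index(mask: Sequence[Sequence[int]], sigma: float) -> float:
    """Equation (10): Gaussian-weighted share of contaminated pixels."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("mask must contain at least one pixel")
    if sigma <= 0:
        raise ValueError("sigma must be positive")

    # Assumed center of the 1-based pixel grid.
    xc = (width + 1) / 2.0
    yc = (height + 1) / 2.0

    weighted_dirty = 0.0
    total_weight = 0.0
    for y, row in enumerate(mask, start=1):
        for x, value in enumerate(row, start=1):
            w = gaussian_weight(x, y, xc, yc, sigma)
            total_weight += w
            if value:  # D(x, y) = 1 for contaminated pixels, Equation (8)
                weighted_dirty += w

    if total_weight == 0.0:
        raise ValueError("sigma is too small: all weights underflowed to zero")
    return weighted_dirty / total_weight


# Tiny synthetic masks: one contaminated pixel at the center vs. in a corner.
center_mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
corner_mask = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
print(center_concentration_index(center_mask, sigma=1.0))
print(center_concentration_index(corner_mask, sigma=1.0))
```

By construction the index is 0 for a clean mask and 1 for a fully contaminated one. The plain nested loop keeps the sketch free of dependencies; for full-resolution images a vectorized implementation would be the more practical choice.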

Figure 11 shows the variation in the CCI as a function of contamination level. In contrast to the smoother global trend of the RI curves, the CCI values fluctuate considerably across the contamination range. Most importantly, a clear increase in CCI is visible around the same contamination level, approximately 12%, where the RI curves exhibit a marked drop.

The comparison of Figure 10 and Figure 11 suggests that the stronger performance degradation observed around 12% contamination is associated with a higher spatial concentration of the contamination. This contamination level coincides with a peak in the Center Concentration Index (CCI), while nearby levels (e.g., around 10% or 14%) exhibit lower CCI values. Because the spatial distribution of contamination was generated stochastically, the CCI does not follow a monotonic relationship with the overall dirtiness level. The results nevertheless indicate that when contamination happens to be more concentrated near the image center, it can lead to disproportionately larger performance degradation than cases with similar overall contamination but a more peripheral distribution. Model robustness is therefore influenced not only by the magnitude of contamination but also by its spatial placement, so a global dirtiness measure alone may not be sufficient to fully characterize performance degradation, and spatially aware metrics such as the Gaussian-based CCI can provide additional insight. This observation is consistent with our hypothesis that contamination located near the image center has a more pronounced impact on detection performance. It also offers a plausible explanation for the pronounced performance gap observed around the 12% contamination level, which appears to be driven by spatial concentration effects rather than by the overall amount of contamination alone.
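The distinction between the amount of contamination and its placement can be made concrete with a small synthetic example. The sketch below builds two hypothetical masks with exactly the same dirtiness level, one with a contaminated square at the image center and one with the same square in a corner, and evaluates the Gaussian-weighted index for both. The mask size, the patch positions, and the choice of σ as a quarter of the shorter image side are all assumptions for illustration and do not reproduce the contamination patterns used in the study.

```python
import math


def coverage(mask):
    """Fraction of contaminated pixels (the global dirtiness level)."""
    return sum(sum(row) for row in mask) / (len(mask) * len(mask[0]))


def cci(mask, sigma):
    """Gaussian-weighted share of contaminated pixels, following Equation (10)."""
    height, width = len(mask), len(mask[0])
    xc, yc = (width + 1) / 2.0, (height + 1) / 2.0  # assumed center of the 1-based grid
    num = den = 0.0
    for y, row in enumerate(mask, start=1):
        for x, dirty in enumerate(row, start=1):
            w = math.exp(-((x - xc) ** 2 + (y - yc) ** 2) / (2.0 * sigma ** 2))
            den += w
            num += w * dirty
    return num / den


def square_patch(height, width, top, left, side):
    """Binary mask with one contaminated square; purely synthetic input."""
    return [[int(top <= r < top + side and left <= c < left + side)
             for c in range(width)] for r in range(height)]


HEIGHT, WIDTH, SIDE = 60, 80, 20
SIGMA = 0.25 * min(HEIGHT, WIDTH)  # assumed value; the text does not state sigma

central = square_patch(HEIGHT, WIDTH, top=20, left=30, side=SIDE)
corner = square_patch(HEIGHT, WIDTH, top=0, left=0, side=SIDE)

for name, mask in (("central", central), ("corner", corner)):
    print(f"{name}: dirtiness={coverage(mask):.3f}, CCI={cci(mask, SIGMA):.3f}")
```

Both masks cover the same share of the image, yet the centrally placed patch receives a clearly higher index than the corner patch. This is the property that lets a spatially aware measure separate cases that a global dirtiness percentage treats as identical.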

4. Discussion

Modern driver assistance systems are significantly changing the future of driving and transportation. Drivers need to learn new skills to use these systems effectively and safely. They must understand how driver assistance systems work, their limitations, and the situations in which they are effective. These systems can improve the driving experience, increase safety, and reduce driver fatigue, but only if they are used correctly. Environmental factors such as heavy rain or fog can further affect the effectiveness of sensors and cameras, and systems may not always correctly interpret complex or unexpected traffic situations. Consequently, drivers must understand when and how to take control of the vehicle. In this paper, the evaluation is limited to the effects of mud contamination, whereas other challenging environmental conditions were not included in the experiments.

The ability to absorb and process information is a key factor in the effective use of modern driver assistance systems. Drivers must be able to quickly and efficiently process the complex and continuously changing information provided by vehicle sensors, cameras, and onboard systems. As vehicle technology advances, the amount of available information increases, requiring drivers to adapt to more complex decision-making processes. In emergency or unexpected traffic situations, rapid information processing and decision-making are particularly critical. However, the growing volume of information can also overwhelm drivers, making it essential to prioritize relevant signals and maintain focus on the most important cues. Even when driver assistance systems are active, continuous monitoring of the environment remains necessary, as drivers must be able to assume control immediately if the system fails to respond appropriately.

From a broader methodological perspective, the evaluation of camera-based perception under contamination is important not only because it reveals a potential source of detection uncertainty, but also because it contributes to a more realistic understanding of the operational limitations of autonomous driving systems. In real traffic environments, sensor degradation does not occur in isolation, but interacts with scene complexity, dynamic objects, illumination variability, and other sources of uncertainty. Therefore, even controlled studies focusing on a single contamination type can provide valuable insight into how perception reliability may be challenged in practice. At the same time, the use of a synthetic and fully controllable simulation environment makes it possible to isolate specific degradation effects in a reproducible way, thereby offering a useful methodological basis for subsequent validation studies under more complex real-world conditions.

The future of mobility depends on the successful integration of technological innovation and human adaptation. Developments in driver assistance systems and autonomous driving technologies hold great promise for improving road safety and reducing traffic-related injuries and fatalities. The acceptance and proper use of advanced vehicle technologies require drivers to remain open to innovation while understanding both the benefits and the limitations of these systems. The development and deployment of driver assistance systems and autonomous vehicles involve numerous complex decisions and require careful consideration of diverse traffic scenarios, including rare and hazardous situations. Continued research, rigorous evaluation under realistic conditions, and responsible system integration are essential steps toward achieving safer road transport in the future.

5. Conclusions and Future Directions

As part of this research, object detection models from multiple neural network families with varying complexity were evaluated to assess detection robustness under realistic visual contamination. The results indicate that lower-complexity variants within each model family often appear sufficiently effective under ideal evaluation conditions, such as those represented by the COCO dataset. Under clean imaging conditions, even the smallest models achieve competitive performance across standard detection metrics, which may initially suggest their suitability for practical applications. However, the experiments conducted on contaminated data demonstrate that this apparent effectiveness does not persist under environmental degradation. The smallest model variants consistently exhibit the most pronounced performance degradation across all evaluated metrics, including F1-score, mAP⁵⁰, mAP^50–95, and recall, indicating a high sensitivity to lens contamination and visual noise. In contrast, medium-complexity models show substantially smaller performance drops and maintain more stable detection accuracy across contaminated scenarios. Further increases in model complexity lead to diminishing robustness gains, suggesting that excessive scaling alone does not proportionally improve resistance to visual degradation. These observations confirm that model selection based solely on clean benchmark performance can be misleading, and that medium-complexity architectures provide a more reliable trade-off between efficiency and robustness in realistic environmental conditions.

A consistent correlation can be observed between robustness and model complexity. Overall, the YOLO11 model family exhibited the most stable robustness behavior across the different contamination levels, while YOLO NAS demonstrated a favorable balance between robustness and model size, in contrast to RT-DETR’s high parameter requirements. Changes in detection efficiency are determined not only by the degree of contamination but also by its spatial distribution: contamination concentrated closer to the center of the image was associated with greater degradation, a finding supported by the combined analysis of the CCI and RI curves. This suggests that measuring the “contamination level” alone is not sufficient to determine the robustness of individual models; indicators that also take into account the spatial structure of the contamination are required.

The main findings of this study may be summarized as follows:

Mud contamination led to a measurable reduction in object detection performance for all investigated detector families;
The smallest model variants were the most vulnerable to visual degradation, although they remained competitive on clean benchmark data;
Medium-complexity architectures offered the most balanced compromise between robustness and computational scale;
YOLO11 showed the most stable robustness trend across contamination levels, whereas YOLO NAS provided a favorable robustness-to-complexity ratio;
Increasing parameter count beyond a certain level yielded only limited additional robustness benefits;
The spatial concentration of contamination, especially near the image center, proved to be an important factor in performance degradation in addition to the overall degree of contamination.

One of the main limitations of the study is that the tests were conducted in a controlled simulation environment using synthetic datasets, so the generalizability of the results may be affected by the greater complexity and variability of real road environments. Although synthetic datasets are highly controllable and reproducible, they cannot fully replicate the complex effects that occur in real-world road environments, such as changing lighting conditions, sensor noise, dynamic background elements, or the actual pattern of dirt. Accordingly, the reported detection performance values should be interpreted primarily as comparative benchmarks under identical conditions, rather than as comprehensive real-world validation.

A natural extension of this work is to analyze detection degradation under different weather and contamination conditions using the same metrics (F1-score, mAP, recall) on our own recordings made on real roads. This would enable direct validation of the proposed approach and support the calibration of the simulated dirtiness metric against real-world visual degradation. Further research should also repeat the investigation at different times of day, under varying traffic conditions, and broken down by object class, in order to explore the sensitivity of detection performance in real-world environments in greater detail. It is particularly important to analyze false negative and false positive error types separately, as these can affect the reliability and efficiency of autonomous system decision-making to varying degrees. Together, these steps would allow the simulation results to be confirmed in a real environment and critical object classes and error types to be identified more accurately. An important long-term goal of this research is the validation of the proposed approach on embedded platforms, particularly on NVIDIA Jetson-based systems integrated into our autonomous robot and vehicle platforms. This will enable real-time evaluation of selected models under realistic operating conditions and provide further insight into their practical applicability.

Author Contributions

Conceptualization, D.C. and J.H.; methodology, J.H. and D.C.; software, J.H.; validation, J.H.; formal analysis, J.H. and D.C.; investigation, D.C.; resources, D.C.; data curation, J.H.; writing—original draft preparation, D.C.; writing—review and editing, D.C. and J.H.; visualization, J.H.; supervision, J.H.; project administration, D.C.; funding acquisition, D.C. All authors have read and agreed to the published version of the manuscript.

Funding

Supported by the EKÖP-25-4-II-SZE-142 University Research Fellowship Program of the Ministry for Culture and Innovation from the source of the National Research, Development and Innovation Fund.

Data Availability Statement

Data for this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Goodin, C.; Carruth, D.; Doude, M.; Hudson, C. Predicting the Influence of Rain on LIDAR in ADAS. Electronics 2019, 8, 89. [Google Scholar] [CrossRef]
Aloufi, N.; Alnori, A.; Basuhail, A. Enhancing Autonomous Vehicle Perception in Adverse Weather: A Multi Objectives Model for Integrated Weather Classification and Object Detection. Electronics 2024, 13, 3063. [Google Scholar] [CrossRef]
Linnhoff, C.; Hofrichter, K.; Elster, L.; Rosenberger, P.; Winner, H. Measuring the Influence of Environmental Conditions on Automotive Lidar Sensors. Sensors 2022, 22, 5266. [Google Scholar] [CrossRef]
Zang, S.; Ding, M.; Smith, D.; Tyler, P.; Rakotoarivelo, T.; Kaafar, M.A. The Impact of Adverse Weather Conditions on Autonomous Vehicles: Examining how rain, snow, fog, and hail affect the performance of a selfdriving car. IEEE Veh. Technol. Mag. 2019, 14, 103–111. [Google Scholar] [CrossRef]
Heinzler, R.; Schindler, P.; Seekircher, J.; Ritter, W.; Stork, W. Weather Influence and Classification with Automotive Lidar Sensors. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019. [Google Scholar] [CrossRef]
Kim, J.; Park, B.-J.; Kim, J. Empirical Analysis of Autonomous Vehicle’s LiDAR Detection Performance Degradation for Actual Road Driving in Rain and Fog. Sensors 2023, 23, 2972. [Google Scholar] [CrossRef] [PubMed]
Diaz-Piedra, C.; Liedo, B.; de Prado-Gordillo, M.N.; Caurcel, M.J.; Di Stasi, L.L. Ethical and legal challenges of automated driving: The prioritization of socio-political values. Transp. Res. Procedia 2023, 72, 2449–2456. [Google Scholar] [CrossRef]
Chen, Z.; Zhang, Z.; Su, Q.; Yang, K.; Wu, Y.; He, L.; Tang, X. Object detection for autonomous vehicles under adverse weather conditions. Expert Syst. Appl. 2026, 296, 128994. [Google Scholar] [CrossRef]
Feng, S.; Cai, X.; Li, L.; Wang, W.; Ying, S. A review of research on vehicle detection in adverse weather environments. J. Traffic Transp. Eng. (Engl. Ed.) 2025, 12, 1452–1483. [Google Scholar] [CrossRef]
Zhang, Y.; Carballo, A.; Yang, H.; Takeda, K. Perception and sensing for autonomous vehicles under adverse weather conditions: A survey. ISPRS J. Photogramm. Remote Sens. 2023, 196, 146–177. [Google Scholar] [CrossRef]
Augustine, N.; Schwab, M.; Klarmann, S.; Pfefferer, C.; Schiendorfer, A. Impact of Blemish Artefacts on Object Detection Models in Autonomous Driving: A Study on Camera Module Manufacturing Defects. Procedia Comput. Sci. 2024, 232, 616–625. [Google Scholar] [CrossRef]
Thottempudi, P.; Jambek, A.B.B.; Kumar, V.; Acharya, B.; Moreira, F. Resilient object detection for autonomous vehicles: Integrating deep learning and sensor fusion in adverse conditions. Eng. Appl. Artif. Intell. 2025, 151, 110563. [Google Scholar] [CrossRef]
Kumar, M.; Rattan, N.; Mondal, S. Sensor systems for autonomous vehicles: Functionality and reliability challenges in adverse environmental conditions. Measurement 2026, 258, 119215. [Google Scholar] [CrossRef]
Liu, J.; Wang, Z.; Ma, L.; Fang, C.; Bai, T.; Zhang, X.; Liu, J.; Chen, Z. Benchmarking Object Detection Robustness against Real-World Corruptions. Int. J. Comput. Vis. 2024, 132, 4398–4416. [Google Scholar] [CrossRef]
Li, B.; Song, D.; Li, H.; Pike, A.; Carlson, P. Lane Marking Quality Assessment for Autonomous Driving. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018. [Google Scholar] [CrossRef]
Biermeier, S.; Kemper, D.; García-Hernandez, A. Road marking visibility for automated vehicles: Machine detectability and maintenance standards. Case Stud. Constr. Mater. 2025, 22, e04430. [Google Scholar] [CrossRef]
Chen, H.; Ali, M.A.; Nukman, Y.; Razak, B.A.; Turaev, S.; Chen, Y.; Zhang, S.; Huang, Z.; Wang, Z.; Abdulghafor, R. Computational methods for automatic traffic signs recognition in autonomous driving on road: A systematic review. Results Eng. 2024, 24, 103553. [Google Scholar] [CrossRef]
Zhao, J.; Zhao, W.; Deng, B.; Wang, Z.; Zhang, F.; Zheng, W.; Cao, W.; Nan, J.; Lian, Y.; Burke, A.F. Autonomous driving system: A comprehensive survey. Expert Syst. Appl. 2024, 242, 122836. [Google Scholar] [CrossRef]
Rani, A.R.; Anusha, Y.; Cherishama, S.; Laxmi, S.V. Traffic sign detection and recognition using deep learning-based approach with haze removal for autonomous vehicle navigation, e-Prime-Advances in Electrical Engineering. Electron. Energy 2024, 7, 100442. [Google Scholar] [CrossRef]
Zaarane, A.; Slimani, I.; Al Okaishi, W.; Atouf, I.; Hamdoun, A. Distance measurement system for autonomous vehicles using stereo camera. Array 2020, 5, 100016. [Google Scholar] [CrossRef]
Zakaria, N.J.; Shapiai, M.I.; Ghani, R.A.; Yassin, M.N.M.; Ibrahim, M.Z.; Wahid, N.; Yasin, M.N.M. Lane Detection in Autonomous Vehicles: A Systematic Review. IEEE Access 2023, 11, 3729–3765. [Google Scholar] [CrossRef]
Al Noman, A.; Li, Z.; Almukhtar, F.H.; Rahaman, F.; Omarov, B.; Ray, S.; Miah, S.; Wang, C. A computer vision-based lane detection technique using gradient threshold and hue-lightness-saturation value for an autonomous vehicle. Int. J. Electr. Comput. Eng. (IJECE) 2023, 13, 347–357. [Google Scholar] [CrossRef]
Song, J.G.; Lee, J.W. CNN-Based Object Detection and Distance Prediction for Autonomous Driving Using Stereo Images. Int. J. Automot. Technol. 2023, 24, 773–786. [Google Scholar] [CrossRef]
Turay, T.; Vladimirova, T. Towards Performing Image Classification and Object Detection with Convolutional Neural Networks in Autonomous Driving Systems: A Survey. IEEE Access 2021, 10, 14076–14119. [Google Scholar] [CrossRef]
Ibrahum, A.D.M.; Hussain, M.; Hong, J.-E. Deep learning adversarial attacks and defenses in autonomous vehicles: A systematic literature review from a safety perspective. Artif. Intell. Rev. 2025, 58, 28. [Google Scholar] [CrossRef]
Ruan, J.; Cui, H.; Huang, Y.; Li, T.; Wu, C.; Zhang, K. A review of occluded objects detection in real complex scenarios for autonomous driving. Green Energy Intell. Transp. 2023, 2, 100092. [Google Scholar] [CrossRef]
Pao, W.Y.; Li, L.; Agelin-Chaab, M. Perceived rain dynamics on hydrophilic/hydrophobic lens surfaces and their influences on vehicle camera performance. Trans. Can. Soc. Mech. Eng. 2024, 48, 543–553. [Google Scholar] [CrossRef]
Son, S.; Lee, W.; Jung, H.; Lee, J.; Kim, C.; Lee, H.; Park, H.; Lee, H.; Jang, J.; Cho, S.; et al. Evaluation of Camera Recognition Performance under Blockage Using Virtual Test Drive Toolchain. Sensors 2023, 23, 8027. [Google Scholar] [CrossRef]
Das, A. SoildNet: Soiling Degradation Detection in Autonomous Driving. arXiv 2019, arXiv:1911.01054. [Google Scholar] [CrossRef]
Shen, J.; Li, R. Vehicle Detection at Night Based on Style Transfer Image Enhancement. J. Inf. Process. Syst. 2023, 19, 663–672. [Google Scholar] [CrossRef]
Appiah, O.; Mensah, S. Object detection in adverse weather conditions for autonomous vehicles. Multimed. Tools Appl. 2024, 83, 28235–28261. [Google Scholar] [CrossRef]
Goodfellow, I.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2015, arXiv:1412.6572. [Google Scholar] [CrossRef]
Eykholt, K. Robust Physical-World Attacks on Deep Learning Visual Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
Li, W.; Yang, X. Transcendental Idealism of Planner: Evaluating Perception from Planning Perspective for Autonomous Driving. In Proceedings of the 40th International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar] [CrossRef]
Kim, Y.; Hwang, H.; Shin, J. Robust object detection under harsh autonomous-driving environments. IET Image Process. 2021, 16, 958–971. [Google Scholar] [CrossRef]
Xie, L.; Cheng, L. DFA-YOLO: Deformable Spatial Attention and Hierarchical Fusion for Robust Object Detection in Adverse Weather. Sensors 2026, 26, 2229. [Google Scholar] [CrossRef] [PubMed]
Sahingoz, O.K.; Karatas Baydogmus, G.; Kugu, E. A Systematic Literature Review of You Only Look Once Architectures (v1–v12) in Healthcare Systems. Diagnostics 2026, 16, 935. [Google Scholar] [CrossRef]
Sakaridis, C.; Wang, H.; Li, K.; Zurbrügg, R.; Jadon, A.; Abbeloos, W.; Reino, D.O.; Van Gool, L.; Dai, D. ACDC: The Adverse Conditions Dataset with Correspondences for Robust Semantic Driving Scene Perception. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 48, 2970–2988. [Google Scholar] [CrossRef]
Uricar, M.; Sistu, G.; Rashed, H.; Vobecky, A.; Kumar, V.R.; Krizek, P.; Burger, F.; Yogamani, S. Let’s Get Dirty: GAN Based Data Augmentation for Camera Lens Soiling Detection in Autonomous Driving. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; pp. 766–775. [Google Scholar] [CrossRef]
Farhani, N.; Hermassi, M.; Hajjaji, M.A.; Zrigui, M. Multi-Scale Attention and Transformer-Enhanced YOLO Architecture for Robust Traffic Sign Detection in Complex Visual Environments. Intell. Artif. 2025, 20, 29–52. [Google Scholar] [CrossRef]
Tahir, N.U.A.; Zhang, Z.; Asim, M.; Chen, J.; Elaffendi, M. Object Detection in Autonomous Vehicles under Adverse Weather: A Review of Traditional and Deep Learning Approaches. Algorithms 2024, 17, 103. [Google Scholar] [CrossRef]
Khalil, A.; Kwon, J. PLM-Net: Perception Latency Mitigation Network for Vision-Based Lateral Control of Autonomous Vehicles. Sensors 2026, 26, 1798. [Google Scholar] [CrossRef] [PubMed]
Yaseen, M. What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2408.15857. [Google Scholar] [CrossRef]
Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-Time Open-Vocabulary Object Detection. arXiv 2024, arXiv:2401.17270. [Google Scholar] [CrossRef]
Gupta, C.; Gill, N.S.; Gulia, P.; Kumar, A.; Karamti, H.; Moges, D.M. An optimized YOLO NAS based framework for realtime object detection. Sci. Rep. 2025, 15, 32903. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
Liu, G.; Yang, D.; Ye, J.; Lu, H.; Wang, Z.; Zhao, Y. A real-time welding defect detection framework based on RT-DETR deep neural network. Adv. Eng. Inform. 2025, 65, 103318. [Google Scholar] [CrossRef]
Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO, Version 8.0.0; Ultralytics: Frederick, MD, USA, 2023; Available online: https://github.com/ultralytics/ultralytics (accessed on 15 January 2026).
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Zurich, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]

Figure 1. Example images from the six simulated scenes included in the proposed dataset: (a) Commercial Parking Area, (b) Suburban Roundabout, (c) Urban Downtown Corridor, (d) Snowy Urban Artery, (e) Residential Hillside Street, (f) Residential Street in Rain.

Figure 2. Number of images per scene.

Figure 3. Illustration of the applied mud contamination levels and their corresponding spatial distribution (CCI) in the images: (a) 0.35% dirtiness and 0.22 CCI, (b) 4.48% dirtiness and 0.30 CCI, (c) 11.2% dirtiness and 0.36 CCI, (d) 16.19% dirtiness and 0.47 CCI.

Figure 4. Scene-wise distribution of contamination levels across contaminated subsets: (a) Scene 1, (b) Scene 2, (c) Scene 3, (d) Scene 4, (e) Scene 5, (f) Scene 6.

Figure 5. Performance of the selected models on the COCO dataset.

Figure 6. Performance of the evaluated neural networks on clean images across four metrics: (a) F1-score, (b) mAP⁵⁰, (c) mAP^50–95 and (d) Recall.

Figure 7. Performance of the evaluated neural networks on contaminated images across four metrics: (a) F1-score, (b) mAP⁵⁰, (c) mAP^50–95 and (d) Recall.

Figure 8. Performance degradation caused by mud contamination across four metrics: (a) ΔF1-score, (b) ΔmAP^50, (c) ΔmAP^50–95 and (d) ΔRecall.

Figure 9. Relationship between model complexity and relative robustness to visual contamination.

Figure 10. Average Robustness Index of different model families across contamination levels.

Figure 11. Center Concentration Index (CCI) across different contamination levels.

Table 1. Summary of the annotated objects across the six scenes.

Scene	Number of Images	Number of Images Containing Objects	Number of Pedestrians	Number of Vehicles
#1	686	620	1412	1192
#2	683	553	0	1661
#3	812	812	975	4277
#4	707	707	890	2346
#5	2500	2397	2452	5301
#6	1900	1846	1284	8565
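
The per-scene counts in Table 1 can be aggregated to check dataset totals and to see how many images in each scene contain at least one annotated object. The following is a minimal sketch that uses only the values from Table 1; the names `SceneStats`, `TABLE_1`, `dataset_totals` and `object_image_ratio` are hypothetical helpers introduced for illustration and are not part of the authors' tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneStats:
    """One row of Table 1 (hypothetical container, not from the paper)."""
    scene: int
    images: int
    images_with_objects: int
    pedestrians: int
    vehicles: int


# Values copied from Table 1.
TABLE_1 = [
    SceneStats(1, 686, 620, 1412, 1192),
    SceneStats(2, 683, 553, 0, 1661),
    SceneStats(3, 812, 812, 975, 4277),
    SceneStats(4, 707, 707, 890, 2346),
    SceneStats(5, 2500, 2397, 2452, 5301),
    SceneStats(6, 1900, 1846, 1284, 8565),
]


def dataset_totals(rows):
    """Sum every count column over all scenes."""
    return {
        "images": sum(r.images for r in rows),
        "images_with_objects": sum(r.images_with_objects for r in rows),
        "pedestrians": sum(r.pedestrians for r in rows),
        "vehicles": sum(r.vehicles for r in rows),
    }


def object_image_ratio(row):
    """Fraction of a scene's images that contain at least one annotated object."""
    return row.images_with_objects / row.images


if __name__ == "__main__":
    print(dataset_totals(TABLE_1))
    for row in TABLE_1:
        print(f"Scene #{row.scene}: {object_image_ratio(row):.1%} of images contain objects")
```

Under these counts the six scenes contain 7288 images in total, and Scene #2 is the only one without pedestrian annotations, which is worth keeping in mind when comparing per-class results across scenes.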


