Article

Enhancing Object Detection for Autonomous Vehicles in Low-Resolution Environments Using a Super-Resolution Transformer-Based Preprocessing Framework

by
Mokhammad Mirza Etnisa Haqiqi
1,
Ajib Setyo Arifin
1,* and
Arief Suryadi Satyawan
2
1
Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Kampus Baru UI, Depok 16424, Indonesia
2
Research Center for Telecommunication, National Research and Innovation Agency, Bandung 40135, Indonesia
*
Author to whom correspondence should be addressed.
World Electr. Veh. J. 2025, 16(12), 678; https://doi.org/10.3390/wevj16120678
Submission received: 3 November 2025 / Revised: 8 December 2025 / Accepted: 12 December 2025 / Published: 17 December 2025
(This article belongs to the Section Automated and Connected Vehicles)

Abstract

Low-resolution (LR) imagery poses significant challenges to object detection systems, particularly in autonomous and resource-constrained environments where bandwidth and sensor quality are limited. To address this issue, this paper presents an integrated framework that enhances object detection performance by incorporating a Super-Resolution (SR) preprocessing stage prior to detection. Specifically, a Dense Residual Connected Transformer (DRCT) is employed to reconstruct high-resolution (HR) images from LR inputs, effectively restoring fine-grained structural and textural information essential for accurate detection. The reconstructed HR images are subsequently processed by a YOLOv11 detector without requiring architectural modifications. Experimental evaluations demonstrate consistent improvements across multiple scaling factors, with an average increase of 13.4% in Mean Average Precision at an IoU threshold of 50% (mAP@50) at ×2 upscaling and 9.7% at ×4 compared with direct LR detection. These results validate the effectiveness of the proposed SR-based preprocessing approach in mitigating the adverse effects of image degradation. The proposed method thus offers a more accurate, though computationally heavier, solution for object detection.

1. Introduction

Object detection has become a fundamental task in computer vision [1,2,3,4], with widespread applications in autonomous driving [5,6,7,8], aerial surveillance [9,10,11], smart city systems [12,13], and robotic navigation [14,15,16]. Recent advances in deep learning have significantly improved object detection performance on high-quality image datasets [4,17,18]. Specifically, for safe and reliable autonomous navigation, onboard cameras must accurately detect crucial objects such as pedestrians, cyclists, other vehicles, and road signs, often under challenging conditions and at varying distances. However, in practical deployments, particularly in autonomous vehicles (AVs), image acquisition is often constrained by hardware limitations, bandwidth restrictions, and energy efficiency requirements. As a result, many onboard vision systems must operate on low-resolution (LR) imagery, where critical object features are either blurred or missing [19,20,21,22,23]. This degradation leads to a considerable drop in detection accuracy, especially for small or distant objects that are crucial for safe navigation and decision making in AV environments.
Low-resolution images present two primary challenges for object detection: (1) the loss of high-frequency details such as edges, textures, and fine structures, and (2) insufficient semantic information for effective feature extraction in modern detectors. Although significant research has focused on improving detection architectures through strategies such as multi-scale feature aggregation and attention mechanisms [24,25,26,27,28], these methods often overlook the fundamental issue of input image quality. Moreover, modifying detection algorithms typically increases model complexity, computational cost, and training overhead, factors that limit their practicality in real-time or resource-constrained environments like autonomous vehicles. In contrast, enhancing the input resolution prior to detection using Super-Resolution (SR) techniques offers a more general and lightweight solution. By improving visual quality before feature extraction, SR-based preprocessing can strengthen feature representation across different detectors without requiring changes to their internal architectures.
In the context of autonomous vehicles, maintaining high perception reliability is critical for ensuring safety in real-world scenarios. The increasing complexity of AV environments necessitates robust perception systems [29,30,31]. Onboard cameras often capture distant or small-scale objects under challenging conditions such as low lighting, motion blur, or fog. Additionally, the need for efficient data transmission and processing limits the use of high-resolution images, especially when multiple cameras operate simultaneously or when communication with a central processing unit occurs via bandwidth-limited channels. This constraint is particularly acute in on-road autonomous vehicles, where real-time, reliable transfer of visual information (often compressed or down-sampled) is paramount for immediate decision making and safe navigation. Consequently, autonomous vehicles frequently rely on compressed or down-sampled visual inputs, which reduce detection robustness and increase the risk of misclassification or missed detections. Therefore, improving input image resolution through a lightweight and efficient SR model becomes essential to enhance object recognition accuracy while maintaining real-time performance, as illustrated in Figure 1.
Super-Resolution refers to the process of reconstructing a high-resolution (HR) image from its low-resolution counterpart. It has gained significant traction due to the emergence of powerful deep learning models, including Convolutional Neural Networks (CNNs) [32,33,34], Generative Adversarial Networks (GANs) [35,36,37], and, more recently, Transformers [38,39,40,41]. Beyond enhancing visual quality, SR also restores structural and contextual details that are crucial for downstream perception tasks such as classification and object detection. By recovering fine-grained spatial information, SR serves not merely as an aesthetic enhancement but as a task-oriented preprocessing step that can improve overall perception reliability.
Several recent studies have explored the integration of SR into detection pipelines [42,43,44,45,46,47,48]. However, most existing approaches assume ideal HR targets or require joint training of SR and detection modules, leading to high computational and memory costs, an impractical setup for edge-based systems such as autonomous vehicles. To address this limitation, we propose a modular framework where SR is applied as a fixed preprocessing stage to enhance detection performance without retraining the detector. Specifically, we adopt the Dense Residual Connected Transformer (DRCT) [39], a lightweight Transformer-based SR model that achieves strong reconstruction quality with low computational overhead. The SR-enhanced images are then processed by conventional object detectors (e.g., YOLO [49,50,51,52,53,54,55,56]), resulting in improved localization and classification accuracy even under bandwidth-limited or degraded input conditions. Other recent methods attempt to mitigate the low-resolution issue by modifying detection architectures, such as incorporating multi-scale features and attention modules [57,58,59,60,61,62,63]. Although effective, these approaches typically increase computational burden, require retraining, and reduce flexibility across different detectors. In contrast, our proposed system introduces a modular preprocessing stage using a lightweight Transformer-based SR model (DRCT), which enhances resolution prior to detection. This design preserves structural and contextual information, improves generalizability, and avoids modifications to the detection pipeline.
The proposed integration of DRCT is particularly well suited for autonomous driving perception systems due to its balance between accuracy and efficiency. To clearly illustrate the challenges of object detection under adverse conditions and the necessity of Super-Resolution preprocessing within an autonomous vehicle’s communication and perception architecture, we include a high-level schematic. The DRCT architecture leverages dense residual connections and shift-window mechanisms to expand the receptive field while preserving fine structural details, an essential capability for recovering textures of small or distant vehicles and road features. Moreover, DRCT’s compact design allows for deployment on edge computing hardware within vehicles without significantly increasing latency or energy consumption. By enhancing resolution before detection, the DRCT-based preprocessing stage mitigates the limitations of low-quality visual inputs, thereby improving the reliability and robustness of perception systems critical to autonomous navigation.
To validate the effectiveness of the proposed framework, extensive experiments are conducted comparing object detection performance on original LR images versus their SR-enhanced counterparts. Evaluation metrics such as mAP, Intersection over Union (IoU), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM) are employed to assess improvements in both image quality and detection accuracy. The results consistently demonstrate that the DRCT-based preprocessing pipeline significantly enhances detection performance under low-resolution and bandwidth-constrained conditions, highlighting its potential for real-world autonomous driving applications.
The key contributions of this paper are summarized as follows:
  • We formulate the problem of object detection in low-resolution imagery and demonstrate how it can be mitigated by introducing Super-Resolution as a preprocessing step.
  • We empirically show that SR preprocessing improves object detection performance without requiring modifications to the detection model, particularly within autonomous vehicle vision systems.
  • We propose an efficient Transformer-based SR architecture, the Dense Residual Connected Transformer (DRCT), which enhances image resolution while preserving contextual and structural detail for robust object detection.
The rest of the paper is organized as follows: Section 2 presents the literature review, Section 3 describes the problem formulation, and Section 4 details the proposed methodology. Section 5 discusses the experimental setup and results. Finally, Section 6 concludes the paper and outlines future research directions.

2. Literature Review

2.1. Super-Resolution on Computer Vision

Image super-resolution (SR) is a classical and long-standing problem in computer vision, aimed at reconstructing a high-resolution (HR) image from a low-resolution (LR) counterpart. Early methods were interpolation-based, such as bicubic [64] or Lanczos interpolation [65], which often produced overly smooth results and failed to recover high-frequency details. The emergence of deep learning, particularly Convolutional Neural Networks (CNNs), marked a significant breakthrough in SR performance.
One of the pioneering deep learning-based models, SRCNN [32], demonstrated the feasibility of learning an end-to-end mapping from LR to HR images using CNNs. Follow-up models such as EDSR [66] introduced deeper architectures and residual learning strategies to enhance image fidelity. While effective, CNN-based SR methods generally suffer from limited receptive fields and lack of long-range context modeling.
To address these limitations, Generative Adversarial Networks (GANs) were introduced to the SR domain. SRGAN [37] and ESRGAN [35] achieved perceptually convincing results by optimizing perceptual and adversarial losses, generating high-frequency details that resemble natural images. However, GAN-based SR methods are often unstable during training and may produce artifacts that are detrimental to downstream vision tasks.

2.2. Transformer-Based Super-Resolution

The advancement of Vision Transformers in SR has recently inspired their exploration as powerful pre-processing modules to enhance performance in downstream tasks, particularly object detection under low-resolution conditions. Vision Transformers have gained attention in the SR community for their ability to capture both local and global dependencies. Swin-IR [40,41] leveraged shifted window self-attention to balance efficiency and representational power. The Hybrid Attention Transformer (HAT) [38] further improved performance by introducing hierarchical attention modules and overlapping patch embeddings, achieving state-of-the-art results on benchmark datasets. DRCT [39] integrates dense residual learning with shifted window attention and was specifically selected for its balance of high reconstruction quality and computational efficiency, making it well suited as a modular SR pre-processor for real-time object detection in autonomous vehicles.
An important related work applied the Hybrid Attention Transformer (HAT-L) as SR preprocessing to boost YOLOv8 performance on the DOTA dataset [67]. Its findings underscore the general efficacy of Transformer-based SR in improving object detection accuracy on low-quality input imagery. However, that work focuses primarily on static aerial surveillance images, requiring high-resolution reconstruction for large object scales. In contrast, our study targets the real-time, dynamic perception needs of autonomous vehicles on roads, emphasizing a lightweight and modular approach suitable for resource-constrained edge devices.

2.3. Object Detection

Conventional object detectors such as YOLO [51,52,53,55,56], SSD [59,68] and Faster R-CNN [69] achieve high performance on high-resolution datasets like MS COCO [17,18] and PASCAL VOC [18,70]. However, when applied to LR inputs, their accuracy degrades significantly due to the absence of fine-grained details and insufficient spatial resolution. This issue is especially prominent for small object instances, where key features become indistinguishable.
Several approaches have been proposed to adapt object detection to LR conditions. Some works incorporate multi-scale feature fusion, while others modify anchor box strategies or train specialized detectors on down-sampled datasets. However, these solutions often require retraining or architectural changes to the detection model.
A promising alternative is to enhance the input resolution using SR before applying standard models. Studies such as [21,22,23,71] have demonstrated that SR preprocessing can improve accuracy in classification and detection tasks, but most existing works rely on CNN- or GAN-based SR models, which may not sufficiently recover structural details or may introduce artifacts. Furthermore, some methods rely on joint training of SR and detection modules, which increases complexity and limits modularity.

2.4. Contribution of This Work

While existing works have shown the potential of SR to improve object detection in low-resolution settings, most approaches suffer from limited scalability, suboptimal reconstruction quality, or require joint optimization. Specifically, while the work in [67] validated the use of Transformer SR for aerial images, our methodology addresses the distinct challenges of autonomous driving perception, focusing on the swift detection of smaller, highly dynamic objects (pedestrians, cars) and prioritizing a highly efficient, standalone preprocessing module (DRCT) that minimizes latency impact. Our proposed method introduces a modular pipeline that applies a Transformer-based SR model, specifically DRCT, as a standalone preprocessing step. This approach enables the reuse of existing detection models without retraining while significantly improving detection performance. By recovering perceptually and semantically rich HR representations from LR inputs, our framework bridges the domain gap and enhances detection robustness in resource-constrained environments. The remaining research gap is summarized in Table 1.

3. Problem Formulation

Object detection in low-resolution (LR) imagery remains a significant challenge in computer vision, particularly in real-world applications where sensor limitations, transmission bandwidth constraints, or storage optimization often lead to reduced image quality. In the context of autonomous vehicles (AVs), LR imagery is a direct consequence of optimizing the onboard sensor suite for efficiency and reducing the data load transmitted over bandwidth-limited vehicle bus networks (e.g., edge computing on autonomous vehicles [75,76]). This constraint is fundamental to the AV perception problem, as the resulting loss of visual fidelity directly impacts safety-critical tasks, leading to potential misdetection of distant or small objects. Under these conditions, crucial high-frequency visual information such as object boundaries, textures, and small object instances is either severely degraded or completely lost. Consequently, conventional object detection models trained on high-resolution (HR) images experience substantial performance degradation when applied directly to LR inputs.
Let $I_{LR} \in \mathbb{R}^{H \times W \times C}$ denote such a low-resolution image, where $H$ and $W$ denote the height and width of the image, and $C$ is the number of color channels. The corresponding ground-truth object annotations are denoted by $y = \{(b_i, c_i)\}_{i=1}^{N_o}$, where $b_i$ is the bounding box, $c_i \in \mathcal{C}$ is the class label of the $i$-th object, and $N_o$ is the number of objects in the image. An object detection model is defined as a function $D_\phi(\cdot): \mathbb{R}^{H \times W \times C} \rightarrow \hat{y}$, parameterized by weights $\phi$, that maps an input image to a set of predicted bounding boxes and class labels,
$\hat{y} = D_\phi(I)$,
When applied directly to $I_{LR}$, the performance of $D_\phi$ is suboptimal, especially in terms of precision, recall, and mAP,
$mAP\left(D_\phi(I_{LR})\right) < mAP\left(D_\phi(I_{HR})\right)$,
To mitigate this, we introduce a Super-Resolution (SR) module as a preprocessing stage. The SR function $f_\theta: \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{sH \times sW \times C}$, parameterized by $\theta$, aims to reconstruct a high-resolution image $I_{SR}$ from $I_{LR}$, where $s$ is the upscaling factor. The upscaling factor $s$ is chosen based on standard AV simulation benchmarks and sensor data compression rates, typically using $s = 4$ to simulate the extreme loss of detail encountered by AVs under high compression or at long distances. This mapping is formalized as,
$I_{SR} = f_\theta(I_{LR})$,
The enhanced image $I_{SR}$ is subsequently fed into the object detection model,
$\hat{y} = D_\phi\left(f_\theta(I_{LR})\right)$,
The objective of this work is to find an optimal SR function $f_\theta$ such that the composed detection pipeline maximizes detection performance on LR inputs,
$\theta^{*} = \arg\max_{\theta} \; mAP\left(D_\phi\left(f_\theta(I_{LR})\right)\right)$,
subject to,
$I_{SR} = f_\theta(I_{LR}) \quad \text{and} \quad PSNR(I_{SR}, I_{HR}) \geq \delta$,
where $\delta$ is a minimum acceptable reconstruction quality threshold, ensuring the perceptual fidelity of the reconstructed image. Additionally, we assume that the object detector $D_\phi$ remains fixed (i.e., not re-trained) and only the SR module is optimized to enhance detection on low-resolution inputs. The constraint $PSNR(I_{SR}, I_{HR}) \geq \delta$ ensures that the super-resolution module not only enhances object detection accuracy but also maintains a minimum acceptable perceptual quality threshold. This prevents the model from generating unrealistic or excessively distorted reconstructions that could mislead the detection pipeline. By incorporating PSNR into the optimization objective, the framework guarantees a balance between detection performance and reconstruction fidelity, thereby ensuring that the enhanced low-resolution inputs remain visually consistent with their high-resolution counterparts.
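As a minimal illustration of this formulation, the following PyTorch-style sketch applies a fixed SR module before a frozen detector and monitors the PSNR constraint; the names sr_model, detector, and delta are hypothetical placeholders for the trained DRCT network, the YOLO detector, and the fidelity threshold, not the released implementation.

```python
import torch

def psnr(x, y, max_val=1.0):
    # Peak Signal-to-Noise Ratio between two image tensors scaled to [0, max_val]
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

@torch.no_grad()
def detect_with_sr(sr_model, detector, lr_image, hr_reference=None, delta=30.0):
    """Apply the fixed SR stage, then the frozen detector D_phi."""
    sr_model.eval()
    sr_image = sr_model(lr_image)                  # I_SR = f_theta(I_LR)
    if hr_reference is not None and psnr(sr_image, hr_reference) < delta:
        # Reconstruction violates PSNR(I_SR, I_HR) >= delta
        print("Warning: reconstruction fell below the fidelity threshold delta")
    return detector(sr_image)                      # y_hat = D_phi(f_theta(I_LR))
```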
To visually anchor this formulation within the application context, Figure 1 illustrates the operational flow and the relationship between the key parameters and the AV perception challenge. Figure 1 not only illustrates the overall operational flow of our proposed solution but also serves as a high-level representation of the AV perception pipeline, where the low-resolution image $I_{LR}$ originates from bandwidth constraints, and the Super-Resolution function $f_\theta$ acts as the crucial compensatory mechanism prior to the fixed detector $D_\phi$.

4. Methodology

The proposed methodology aims to enhance object detection performance in low-resolution imagery by introducing a preprocessing stage based on a Super-Resolution (SR) approach. The overall processing pipeline is illustrated in the flow diagram. First, the targeted object in the environment is captured by the onboard camera, which produces a low-resolution (LR) image stream due to bandwidth constraints on the autonomous vehicle platform. The LR image feed is subsequently enhanced by the lightweight super-resolution (SR) module based on the DRCT architecture, generating a high-resolution (HR) perception input. This SR-refined image is then forwarded to the fixed perception detector (e.g., YOLO) to perform object localization and classification. The resulting detection output provides decision-ready information, which is transmitted to the vehicle’s control unit to support safe and reliable navigation. This modular design is specifically optimized for deployment on resource-constrained autonomous vehicles. The framework consists of two main stages, as shown in Figure 1, which depicts the overall architecture of object detection with Super-Resolution as preprocessing and models the real-time onboard perception pipeline where bandwidth-limited sensor output is enhanced prior to safety-critical decision making.
It is important to note that the proposed SR to OD pipeline operates entirely on the onboard perception system of the autonomous vehicle. The low-resolution images are generated directly by the camera due to internal bandwidth and computational constraints, without undergoing any wireless transmission process. Therefore, no external channel model or communication-induced environmental impairment is involved in this workflow, and the degradation addressed in this study strictly pertains to native sensor-induced low resolution.

4.1. Super-Resolution Based Approach

Let $I_{LR} \in \mathbb{R}^{H \times W \times C}$ denote a low-resolution input image. The goal of the Super-Resolution module is to reconstruct a high-resolution counterpart $I_{SR} \in \mathbb{R}^{sH \times sW \times C}$, where $s$ is the upscaling factor. This is achieved using a mapping function $F_{SR}(\cdot\,; \theta)$ parameterized by a deep neural network,
$I_{SR} = F_{SR}(I_{LR}; \theta_{SR})$,
This function $F_{SR}$ may be realized through architectures such as CNNs [32,77], GANs [37,78,79,80], or Transformer-based networks [38,39,40], which are capable of learning complex mappings from the low- to high-resolution domain by minimizing a reconstruction loss function $\mathcal{L}_{MSE}$. The SR network is trained to minimize the reconstruction loss between the super-resolved output and the ground-truth high-resolution image $I_{HR}$, typically using the Mean Squared Error (MSE) loss.
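As a hedged sketch of this objective, the snippet below performs one optimization step that minimizes the MSE between the super-resolved output and the ground-truth HR image; sr_model and optimizer are assumed to be a DRCT-style network and a standard PyTorch optimizer provided by the caller.

```python
import torch.nn.functional as F

def sr_training_step(sr_model, optimizer, lr_batch, hr_batch):
    """One MSE-driven update of the SR mapping F_SR(.; theta_SR)."""
    sr_model.train()
    optimizer.zero_grad()
    sr_pred = sr_model(lr_batch)              # I_SR = F_SR(I_LR; theta_SR)
    loss = F.mse_loss(sr_pred, hr_batch)      # L_MSE between I_SR and I_HR
    loss.backward()
    optimizer.step()
    return loss.item()
```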
The selection of DRCT [39] as the super-resolution backbone is driven by the specific constraints of autonomous driving scenarios. Unlike generic Transformer-based models such as Swin-IR [40,41] or HAT [38], which prioritize absolute peak PSNR often at the cost of high computational latency and training instability, DRCT’s dense-residual structure offers a superior trade-off. The dense connections effectively mitigate the vanishing gradient problem, ensuring that fine spatial details critical for identifying distant traffic agents are preserved across deep layers. Moreover, this architecture demonstrates greater robustness against artifacts compared to GAN-based alternatives [35], ensuring that the reconstructed high-resolution images remain structurally faithful to the ground truth, a non-negotiable requirement for safety-critical perception systems.
The DRCT model for single-image super-resolution is structured into three main stages: shallow feature extraction, deep feature extraction via residual groups, and final image reconstruction as Figure 2. In the first stage, a single convolutional layer processes the low-resolution input to produce initial “shallow” feature maps [81]. This convolutional embedding serves to translate pixel values into a feature space for further processing. Following this, the network enters the deep feature extraction module, which consists of multiple Residual Dense Groups (RDGs) [39]. Each RDG is a stack of sub-blocks designed to refine and enhance the feature representation while preserving spatial information.
Within each RDG, the architecture alternates between Swin Transformer Layers (STLs) and convolutional layers. Each sub-block begins with an STL to capture long-range dependencies, followed by a convolution layer with a LeakyReLU activation for local feature processing [82,83]. Importantly, the outputs of these sub-blocks are connected in a dense fashion: the feature maps produced by each layer are concatenated with the inputs of subsequent layers [84]. This dense connectivity ensures that multi-level feature information is preserved and propagated throughout the block. Moreover, each RDG includes a residual skip connection that adds the group’s input to its output, forming a dense-residual connection. This combination of dense concatenation and residual addition stabilizes the flow of information and helps maintain fine spatial details across many layers [85]. As a result, even as the network deepens, early feature representations remain accessible, mitigating the information loss that can occur in very deep architectures.
The Swin Transformer Layer itself follows the standard Swin block design. Each STL applies layer normalization before its attention operation, then performs window-based multi-head self-attention (W-MSA) within shifted windows, and finally applies another normalization and a multi-layer perceptron (MLP) with nonlinear activation [86]. Each of these stages is wrapped with a residual connection. This structure enables the STL to adaptively integrate context from non-local image regions, effectively enlarging the receptive field. In practice, the windowed self-attention allows the model to focus on relevant features across the image and to complement the convolutional processing with global context. The use of layer norm and MLP within each block ensures stable training and nonlinear feature transformation, as in the original Swin Transformer design [40].
The image reconstruction stage fuses the shallow and deep features to generate the high-resolution output. In this stage, the high-frequency details extracted by the RDGs are combined with the initial low-frequency (shallow) features, often via addition, and the result is processed by one or more convolutional layers (typically with activation) to produce the super-resolved image. This fusion of features ensures that both coarse spatial information and fine details contribute to the final image. In summary, the DRCT architecture leverages dense connections and transformer-based attention within a residual framework to enrich feature representations while preserving fine spatial detail throughout reconstruction.
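To make the dense-residual wiring concrete, the sketch below shows a simplified Residual Dense Group in PyTorch. It substitutes plain global multi-head attention for the shifted-window attention of the actual DRCT and omits several details, so it should be read as an illustration of the dense concatenation and group-level residual, not as the authors' implementation.

```python
import torch
import torch.nn as nn

class SubBlock(nn.Module):
    """Simplified stand-in for one STL + conv sub-block (global attention, not Swin)."""
    def __init__(self, in_ch, growth_ch):
        super().__init__()
        self.norm = nn.LayerNorm(in_ch)
        self.attn = nn.MultiheadAttention(in_ch, num_heads=4, batch_first=True)
        self.conv = nn.Conv2d(in_ch, growth_ch, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))        # (B, HW, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        x = x + attn_out.transpose(1, 2).reshape(b, c, h, w)    # residual attention
        return self.act(self.conv(x))                           # local refinement

class ResidualDenseGroup(nn.Module):
    """Dense concatenation of sub-block outputs plus a group-level residual skip."""
    def __init__(self, channels=64, growth=32, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        ch = channels
        for _ in range(n_blocks):
            self.blocks.append(SubBlock(ch, growth))
            ch += growth                                    # dense growth of channels
        self.fuse = nn.Conv2d(ch, channels, kernel_size=1)  # compress back to base width

    def forward(self, x):
        feats = [x]
        for blk in self.blocks:
            feats.append(blk(torch.cat(feats, dim=1)))      # dense connectivity
        return x + self.fuse(torch.cat(feats, dim=1))       # dense-residual connection
```

A full DRCT model would stack several such groups between the shallow convolutional embedding and the final upsampling reconstruction head.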

4.2. Super-Resolution as a Preprocessing Stage for Object-Detection

Low-resolution imagery typically lacks crucial high-frequency details such as object contours and fine textures, which are essential for accurate object detection, especially when dealing with small or distant objects. To address this limitation, an SR model is employed to generate a high-resolution version of the input image, thereby enriching the semantic and structural content available to the object detector. Recent developments in Transformer-based SR techniques have significantly advanced the field, offering superior reconstruction quality and better preservation of contextual and perceptual information compared to conventional CNN- and GAN-based methods. Among these, two architectures have demonstrated state-of-the-art performance: the Hybrid Attention Transformer (HAT) and the Dense Residual Connected Transformer (DRCT). The HAT model introduces a hierarchical attention mechanism and overlapping patch embedding strategy, enabling the network to jointly capture both local and global dependencies across the image. This multi-scale representation facilitates the recovery of complex visual structures and textures, leading to more precise object detection in subsequent stages. The DRCT, on the other hand, is designed to enhance both the stability and capacity of information flow through a combination of dense residual connections and shifted-window attention mechanisms.

4.3. Object Detection Model

The object detection module receives the enhanced image $I_{SR}$ and outputs a set of predicted bounding boxes and corresponding class labels. The detection process is modeled as,
$\Upsilon = F_{OD}(I_{SR}; \phi)$,
where $F_{OD}$ is the object detection function parameterized by $\phi$, and $\Upsilon$ represents the predicted object set. Common object detection models used in this context include YOLO, SSD, and Faster R-CNN. The performance of the object detector is evaluated using standard metrics such as Precision, Recall, and particularly the mAP, which is defined as,
$\text{mAP} = \frac{1}{C} \sum_{i=1}^{C} AP_i$,
where $C$ is the number of object classes, and $AP_i$ is the area under the precision-recall curve for class $i$. The average precision ($AP$) for each class is computed as,
$AP_i = \int_{0}^{1} \mathrm{Precision}_i(\mathrm{Recall}) \, d\mathrm{Recall}$,
Additionally, detection results are evaluated using the Intersection over Union (IoU) to determine correct detections. A prediction is considered a true positive if its IoU with the corresponding ground-truth box is at least a threshold $\tau$, where the IoU is computed as,
$IoU = \frac{A_{pred} \cap A_{GT}}{A_{pred} \cup A_{GT}}$,
where $A_{pred}$ and $A_{GT}$ denote the predicted and ground-truth bounding box areas, and $\tau = 0.5$ is the IoU threshold used in this work.
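For completeness, the utilities below sketch how IoU and per-class AP can be computed with all-point interpolation of the precision-recall curve; they are an illustrative reference rather than the exact evaluation code used in the experiments.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]           # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is then the mean of per-class AP values:
# mAP = sum(average_precision(rc, pc) for rc, pc in per_class_curves) / num_classes
```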
The YOLOv11 object detector is organized into three main stages, the backbone, neck, and head, each consisting of specialized modules that transform input images into final predictions [87]. The backbone serves as a multi-scale feature extractor, the neck fuses and refines these features across scales, and the head produces the final bounding box and class predictions based on the processed features. The architecture diagram in Figure 3 highlights the key modules in each stage.
In the backbone, the image is first processed by a series of convolutional layers that progressively downsample the spatial dimensions while increasing channel depth. This creates hierarchical feature maps that capture low-to-high-level information. A key innovation in YOLOv11 is the introduction of the C3k2 block in the backbone. C3k2 is a variant of the Cross-Stage Partial (CSP) bottleneck that uses two smaller convolution operations instead of a single large one, along with a reduced kernel size (“k2”), to improve processing speed and parameter efficiency. In effect, C3k2 maintains representational capacity while accelerating computation. After the initial convolutions and C3k2 blocks, the backbone incorporates an SPPF (Spatial Pyramid Pooling—Fast) module to aggregate context at multiple spatial scales. The SPPF layer applies sequential max-pooling operations and concatenates each output with the original feature map, enabling the network to capture wide spatial context with minimal overhead. In YOLOv11 this is implemented to preserve detailed image structure across scales. Following SPPF, YOLOv11 adds a C2PSA (Cross-Stage Partial Self-Attention) block. This module combines CSP-style feature partitioning with a spatial self-attention mechanism: it pools the feature maps and applies learned attention weights to emphasize informative regions. By focusing on salient spatial features, the C2PSA block enhances the model’s ability to detect small or overlapping objects. In summary, the backbone produces a set of multi-resolution feature maps rich in context, thanks to its convolutional layers, efficient C3k2 blocks, an SPPF context module, and spatial-attention via C2PSA.
The neck of YOLOv11 fuses these multi-scale features from the backbone and prepares them for detection. It typically upsamples deeper (spatially coarse) feature maps and concatenates them with shallower (higher-resolution) features, forming a feature pyramid that preserves both semantic and fine-grained information. YOLOv11’s neck continues to employ the efficient C3k2 block: after each upsampling and concatenation step, a C3k2 module processes the combined feature map. This replacement of older bottleneck layers yields faster processing and a smaller parameter count while preserving feature expressiveness. In addition, the neck integrates the C2PSA attention mechanism to further improve feature quality. Specifically, a C2PSA module is applied during feature aggregation to emphasize key spatial regions within the fused maps. This spatial self-attention helps the model retain object-relevant patterns (e.g., edges or textures of small objects) as features are combined across scales. Thus, the neck outputs a set of refined feature tensors at multiple resolutions, each enriched by context from both the backbone and the C2PSA attention filters.
The head of YOLOv11 is the prediction component that takes the refined multi-scale features from the neck and generates the final detection outputs. In practice, there are multiple detection branches corresponding to different feature map scales. In each branch, YOLOv11 again uses C3k2 blocks to process the incoming feature maps efficiently. These C3k2 layers act as flexible bottlenecks: when the configuration parameter is disabled (c3k = false), they behave like standard CSP bottlenecks, and when it is enabled, they introduce an extra convolutional sub-block for deeper feature processing. Regardless, the fundamental benefit of the C3k2 block remains: two smaller convolutions with a smaller kernel preserve accuracy while reducing computation and parameters. After each C3k2 block sequence in the head, YOLOv11 applies CBS modules—that is, a convolutional layer followed by Batch Normalization and a SiLU (Sigmoid Linear Unit) activation. Each CBS layer refines the feature maps by extracting relevant patterns (via the convolution), stabilizing and scaling the activations (via batch normalization), and introducing nonlinearity through the SiLU function.
These refinements help ensure that only the most informative features are carried forward for final prediction. Finally, each detection branch ends with standard convolutional layers that map the processed features to the required output dimensions (bounding box coordinates and class probabilities). The final “Detect” layer then consolidates these outputs into object predictions: it produces the bounding box offsets, objectness scores (indicating presence of an object), and class scores for each candidate region.
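The sketch below illustrates the CBS unit and a simplified CSP-style split reflecting the C3k2 idea of two small convolutions in place of one large one; it is an approximation written for exposition and does not reproduce the Ultralytics YOLOv11 code.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic refinement unit described above."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class C3k2Like(nn.Module):
    """Simplified CSP-style block: split, refine one branch with two small 3x3
    convolutions, then concatenate and fuse (a sketch, not the official C3k2 module)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hidden = c_out // 2
        self.split_a = CBS(c_in, c_hidden, k=1)
        self.split_b = CBS(c_in, c_hidden, k=1)
        self.refine = nn.Sequential(CBS(c_hidden, c_hidden, k=3),
                                    CBS(c_hidden, c_hidden, k=3))
        self.fuse = CBS(2 * c_hidden, c_out, k=1)

    def forward(self, x):
        a = self.split_a(x)                    # bypass branch
        b = self.refine(self.split_b(x))       # two small convolutions
        return self.fuse(torch.cat((a, b), dim=1))
```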

5. Experiment and Result

5.1. Experimental Dataset

The dataset utilized in this study is the Vehicles Dataset, a subset of the Roboflow-100 open-source benchmark designed to advance research in computer vision. This dataset comprises images of various types of vehicles annotated with bounding boxes, thereby supporting object detection and related tasks. It offers substantial intra-class variation, including changes in scale, viewing angles, illumination conditions, and occlusions, which collectively contribute to the robustness of model evaluation in real-world applications. In addition, the dataset captures realistic roadway scenarios such as two-way traffic environments and diverse weather conditions, making it representative of the visual challenges commonly encountered in real-world driving contexts. The specific dataset configuration is detailed in Table 2. The dataset was divided into training, validation, and test sets.
Figure 4 illustrates the statistical distribution and spatial characteristics of the vehicle dataset used in this study. As shown in Figure 4a, the dataset contains three main object classes: bus, car, and truck, with a significant imbalance where car instances dominate the dataset. Figure 4b presents the relative scale representation of bounding boxes among these classes, indicating the variation in object sizes captured within the dataset. Figure 4c shows the spatial distribution of bounding box centers across image coordinates (x, y), demonstrating that most vehicle instances are located around the central region of the image, which aligns with typical road scene compositions. Finally, Figure 4d depicts the distribution of bounding box sizes based on object width and height, revealing that the majority of vehicles occupy relatively small areas of the image frame. These observations collectively indicate that the dataset exhibits class imbalance and scale variation, which are important considerations for model training and performance evaluation in object detection and recognition tasks.

5.2. Experimental Environment

PyTorch (version 2.0.1) is an open-source deep learning framework originally released by Facebook in 2016. It supports machine learning and deep learning tasks, offering automatic differentiation and dynamic computation graphs for greater flexibility in model development. The framework consists of two main components: the front-end, a Python API for user interaction, and the back-end, which handles internal operations, including Autograd, an engine for automatic differentiation. PyTorch was selected for its high modularity and ease of customization. Specific experimental settings are detailed in Table 3.

5.3. Quantitative Result

The experimental results presented in Table 4 illustrate the impact of applying Super-Resolution (SR) as a preprocessing step for object detection using YOLOv11. The evaluation was conducted under three scaling factors (×2, ×3, and ×4) and compared between low-resolution (LR) images directly processed by YOLOv11 and super-resolved images generated by the proposed DRCT model prior to detection. For LR images, no PSNR/SSIM values are reported since no reconstruction is performed. The performance metrics show that applying DRCT consistently improves detection accuracy across all scales. Specifically, DRCT+YOLOv11 achieved higher mAP@50 and mAP@50–95 compared to LR Images+YOLOv11, indicating that SR enhances the quality of visual features exploited by the detector. For instance, at ×2 scaling, DRCT+YOLOv11 reached 0.88219 mAP@50 and 0.59904 mAP@50–95, outperforming the LR baseline (0.77765 and 0.61094, respectively). Moreover, precision and recall values were also improved, demonstrating the effectiveness of SR in preserving discriminative information. Notably, the PSNR and SSIM scores achieved by DRCT highlight the fidelity of reconstructed images, with values of 39.05/0.9647 at ×2, 35.3/0.9346 at ×3, and 33.32/0.9089 at ×4, reflecting strong structural and perceptual consistency. These findings confirm that the integration of DRCT-based SR in the preprocessing pipeline enhances the robustness of YOLOv11 detection performance, particularly when dealing with degraded low-resolution inputs.
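As a quick arithmetic check, the relative mAP@50 gain quoted in the Abstract can be reproduced from the ×2 values above (a trivial sketch using only the reported numbers):

```python
lr_map50 = 0.77765   # LR Images + YOLOv11 at x2 (Table 4)
sr_map50 = 0.88219   # DRCT + YOLOv11 at x2 (Table 4)

relative_gain = (sr_map50 - lr_map50) / lr_map50
print(f"Relative mAP@50 gain at x2: {relative_gain:.1%}")   # about 13.4%
```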
The graphs in Figure 5 illustrate the convergence behavior of object detection performance when Super-Resolution (SR) is employed as a preprocessing step prior to YOLOv11 detection. The left plot presents the mAP at 50% IoU (mAP@50) across training epochs, while the right plot depicts the mAP at varying IoU thresholds between 50 and 95% (mAP@50–95). Each curve corresponds to different scaling factors (×2, ×3, and ×4) for both low-resolution (LR) and high-resolution (HR) super-resolved inputs. The results indicate that the integration of SR, particularly with lower scaling factors (e.g., HR ×2), significantly improves detection accuracy, yielding faster convergence and higher stability throughout training. Specifically, HR ×2 consistently outperforms LR inputs across all metrics, demonstrating superior feature preservation and enhanced discriminative capacity. In contrast, higher scaling factors such as HR ×4 exhibit reduced performance due to accumulated reconstruction artifacts, although they still surpass their LR counterparts. These findings confirm that SR preprocessing not only facilitates improved object detection accuracy but also contributes to more stable training dynamics, underscoring its effectiveness in handling degraded low-resolution imagery.
The experimental results depicted in Figure 6 provide insights into the consistency and reliability of incorporating the proposed model. The left graph illustrates the performance gap in terms of mAP@50 difference between low-resolution baselines and super-resolved inputs across scaling factors (×2, ×3, and ×4). It is observed that the gap remains consistently positive throughout training, confirming that SR-enhanced inputs outperform their low-resolution counterparts. Notably, the ×2 scale exhibits the largest and most stable performance improvement, while the ×4 scale, although beneficial, demonstrates slightly reduced gains due to the presence of reconstruction artifacts. The right graph presents the standard deviation of detection accuracy across epochs, reflecting the variability of model performance. Results show a rapid decline in variability during the initial training stages, after which the curves stabilize at low values for all scales, with ×3 and ×2 exhibiting the lowest fluctuations. This indicates that SR preprocessing not only enhances detection accuracy but also ensures stable and reliable convergence across different resolution settings, thereby validating its effectiveness in improving object detection robustness on degraded visual inputs.
Figure 7 illustrates the comparative performance of various object detection models evaluated under different super-resolution scale factors (×2, ×3, and ×4), serving as a critical validation of our proposed framework. The results clearly demonstrate that the proposed DRCT+YOLOv11n (Ours) consistently achieves superior performance compared to both baseline and existing SR-assisted detection models.
As shown in Figure 7a, the mAP@50 values for DRCT+YOLOv11n consistently outperform all other methods across all scale factors. It is critical to note the performance crossover between the baseline models: YOLOv8n initially performs better than YOLOv11n at s = 2 , but YOLOv11n surpasses it at s = 3 and s = 4 . This suggests that while YOLOv8n has slightly better native performance, YOLOv11n is more robust to the increasing feature degradation caused by higher downscaling factors. This finding underscores the general difficulty of maintaining mAP stability in conventional detectors when LR input quality severely degrades. In contrast, the DRCT module effectively reconstructs high-frequency details, leading to significantly enhanced and stable accuracy. In Figure 7b, the precision trend shows that the proposed model maintains the highest precision stability. We observe a notable crossover between the existing SR-assisted methods: MsSRGAN+YOLOv5 shows higher precision than EDSR+YOLOv3 at s = 2 , but EDSR+YOLOv3 achieves higher precision at s = 4 . This instability in precision among competing SR models suggests a trade-off between the SR method’s reconstruction quality and the subsequent detector’s ability to minimize false positives, a weakness that DRCT’s highly stable feature recovery successfully overcomes. The proposed approach remains less prone to false positives even when image resolution decreases. Furthermore, Figure 7c highlights the F1-score performance. Similar to the precision analysis, the F1-Score also exhibits a crossover between EDSR+YOLOv3 and MsSRGAN+YOLOv5, reinforcing the conclusion that these competing SR methods fail to maintain a balanced performance (precision vs. recall) as image degradation increases. DRCT+YOLOv11n achieves the most balanced results between precision and recall across all tested scale factors, confirming its robustness in handling challenging low-resolution visual inputs.
These findings reveal that the integration of the DRCT architecture with YOLOv11n significantly enhances feature representation and detection consistency across varying scales. The observed performance crossovers in competing models highlight their inherent instability under varying LR conditions. In contrast, DRCT’s dense residual connections and efficient attention mechanisms ensure stable feature recovery, providing a more reliable and scalable solution for real-time object detection in low-bandwidth or resolution-constrained environments, such as autonomous vehicle perception systems.
Figure 8 presents the comparative evaluation results of object detection models with and without super-resolution enhancement, assessed using three key performance metrics: mAP@50, Precision, and F1-Score. As shown in Figure 8a, models incorporating super-resolution techniques demonstrate a significant improvement in mAP@50 compared to the baseline model without enhancement (No SR). Specifically, the DRCT+YOLO model achieves the highest mAP@50 value, indicating superior object localization accuracy. This improvement can be attributed to the DRCT’s ability to restore high-frequency texture details and preserve structural integrity in low-resolution images, thus enabling the YOLO detector to extract more discriminative features. In Figure 8b, the precision results reveal that DRCT+YOLO maintains a higher and more stable precision range than other methods such as EDSR+YOLO and MsSRGAN+YOLO, reflecting the model’s reduced false-positive rate. The integration of the dense residual and attention mechanisms within DRCT enhances feature consistency across reconstructed images, which directly contributes to improved detection reliability.
Figure 8c shows that DRCT+YOLOv11n achieves the highest F1-score, demonstrating a balanced performance between precision and recall. This suggests that the proposed framework not only minimizes incorrect detections but also successfully identifies a greater number of true positives, resulting in a more robust detection capability under degraded image conditions. The experimental results confirm that the combination of DRCT and YOLO yields the most consistent and accurate detection outcomes among all tested methods. The integration of advanced super-resolution reconstruction with a lightweight detection network enhances both feature quality and generalization ability, offering an effective solution for real-time object detection in bandwidth-limited or resolution-constrained visual environments such as autonomous systems and IoT-based surveillance applications.

5.4. Qualitative Result

The experimental results presented in Figure 9, Figure 10, Figure 11 and Figure 12 demonstrate that the proposed object detection framework consistently outperforms the baseline in terms of convergence stability, detection reliability, and localization accuracy. In particular, the integration of super-resolution preprocessing at different scales (×2 and ×4) proves to be highly effective in preserving fine-grained structural details and enhancing the discriminative capacity of the detector. This strategy enables more accurate recognition of small and partially occluded objects, while maintaining high precision, recall, and mAP across varying IoU thresholds. These findings highlight that super-resolution not only sustains but also improves object detection parameters, establishing it as a practical and efficient preprocessing technique for robust detection in real-world, low-resolution scenarios.
In the qualitative assessment, super-resolution models, particularly those employing GAN-based [35] or perceptual-loss-driven frameworks [88], are known to introduce structural artifacts and hallucinated details that do not correspond to the true scene content. Prior studies have reported that while methods such as ESRGAN can generate visually appealing high-frequency textures, they frequently produce undesirable artifacts and inconsistent local structures. Recent analyses, including DeSRA [89], further demonstrate that SR models trained on synthetic degradations tend to generate domain-shift-induced artifacts when applied to real-world imagery. Such hallucinated patterns, fake edges, and texture distortions can mislead downstream perception tasks. In safety-critical applications such as autonomous driving, these artifacts may result in false positives, missed detections of small or distant objects, or the formation of artificial object boundaries, ultimately degrading the reliability of object-detection systems. This issue has also been highlighted in remote-sensing SR literature and imaging-reconnaissance studies, which show that artifact-induced inconsistencies can negatively affect recognition accuracy and decision making. Therefore, a comprehensive discussion of artifact types and their potential impact on autonomous-vehicle perception is essential to ensure the robustness and safety of SR-augmented detection pipelines. In several instances, particularly on low-texture surfaces and heavily blurred regions, the SR model introduced halo-like edge artifacts and hallucinated fine details that subsequently misled the detection module. While SR improved detection in most scenarios, certain complex scenes exhibited a slight drop in mAP, primarily due to noise amplification and inconsistent texture reconstruction.

5.5. Comparison of Complexity Models

The results summarized in Table 5 highlight the computational complexity of the proposed model. For LR images processed directly with YOLOv11n, the models maintain relatively small parameter counts (2.4–3.21M) and computational demands (6.1–8.72 GFLOPs), owing to the reduced input sizes (160 × 120 to 320 × 240). In contrast, the inclusion of the DRCT model for super-resolved inputs significantly increases the parameter count (≈30M) and FLOPs (≈17–19 GFLOPs), reflecting the added complexity of the SR reconstruction process prior to detection. Nevertheless, this additional computational cost enables the detector to process higher-quality inputs (up to 640 × 480), thereby improving detection accuracy, as evidenced in prior results. Overall, the table demonstrates the trade-off between lightweight LR-based detection and SR-enhanced detection, where the latter demands greater computational resources but offers improved performance and robustness in handling low-resolution imagery.

5.6. Ablation Study

Table 6 presents a comprehensive ablation study evaluating the contribution of each architectural component within the proposed DRCT integrated with YOLOv11n for low-resolution object detection. The Lower Bound case, which performs direct detection on LR inputs, produces the lowest accuracy (mAP@50 = 0.77765) but achieves the fastest inference (23.8 ms), while conventional bicubic interpolation in the Baseline offers only marginal improvement (mAP@50 = 0.7841). The full proposed DRCT-L configuration demonstrates the highest performance (mAP@50 = 0.88219), confirming the complementary benefits of dense residual and transformer modules, though at increased computational cost (35.7 ms, 30.18 M parameters). Ablation 1, excluding dense connectivity, reduces parameters and yields competitive performance (mAP@50 = 0.84029), whereas Ablation 2, removing transformer mechanisms, results in the greatest accuracy degradation (mAP@50 = 0.80335), highlighting the transformer’s role in long-range feature modeling. Ablation 3, which reduces transformer depth, maintains reasonably high detection precision (mAP@50 = 0.8211) with improved efficiency, while Ablation 4, eliminating skip connections, decreases robustness (mAP@50 = 0.79245) and increases inference time. The Upper Bound scenario, representing theoretical performance on high-resolution inputs, achieves mAP@50 = 0.85834 with minimal latency, yet the full DRCT architecture surpasses this result, demonstrating the effectiveness of the proposed super-resolution approach in enhancing detection accuracy from LR imagery.
As illustrated in Figure 13, the relationship between mAP@50 and inference time highlights a clear trend in which the ablation variants distribute along the accuracy–efficiency trade-off curve, whereas the full DRCT configuration occupies the most favorable region in the upper-right portion of the plot indicating the highest detection accuracy among all evaluated methods despite a modest increase in computational cost. The visualization further shows that SR-based approaches such as EDSR and MSSRGAN do not surpass the efficiency or accuracy achieved by the DRCT variants, and it distinctly separates the Lower Bound and Baseline cases from models capable of reconstructing finer structural details, thereby reinforcing the efficacy of the proposed architecture.

6. Conclusions

This study demonstrates that integrating Super-Resolution using the Dense Residual Connected Transformer (DRCT) as a modular preprocessing stage significantly enhances YOLOv11n object detection performance in low-resolution autonomous vehicle scenarios. By effectively recovering fine-grained details, the proposed framework achieved consistent accuracy gains, notably improving mAP@50 by approximately 13.4% (×2) and 9.7% (×4) compared to direct low-resolution detection. These results validate the framework’s effectiveness as a practical solution for enhancing perception in bandwidth-constrained environments. However, current limitations include reliance on synthetic down-sampling, the added computational overhead of the SR stage, and the separate training of the two modules. Additionally, the dataset employed in this study, while diverse, does not fully capture complex real-world challenges such as severe weather variations, nighttime motion artifacts, or highly congested multi-lane traffic conditions, factors that may influence model generalization in real deployment. To address these limitations, future work will focus on developing lightweight SR architectures robust to real-world degradations such as noise and motion blur, implementing model compression and quantization to minimize latency for edge deployment, and exploring a loosely coupled joint training scheme to optimize task-specific feedback while maintaining modular flexibility. Expanding the dataset to incorporate more diverse environmental conditions and broader traffic scenarios will also be prioritized to further strengthen the model’s robustness and external validity.

Author Contributions

Conceptualization, M.M.E.H., A.S.A. and A.S.S.; Methodology, M.M.E.H., A.S.A. and A.S.S.; Software, M.M.E.H.; Formal Analysis, M.M.E.H., A.S.A. and A.S.S.; Investigation, M.M.E.H., A.S.A. and A.S.S.; Resources, M.M.E.H., A.S.A. and A.S.S.; Data Curation, M.M.E.H.; Writing—Original Draft Preparation, M.M.E.H.; Writing—Review and Editing, M.M.E.H.; Visualization, M.M.E.H.; Validation M.M.E.H., A.S.A. and A.S.S.; Supervision, A.S.A. and A.S.S.; Project Administration, M.M.E.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to express their deepest gratitude to the Indonesia Endowment Fund for Education (Lembaga Pengelola Dana Pendidikan, LPDP) through the Riset dan Inovasi untuk Indonesia Maju (RIIM) program, as well as to the Rumah Program Purwarupa Hasil Riset Inovasi Big Data and the Degree by Research Program, all of which are administered by the National Research and Innovation Agency (BRIN, Indonesian: Badan Riset dan Inovasi Nasional). Their collective support through research funding, materials, and infrastructural facilities has been invaluable in enabling the successful completion of this study. The authors also acknowledge the contributions of Universitas Indonesia, Universitas Pendidikan Indonesia, Universitas Nurtanio, Universitas Majalengka, and Universitas Garut for providing research facilities and academic support throughout the project.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SR	Super-Resolution
CNN	Convolutional Neural Network
GAN	Generative Adversarial Network
HAT	Hybrid Attention Transformer
DRCT	Dense Residual Connected Transformer
MLP	Multi-Layer Perceptron
YOLO	You Only Look Once
SSD	Single-Shot Multibox Detector
PSNR	Peak Signal-to-Noise Ratio
SSIM	Structural Similarity Index
mAP	Mean Average Precision
LR	Low Resolution
HR	High Resolution
RDG	Residual Dense Group
STL	Swin Transformer Layer
C2PSA	Cross-Stage Partial Self-Attention
CSP	Cross-Stage Partial
SPPF	Spatial Pyramid Pooling—Fast
SiLU	Sigmoid Linear Unit
ReLU	Rectified Linear Unit
IoU	Intersection over Union

References

  1. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
  2. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  3. Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A survey of deep learning-based object detection. IEEE Access 2019, 7, 128837–128868. [Google Scholar] [CrossRef]
  4. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed Tools Appl. 2023, 82, 9243–9275. [Google Scholar] [CrossRef]
  5. Li, G.; Xie, H.; Yan, W.; Chang, Y.; Qu, X. Detection of Road Objects with Small Appearance in Images for Autonomous Driving in Various Traffic Situations Using a Deep Learning Based Approach. IEEE Access 2020, 8, 211164–211172. [Google Scholar] [CrossRef]
  6. Bagloee, S.A.; Tavana, M.; Asadi, M.; Oliver, T. Autonomous vehicles: Challenges, opportunities, and future implications for transportation policies. J. Mod. Transp. 2016, 24, 284–303. [Google Scholar] [CrossRef]
  7. Wan, L.; Sun, Y.; Sun, L.; Ning, Z.; Rodrigues, J.J.P.C. Deep Learning Based Autonomous Vehicle Super Resolution DOA Estimation for Safety Driving. IEEE Trans. Intell. Transp. Syst. 2021, 22, 4301–4315. [Google Scholar] [CrossRef]
  8. Shan, T.; Wang, J.; Chen, F.; Szenher, P.; Englot, B. Simulation-based lidar super-resolution for ground vehicles. Rob. Auton. Syst. 2020, 134, 103647. [Google Scholar] [CrossRef]
  9. Liang, D.; Geng, Q.; Wei, Z.; Vorontsov, D.A.; Kim, E.L.; Wei, M.; Zhou, H. Anchor Retouching via Model Interaction for Robust Object Detection in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5619213. [Google Scholar] [CrossRef]
  10. Deng, C.; Jing, D.; Han, Y.; Chanussot, J. Toward Hierarchical Adaptive Alignment for Aerial Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5615515. [Google Scholar] [CrossRef]
  11. Liang, D.; Zhang, J.W.; Tang, Y.P.; Huang, S.J. MUS-CDB: Mixed Uncertainty Sampling With Class Distribution Balancing for Active Annotation in Aerial Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5613013. [Google Scholar] [CrossRef]
  12. Ingle, P.Y.; Kim, Y.G. Real-Time Abnormal Object Detection for Video Surveillance in Smart Cities. Sensors 2022, 22, 3862. [Google Scholar] [CrossRef]
  13. Alsubaei, F.S.; Al-Wesabi, F.N.; Hilal, A.M. Deep Learning-Based Small Object Detection and Classification Model for Garbage Waste Management in Smart Cities and IoT Environment. Appl. Sci. 2022, 12, 2281. [Google Scholar] [CrossRef]
  14. Abdul-Khalil, S.; Abdul-Rahman, S.; Mutalib, S.; Kamarudin, S.I.; Kamaruddin, S.S. A review on object detection for autonomous mobile robot. IAES Int. J. Artif. Intell. 2023, 3, 1033–1043. [Google Scholar] [CrossRef]
  15. Xu, Z.; Zhan, X.; Xiu, Y.; Suzuki, C.; Shimada, K. Onboard Dynamic-Object Detection and Tracking for Autonomous Robot Navigation With RGB-D Camera. IEEE Robot. Autom. Lett. 2024, 9, 651–658. [Google Scholar] [CrossRef]
  16. Kim, H.; Kim, H.; Lee, S.; Lee, H. Autonomous Exploration in a Cluttered Environment for a Mobile Robot With 2D-Map Segmentation and Object Detection. IEEE Robot. Autom. Lett. 2022, 7, 6343–6350. [Google Scholar] [CrossRef]
  17. Rostianingsih, S.; Setiawan, A.; Halim, C.I. COCO (Creating Common Object in Context) Dataset for Chemistry Apparatus. Procedia Comput. Sci. 2020, 171, 2445–2452. [Google Scholar] [CrossRef]
  18. Tong, K.; Wu, Y. Rethinking PASCAL-VOC and MS-COCO dataset for small object detection. J. Vis. Commun. Image Represent. 2023, 93, 103830. [Google Scholar] [CrossRef]
  19. Na, B.; Fox, G.C. Object classifications by image super-resolution preprocessing for convolutional neural networks. Adv. Sci. Technol. Eng. Syst. 2020, 5, 476–483. [Google Scholar] [CrossRef]
  20. Shahriar, T.; Li, H. A Study of Image Pre-processing for Faster Object Recognition. arXiv 2020, arXiv:2011.06928. [Google Scholar] [CrossRef]
  21. Krishna, H.; Jawahar, C.V. Improving small object detection. In Proceedings of the 4th Asian Conference on Pattern Recognition, ACPR 2017, Nanjing, China, 26–29 November 2017; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2017; pp. 346–351. [Google Scholar] [CrossRef]
  22. Pang, Y.; Cao, J.; Wang, J.; Han, J. JCS-Net: Joint Classification and Super-Resolution Network for Small-Scale Pedestrian Detection in Surveillance Images. IEEE Trans. Inf. Forensics Secur. 2019, 14, 3322–3331. [Google Scholar] [CrossRef]
  23. Yang, Z.; Chai, X.; Wang, R.; Guo, W.; Wang, W.; Pu, L.; Chen, X. Prior Knowledge Guided Small Object Detection on High-Resolution Images. In Proceedings of the International Conference on Image Processing, ICIP, Taipei, China, 22–25 September 2019. [Google Scholar] [CrossRef]
  24. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2020. [Google Scholar] [CrossRef]
  25. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable Detr: Deformable Transformers for End-To-End Object Detection. In Proceedings of the ICLR 2021—9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  26. Zhang, H.; Mao, F.; Xue, M.; Fang, G.; Feng, Z.; Song, J.; Song, M. Knowledge Amalgamation for Object Detection With Transformers. IEEE Trans. Image Process. 2023, 32, 2093–2106. [Google Scholar] [CrossRef]
  27. Dai, Z.; Cai, B.; Lin, Y.; Chen, J. UP-DETR: Unsupervised Pre-training for Object Detection with Transformers. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar] [CrossRef]
  28. Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. AO2-DETR: Arbitrary-Oriented Object Detection Transformer. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 2342–2356. [Google Scholar] [CrossRef]
  29. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386. [Google Scholar] [CrossRef]
  30. Badue, C.; Guidolini, R.; Carneiro, R.V.; Azevedo, P.; Cardoso, V.B.; Forechi, A.; Jesus, L.; Berriel, R.; Paixão, T.M.; Mutz, F.; et al. Self-driving cars: A survey. Expert Syst. Appl. 2021, 165, 113816. [Google Scholar] [CrossRef]
  31. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the KITTI vision benchmark suite. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 6–21 June 2012. [Google Scholar] [CrossRef]
  32. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern. Anal. Mach. Intell. 2016, 38, 295–307. [Google Scholar] [CrossRef]
  33. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-Resolution. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar] [CrossRef]
  34. Tai, Y.; Yang, J.; Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  35. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced super-resolution generative adversarial networks. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2019. [Google Scholar] [CrossRef]
  36. Intodia, S.; Gupta, S.; Yeramalli, Y.; Bhat, A. Literature Review: Super Resolution for Autonomous Vehicles using Generative Adversarial Networks. In Proceedings of the 7th International Conference on Intelligent Computing and Control Systems, ICICCS 2023, Madurai, India, 17–19 May 2023. [Google Scholar] [CrossRef]
  37. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 105–114. [Google Scholar] [CrossRef]
  38. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating More Pixels in Image Super-Resolution Transformer. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–19 June 2023. [Google Scholar] [CrossRef]
  39. Hsu, C.-C.; Lee, C.-M.; Chou, Y.-S. DRCT: Saving Image Super-resolution away from Information Bottleneck. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; Available online: http://arxiv.org/abs/2404.00722 (accessed on 23 November 2024).
  40. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
  41. Zhang, D.; Huang, F.; Liu, S.; Wang, X.; Jin, Z. SwinFIR: Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution. arXiv 2022, arXiv:2208.11247. [Google Scholar]
  42. Haris, M.; Shakhnarovich, G.; Ukita, N. Task-Driven Super Resolution: Object Detection in Low-Resolution Images. In Communications in Computer and Information Science; Springer: Cham, Switzerland, 2021. [Google Scholar] [CrossRef]
  43. Liu, K.; Fu, Z.; Jin, S.; Chen, Z.; Zhou, F.; Jiang, R.; Chen, Y.; Ye, J. ESOD: Efficient Small Object Detection on High-Resolution Images. IEEE Trans. Image Process. 2024, 14, 183–195. [Google Scholar] [CrossRef]
  44. Musunuri, Y.R.; Kwon, O.S.; Kung, S.Y. SRODNet: Object Detection Network Based on Super Resolution for Autonomous Vehicles. Remote Sens. 2022, 14, 6270. [Google Scholar] [CrossRef]
  45. Yang, Q.; Huang, C.; Cao, L.; Song, Q.; Jiang, X.; Liu, X.; Yuan, C. CLAHR: Cascaded Label Assignment Head for High-Resolution Small Object Detection. IEEE Access 2024, 12, 15447–15457. [Google Scholar] [CrossRef]
  46. Chen, Z.; Ji, H.; Zhang, Y.; Zhu, Z.; Li, Y. High-Resolution Feature Pyramid Network for Small Object Detection on Drone View. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 475–489. [Google Scholar] [CrossRef]
  47. Li, J.; Zhang, Z.; Tian, Y.; Xu, Y.; Wen, Y.; Wang, S. Target-Guided Feature Super-Resolution for Vehicle Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8020805. [Google Scholar] [CrossRef]
  48. Truong, N.Q.; Nguyen, P.H.; Nam, S.H.; Park, K.R. Deep Learning-Based Super-Resolution Reconstruction and Marker Detection for Drone Landing. IEEE Access 2019, 7, 61639–61655. [Google Scholar] [CrossRef]
  49. Ma, S.; Xu, M.; Feng, W. Dam Crack Instance Segmentation Algorithm Based on Improved YOLOv8. IEEE Access 2025, 13, 84271–84283. [Google Scholar] [CrossRef]
  50. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  51. Zheng, X.; Bi, J.; Li, K.; Zhang, G.; Jiang, P. SMN-YOLO: Lightweight YOLOv8-Based Model for Small Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2025, 22, 8001305. [Google Scholar] [CrossRef]
  52. Terven, J.; Cordova-Esparza, D. A Comprehensive Review of YOLO: From YOLOv1 and Beyond. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  53. Li, B.; Huang, S.; Zhong, G. LTEA-YOLO: An Improved YOLOv5s Model for Small Object Detection. IEEE Access 2024, 12, 99768–99778. [Google Scholar] [CrossRef]
  54. Qiu, J.; Cai, F.; Fu, N.; Yao, Y. YOLO-Air: An Efficient Deep Learning Network for Small Object Detection in Drone-Based Imagery. IEEE Access 2025, 13, 79718–79735. [Google Scholar] [CrossRef]
  55. Yang, Y.; Wang, H.; Pang, P. SAIR-YOLO: An Improved YOLOv8 Network for Sea-Air Background IR Small Object Detection. IEEE Geosci. Remote Sens. Lett. 2025, 22, 7000505. [Google Scholar] [CrossRef]
  56. Lou, H.; Duan, X.; Guo, J.; Liu, H.; Gu, J.; Bi, L.; Chen, H. DC-YOLOv8: Small-Size Object Detection Algorithm Based on Camera Sensor. Electronics 2023, 12, 2323. [Google Scholar] [CrossRef]
  57. Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended Feature Pyramid Network for Small Object Detection. IEEE Trans. Multimed. 2022, 24, 1968–1979. [Google Scholar] [CrossRef]
  58. Mirzaei, B.; Nezamabadi-pour, H.; Raoof, A.; Derakhshani, R. Small Object Detection and Tracking: A Comprehensive Review. Sensors 2023, 23, 6887. [Google Scholar] [CrossRef]
  59. Palwankar, T.; Kothari, K. Real Time Object Detection using SSD and MobileNet. Int. J. Res. Appl. Sci. Eng. Technol. 2022, 10, 831–834. [Google Scholar] [CrossRef]
  60. Li, W.; Liu, K.; Zhang, L.; Cheng, F. Object detection based on an adaptive attention mechanism. Sci. Rep. 2020, 10, 11307. [Google Scholar] [CrossRef]
  61. Zhang, L.; Wang, M.; Jiang, Y.; Li, D.; Zhou, Y. SSRDet: Small Object Detection Based on Feature Pyramid Network. IEEE Access 2023, 11, 96743–96752. [Google Scholar] [CrossRef]
  62. Song, Z.; Zhang, Y.; Liu, Y.; Yang, K.; Sun, M. MSFYOLO: Feature fusion-based detection for small objects. IEEE Lat. Am. Trans. 2022, 20, 823–830. [Google Scholar] [CrossRef]
  63. Cao, C.; Wang, B.; Zhang, W.; Zeng, X.; Yan, X.; Feng, Z.; Liu, Y.; Wu, Z. An Improved Faster R-CNN for Small Object Detection. IEEE Access 2019, 7, 106838–106846. [Google Scholar] [CrossRef]
  64. Keys, R.G. Cubic Convolution Interpolation for Digital Image Processing. IEEE Trans. Acoust. 1981, 29, 1153–1160. [Google Scholar] [CrossRef]
  65. Duchon, C.E. Lanczos Filtering in One and Two Dimensions. J. Appl. Meteorol. 1979, 18, 1016–1022. [Google Scholar] [CrossRef]
  66. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  67. Haykir, A.A.; Öksuz, I. Enhancing Object Detection in Aerial Images Using Transformer-Based Super-Resolution. In Proceedings of the UBMK 2024—9th International Conference on Computer Science and Engineering, Antalya, Turkiye, 26–28 October 2024; pp. 966–971. [Google Scholar] [CrossRef]
  68. Zhai, S.; Shang, D.; Wang, S.; Dong, S. DF-SSD: An Improved SSD Object Detection Algorithm Based on DenseNet and Feature Fusion. IEEE Access 2020, 8, 24344–24357. [Google Scholar] [CrossRef]
  69. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  70. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  71. Musunuri, Y.R.; Kim, C.; Kwon, O.S.; Kung, S.Y. Object Detection Using ESRGAN With a Sequential Transfer Learning on Remote Sensing Embedded Systems. IEEE Access 2024, 12, 102313–102327. [Google Scholar] [CrossRef]
  72. Zheng, Z.; Cheng, Y.; Xin, Z.; Yu, Z.; Zheng, B. Robust Perception Under Adverse Conditions for Autonomous Driving Based on Data Augmentation. IEEE Trans. Intell. Transp. Syst. 2023, 24, 13916–13929. [Google Scholar] [CrossRef]
  73. Mostofa, M.; Ferdous, S.N.; Riggan, B.S.; Nasrabadi, N.M. Joint-SRVDNet: Joint super resolution and vehicle detection network. IEEE Access 2020, 8, 13916–13929. [Google Scholar] [CrossRef]
  74. Li, A.; Pan, Y.; Xu, Z.; Bi, H.; Gao, B.; Li, K.; Yu, H.; Chen, Y. MaTVT: A Transformer-Based Approach for Multi-Agent Prediction in Complex Traffic Scenarios. IEEE Trans. Veh. Technol. 2025, 99, 1–13. [Google Scholar] [CrossRef]
  75. Li, M.; Gao, J.; Zhao, L.; Shen, X. Adaptive Computing Scheduling for Edge-Assisted Autonomous Driving. IEEE Trans. Veh. Technol. 2021, 70, 5318–5331. [Google Scholar] [CrossRef]
  76. Tang, S.; Chen, B.; Iwen, H.; Hirsch, J.; Fu, S.; Yang, Q.; Palacharla, P.; Wang, N.; Wang, X.; Shi, W. VECFrame: A Vehicular Edge Computing Framework for Connected Autonomous Vehicles. In Proceedings of the IEEE International Conference on Edge Computing, Chicago, IL, USA, 5–10 September 2021; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2021; pp. 68–77. [Google Scholar] [CrossRef]
  77. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2018. [Google Scholar] [CrossRef]
  78. Bing, X.; Zhang, W.; Zheng, L.; Zhang, Y. Medical Image Super Resolution Using Improved Generative Adversarial Networks. IEEE Access 2019, 7, 145030–145038. [Google Scholar] [CrossRef]
  79. Gui, J.; Sun, Z.; Wen, Y.; Tao, D.; Ye, J. A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications. IEEE Trans. Knowl. Data Eng. 2023, 35, 3313–3332. [Google Scholar] [CrossRef]
  80. Wang, H.; Sun, J.; Diao, W.; Li, J.; Zhang, K. TAGAN: Texture and Attention Guided Generative Adversarial Network for Image Super Resolution. In Proceedings of the IEEE International Symposium on Circuits and Systems, Austin, TX, USA, 27 May–1 June 2022; pp. 3269–3273. [Google Scholar] [CrossRef]
  81. Asry, C.E.L.; Benchaji, I.; Douzi, S.; Ouahidi, B.E.L. A robust intrusion detection system based on a shallow learning model and feature extraction techniques. PLoS ONE 2024, 19, e0295801. [Google Scholar] [CrossRef]
  82. Jiang, T.; Cheng, J. Target Recognition Based on CNN with LeakyReLU and PReLU Activation Functions. In Proceedings of the 2019 International Conference on Sensing, Diagnostics, Prognostics, and Control, SDPC, Beijing, China, 15–17 August 2019. [Google Scholar] [CrossRef]
  83. El Mellouki, O.; Khedher, M.I.; El-Yacoubi, M.A. Abstract Layer for LeakyReLU for Neural Network Verification Based on Abstract Interpretation. IEEE Access 2023, 11, 33401–33413. [Google Scholar] [CrossRef]
  84. Xu, G.; Wang, Y.; Cheng, J.; Tang, J.; Yang, X. Accurate and Efficient Stereo Matching via Attention Concatenation Volume. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2461–2474. [Google Scholar] [CrossRef]
  85. Li, W.; Li, Y.; Chen, D.; Chan, J.C.W. Thin cloud removal with residual symmetrical concatenation network. J. Photogramm. Remote Sens. 2019, 153, 137–150. [Google Scholar] [CrossRef]
  86. Jatmika, S.; Patmanthara, S.; Wibawa, A.P.; Kurniawan, F. The Model of Local Wisdom for Smart Wellness Tourism with Optimization Multilayer Perceptron. J. Theor. Appl. Inf. Technol. 2024, 102, 640–652. [Google Scholar]
  87. Mao, M.; Hong, M. YOLO Object Detection for Real-Time Fabric Defect Inspection in the Textile Industry: A Review of YOLOv1 to YOLOv11. Sensors 2025, 25, 2270. [Google Scholar] [CrossRef] [PubMed]
  88. Tej, A.R.; Halder, S.S.; Shandeelya, A.P.; Pankajakshan, V. Enhancing Perceptual Loss with Adversarial Feature Matching for Super-Resolution. In Proceedings of the International Joint Conference on Neural Networks, Glasgow, UK, 19–24 July 2020. [Google Scholar] [CrossRef]
  89. Xie, L.; Wang, X.; Chen, X.; Li, G.; Shan, Y.; Zhou, J.; Dong, C. DeSRA: Detect and Delete the Artifacts of GAN-based Real-World Super-Resolution Models. Proc. Mach. Learn. Res. 2023, arXiv:2307.02457. [Google Scholar]
Figure 1. Proposed object detection method with Transformer-based Super-Resolution as a preprocessing stage.
Figure 2. The structure of DRCT as the Super-Resolution method.
Figure 3. The structure of YOLOv11 as the object detection method.
Figure 4. Distribution and characteristics of the vehicle dataset: (a) object class, (b) bounding boxes among classes, (c) image coordinates, and (d) object width and height.
Figure 5. Object detection performance with and without the Super-Resolution method: (a) mAP@50 and (b) mAP@50-95.
Figure 6. Performance gap across scaling factors (×2, ×3, and ×4): (a) mAP@50 and (b) standard deviation.
Figure 7. Comparative performance of object detection models across different super-resolution scale factors: (a) mAP@50, (b) Precision, and (c) F1-score.
Figure 8. Comparative object detection performance of each model: (a) mAP@50, (b) Precision, and (c) F1-score. Blue boxes indicate the interquartile range, red lines denote the median, and whiskers represent the minimum and maximum values.
Figure 9. Object detection performance without DRCT at the ×2 scale: (a) qualitative results and (b) confusion matrix.
Figure 10. Object detection performance with DRCT at the ×2 scale: (a) qualitative results and (b) confusion matrix.
Figure 11. Object detection performance without DRCT at the ×4 scale: (a) qualitative results and (b) confusion matrix.
Figure 12. Object detection performance with DRCT at the ×4 scale: (a) qualitative results and (b) confusion matrix.
Figure 13. Scatter plot showing the mAP@50–inference time trade-off across baseline, SR-based, ablation, and full DRCT configurations.
Wevj 16 00678 g013
Table 1. Related Works’ Summary.
Ref. | Core Method | Task Domain | Key Contribution | Limitation/Gap
[72] | Visual augmentation and fusion techniques based on unpaired image-to-image (I2I) translation for adverse weather conditions | Visual perception for autonomous vehicles under diverse adverse weather conditions (rain, fog, nighttime rain, and low illumination) | Integrates unpaired I2I synthesis for visual enhancement and augmentation, combined with a dual-branch architecture that processes both original and synthesized images. This approach strengthens visual perception and significantly improves object recognition accuracy across multiple adverse weather scenarios. | Although effective across various adverse conditions, the method does not specifically target extreme low-resolution scenarios and still relies on relatively adequate base image quality.
[44] | CNN-based SR + object detection using YOLO | Surveillance/vehicle detection | Integrates super-resolution and detection to improve accuracy on low-resolution images. | Has not been evaluated under complex real-road driving conditions.
[73] | Joint Super-Resolution and Vehicle Detection Network (Joint-SRVDNet), combining a Multi-scale GAN (MsGAN) for super-resolution with a jointly trained vehicle detector | Super-resolution of aerial images and vehicle detection on low-resolution aerial imagery | Demonstrates that the method provides superior visual quality and improves vehicle detection accuracy by jointly optimizing SR loss and detection loss, enabling hierarchical and discriminative feature learning. | Has not been evaluated under complex real-world driving conditions.
[74] | Multi-agent Trajectory Vector Transformer (MaTVT), consisting of a dual-level encoder (low-level and high-level) and a multi-modal decoder | Multi-agent trajectory prediction in complex traffic scenarios to support autonomous vehicle motion planning | Models future trajectories more accurately through hierarchical encodings of motion features, agent interactions, and environmental constraints. Evaluation on the Argoverse dataset shows that MaTVT outperforms benchmark methods in accuracy, efficiency, and robustness. | Although highly effective for trajectory prediction, the method does not address image processing or restoration tasks and is therefore not directly relevant to visual perception or super-resolution problems.
[67] | Transformer-based super-resolution using the Hybrid Attention Transformer for image restoration (HAT-L), integrated with YOLOv8 OBB for object detection | Aerial image super-resolution and object detection enhancement | Demonstrates that transformer-based super-resolution can improve visual quality and strengthen object detection accuracy. Using HAT-L, the method achieves high PSNR and SSIM on the DOTA validation set and yields improved mAP performance when combined with YOLOv8 OBB. | The approach is tailored to aerial imagery and the DOTA dataset, which limits its applicability to real-world ground-level autonomous driving scenarios.
Table 2. Dataset Description.
Attribute | Description
Dataset Name | Vehicles Dataset
Annotation Format | COCO JSON, Pascal VOC XML, YOLO TXT
Number of Classes | 3 (Car, Bus, Truck)
Number of Images | 4058
Image Resolution | 640 × 480
Augmentation | No
Preprocessing | No
Vehicle Condition | Front and Back View
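Because the low-resolution inputs in this study are obtained by synthetic down-sampling of the 640 × 480 frames, the short sketch below shows one plausible way to generate the ×2, ×3, and ×4 LR variants with bicubic resampling. The directory layout, file extensions, and output naming scheme are illustrative assumptions, not the exact preprocessing script used by the authors.

```python
# Illustrative generation of synthetic low-resolution variants (x2, x3, x4)
# from 640x480 source frames using bicubic down-sampling.
from pathlib import Path
from PIL import Image

SRC_DIR = Path("vehicles/images")   # assumed location of the 640x480 frames
OUT_DIR = Path("vehicles/lr")       # assumed output root
SCALES = (2, 3, 4)

for img_path in SRC_DIR.glob("*.jpg"):
    hr = Image.open(img_path).convert("RGB")
    for s in SCALES:
        # Integer division keeps the aspect ratio close to the original frame.
        lr = hr.resize((hr.width // s, hr.height // s), Image.BICUBIC)
        out = OUT_DIR / f"x{s}" / img_path.name
        out.parent.mkdir(parents=True, exist_ok=True)
        lr.save(out)
```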
Table 3. Environment Description.
Environment | Configuration
CPU | Intel(R) Core™ i7-10700F (Intel Corporation, Santa Clara, CA, USA)
Memory | 32 GB
Graphics Card | NVIDIA GeForce RTX 3050 (NVIDIA Corporation, Santa Clara, CA, USA)
Operating System | Windows 10
Programming Language | Python 3.13.7
Deep Learning Framework | Ultralytics 8.2.103
Integrated Development Environment | Visual Studio Code (version 1.85.1, Microsoft Corporation, Redmond, WA, USA)
CUDA | 11.8
cuDNN | 9.1.0
Table 4. Quantitative results of object detection performance with and without the Super-Resolution method.
Model | Using SR | Scale | PSNR/SSIM | mAP@50 | mAP@50-95 | Precision | Recall | F1-Score
YOLOv8n [55] | No | ×2 | - | 0.77965 | 0.61305 | 0.79032 | 0.75759 | 0.72573
YOLOv8n [55] | No | ×3 | - | 0.74522 | 0.51852 | 0.77152 | 0.72522 | 0.70235
YOLOv8n [55] | No | ×4 | - | 0.71496 | 0.43852 | 0.74985 | 0.69595 | 0.68193
YOLOv11n [87] | No | ×2 | - | 0.77765 | 0.59904 | 0.77619 | 0.75532 | 0.72859
YOLOv11n [87] | No | ×3 | - | 0.748785 | 0.52476 | 0.77404 | 0.71522 | 0.71165
YOLOv11n [87] | No | ×4 | - | 0.72396 | 0.44571 | 0.77323 | 0.68252 | 0.69609
Bicubic Interpolation + YOLOv11n | Yes | ×2 | 29.14/0.911 | 0.7841 | 0.60776 | 0.781318 | 0.76245 | 0.77113
Bicubic Interpolation + YOLOv11n | Yes | ×3 | 27.45/0.8955 | 0.7582 | 0.53313 | 0.773425 | 0.72131 | 0.74621
Bicubic Interpolation + YOLOv11n | Yes | ×4 | 25.43/0.8752 | 0.7313 | 0.45213 | 0.766121 | 0.6912 | 0.72734
EDSR + YOLOv5 [44] | Yes | ×2 | 38.04/0.9446 | 0.83709 | 0.54692 | 0.78624 | 0.77762 | 0.76594
EDSR + YOLOv5 [44] | Yes | ×3 | 34.29/0.9145 | 0.80521 | 0.51516 | 0.77915 | 0.74515 | 0.73155
EDSR + YOLOv5 [44] | Yes | ×4 | 32.31/0.8887 | 0.74086 | 0.48183 | 0.77179 | 0.70811 | 0.69806
MsSRGAN + YOLOv3 [73] | Yes | ×2 | 36.54/0.9336 | 0.83819 | 0.54802 | 0.78734 | 0.77872 | 0.76704
MsSRGAN + YOLOv3 [73] | Yes | ×3 | 32.79/0.9035 | 0.80631 | 0.51626 | 0.78015 | 0.74625 | 0.73305
MsSRGAN + YOLOv3 [73] | Yes | ×4 | 30.81/0.8717 | 0.74196 | 0.48293 | 0.77289 | 0.70921 | 0.69996
DRCT + YOLOv11n (Ours) | Yes | ×2 | 39.05/0.9647 | 0.88219 | 0.61094 | 0.82782 | 0.82656 | 0.81208
DRCT + YOLOv11n (Ours) | Yes | ×3 | 35.3/0.9346 | 0.84029 | 0.56675 | 0.83112 | 0.79012 | 0.77951
DRCT + YOLOv11n (Ours) | Yes | ×4 | 33.32/0.9089 | 0.80335 | 0.5433 | 0.83964 | 0.74302 | 0.75119
Ground Truth + YOLOv11n | No | ×2 | - | 0.85834 | 0.63945 | 0.84267 | 0.81511 | 0.82802
Ground Truth + YOLOv11n | No | ×3 | - | 0.84323 | 0.61747 | 0.82442 | 0.79685 | 0.80854
Ground Truth + YOLOv11n | No | ×4 | - | 0.83112 | 0.59674 | 0.80522 | 0.77578 | 0.79114
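As a reference for how the image-quality columns in Table 4 can be reproduced, the sketch below computes PSNR and SSIM between a super-resolved frame and its ground-truth high-resolution counterpart using scikit-image. The file names are placeholders, and this is a generic evaluation recipe rather than the authors' exact scoring script.

```python
# Generic PSNR/SSIM evaluation between an SR output and its HR ground truth.
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

hr = np.asarray(Image.open("frame_hr.png").convert("RGB"))    # placeholder paths
sr = np.asarray(Image.open("frame_sr_x2.png").convert("RGB"))

psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
ssim = structural_similarity(hr, sr, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```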
Table 5. Complexity description of the object detection models.
Name of the Model | Scale | Size (Pixel) | Params (M) | FLOPs (G) | Inference Time (ms)
Low Resolution Images + YOLOv8n [55] | ×2 | 320 × 240 | 3.21 | 8.72 | 19.18
Low Resolution Images + YOLOv8n [55] | ×3 | 210 × 160 | 3.15 | 8.61 | 18.94
Low Resolution Images + YOLOv8n [55] | ×4 | 160 × 120 | 3.11 | 8.49 | 18.68
Low Resolution Images + YOLOv11n [87] | ×2 | 320 × 240 | 2.71 | 6.52 | 14.34
Low Resolution Images + YOLOv11n [87] | ×3 | 210 × 160 | 2.57 | 6.21 | 13.66
Low Resolution Images + YOLOv11n [87] | ×4 | 160 × 120 | 2.45 | 6.19 | 13.15
Bicubic Interpolation + YOLOv11n | ×2 | 320 × 240 | 2.98 | 15.81 | 34.76
Bicubic Interpolation + YOLOv11n | ×3 | 210 × 160 | 2.87 | 15.82 | 33.67
Bicubic Interpolation + YOLOv11n | ×4 | 160 × 120 | 2.45 | 15.03 | 28.98
EDSR + YOLOv5 [44] | ×2 | 640 × 480 | 11.62 | 15.8 | 34.76
EDSR + YOLOv5 [44] | ×3 | | 12.55 | 15.8 | 34.76
EDSR + YOLOv5 [44] | ×4 | | 12.91 | 15.9 | 34.98
MsSRGAN + YOLOv3 [73] | ×2 | | 23.72 | 18.95 | 41.69
MsSRGAN + YOLOv3 [73] | ×3 | | 23.91 | 19.08 | 41.98
MsSRGAN + YOLOv3 [73] | ×4 | | 24.77 | 19.25 | 42.29
DRCT + YOLOv11n (Ours) | ×2 | | 30.18 | 17.57 | 38.65
DRCT + YOLOv11n (Ours) | ×3 | | 30.14 | 17.52 | 38.54
DRCT + YOLOv11n (Ours) | ×4 | | 30.11 | 17.49 | 38.48
Ground Truth Image + YOLOv11n | ×2 | | 2.98 | 15.81 | 34.76
Ground Truth Image + YOLOv11n | ×3 | | 2.97 | 15.81 | 34.76
Ground Truth Image + YOLOv11n | ×4 | | 2.98 | 15.82 | 34.76
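The parameter counts and inference times reported in Table 5 can be measured with a simple PyTorch routine like the one below; the warm-up and averaging scheme shown here is a common benchmarking convention assumed for illustration, not a description of the authors' exact measurement protocol.

```python
# Generic parameter-count and inference-time measurement for a PyTorch model.
import time
import torch

def count_params_millions(model: torch.nn.Module) -> float:
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def mean_inference_ms(model, input_shape=(1, 3, 480, 640), runs=100, warmup=10):
    """Average forward-pass latency in milliseconds on the available device."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):        # warm-up passes to stabilize clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0
```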
Table 6. Comparison of baseline, proposed DRCT, and ablation variants with their key components and performance metrics.
Name of the Model | Upsampling Strategy | DRCT Components | Technical Description | mAP@50 | Total Parameters (M) | Inference Time (ms)
Lower Bound | None (Low-Resolution Input) | - | Direct detection from LR input without SR or DRCT enhancement | 0.77765 | 2.71 | 23.8
Baseline (Bicubic Interpolation + YOLOv11n) | ×2 | - | Conventional interpolation; the LR image is upscaled before YOLO detection, without an SR model | 0.7841 | 2.71 | 25.1
Proposed (DRCT-L + YOLOv11n) | ×2 | Full Architecture | Full version of the proposed SR model combining Dense Residual blocks and Transformer modules | 0.88219 | 30.18 | 35.7
Ablation 1 (DRCT-Var1) | ×2 | Without Dense Connection | Removes Dense Connectivity blocks to evaluate their contribution to the overall performance | 0.84029 | 25.44 | 34.5
Ablation 2 (DRCT-Var2) | ×2 | Without Transformer mechanism | Eliminates the Transformer component, retaining only convolutional pathways | 0.80335 | 29.75 | 37.01
Ablation 3 (DRCT-Var3) | ×2 | Reduced Transformer Depth | Uses a lightweight Transformer (fewer MSA layers) to reduce computational complexity | 0.8211 | 27.01 | 33.21
Ablation 4 (DRCT-Var4) | ×2 | Without Skip Connections | Removes residual skip connections to examine stability and error propagation in SR reconstruction | 0.79245 | 28.55 | 36.47
Upper Bound | None (High-Resolution Input) | - | Theoretical upper limit: direct detection on original high-resolution images without SR | 0.85834 | 2.71 | 23.8
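The accuracy versus latency trade-off visualized in Figure 13 can be reproduced directly from the values in Table 6; the short matplotlib sketch below is one way to do so, with labels and styling chosen purely for illustration.

```python
# Sketch of the mAP@50 vs. inference-time trade-off using the Table 6 values.
import matplotlib.pyplot as plt

configs = {
    "Lower Bound (LR)":        (23.80, 0.77765),
    "Bicubic + YOLOv11n":      (25.10, 0.78410),
    "DRCT-L (Proposed)":       (35.70, 0.88219),
    "Ablation 1 (no dense)":   (34.50, 0.84029),
    "Ablation 2 (no transf.)": (37.01, 0.80335),
    "Ablation 3 (light)":      (33.21, 0.82110),
    "Ablation 4 (no skip)":    (36.47, 0.79245),
    "Upper Bound (HR)":        (23.80, 0.85834),
}

fig, ax = plt.subplots(figsize=(6, 4))
for name, (ms, map50) in configs.items():
    ax.scatter(ms, map50)
    ax.annotate(name, (ms, map50), fontsize=8,
                xytext=(3, 3), textcoords="offset points")
ax.set_xlabel("Inference time (ms)")
ax.set_ylabel("mAP@50")
ax.set_title("Accuracy vs. latency across configurations")
plt.tight_layout()
plt.show()
```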
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

