A Feature Distillation Network to Enable Object Detection on an FPGA Platform in Poor Visibility Conditions

Bhattacharya, Jhilik; Molina, Romina; Crespo, Maria Liz; Carini, Alberto; Marsi, Stefano; Ramponi, Giovanni

doi:10.3390/electronics15112454


by Jhilik Bhattacharya ¹, Romina Molina ², Maria Liz Crespo ², Alberto Carini ³,*, Stefano Marsi ³ and Giovanni Ramponi ³

¹ CSED, Thapar Institute of Engineering and Technology, Patiala 147004, India
² Multidisciplinary Laboratory, Science, Technology and Innovation Unit, International Center of Theoretical Physics, 34151 Trieste, Italy
³ DIA, University of Trieste, 34127 Trieste, Italy
* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2454; https://doi.org/10.3390/electronics15112454

Submission received: 30 April 2026 / Revised: 24 May 2026 / Accepted: 29 May 2026 / Published: 4 June 2026

(This article belongs to the Topic Edge AI: From Intelligent Sensing to AI-Dedicated Hardware)


Abstract

In this paper, we propose and evaluate a feature distillation technique for object detection under poor visibility conditions, and we analyze its impact when deployed on an FPGA platform. We demonstrate via extensive experiments how different detection architectures generalize across scenes, and we infer that a scale-permuted feature extraction is the ideal choice for detection tasks in unconstrained environments with an 11–12% gain. As verified by the experiments, image enhancement often fails to provide significant detection gains. We hence introduce a joint training in a scale-permuted student network that learns dehazed features from a dual teacher network without an explicit dehazing step. The student learns to replicate not only the teacher outputs but also the decision-making process of the teacher by using attention transfer. Although the overall goal is to produce a real-time system capable of providing driving assistance in challenging scenarios, the FPGA implementation of a scale-permuted network is the first of its kind. To achieve effective implementation of the model in FPGA technology, a high-level synthesis approach and model compression techniques are employed to obtain a deployment with a good trade-off between quality and memory footprint metrics. We develop two distilled models using the joint feature distillation technique and show that these perform better in poor visibility scenes when compared to other detectors with similar size or even bigger sizes in some cases. Our 8.5 M model shows an mAP gain of almost 1% compared to YOLOv10-M with 15 M parameters, on the Cityscapes Hazy dataset. On night images from the BDD dataset, our 8.5 M model shows an approximate mAP gain of 4% compared to YOLO26-S with 9.5 M parameters. We further perform cross-domain testing with the DriveIndia dataset to show that our models generalize well beyond the distillation distribution and can be used for generic driving scenarios.

Keywords: object detection; distillation; enhancement; deep learning; dehazing; edge computing

1. Introduction

The accurate detection, localization, and recognition of multiple objects in an image has always been a crucial task in most computer vision applications. In this field, deep-learning-based object detection has seen remarkable progress in recent decades. The continuous generation of novel network architectures and the introduction of more sophisticated components of such networks have brought significantly increased performances on standard training and testing datasets. The wider adoption of these tools for practical applications across different areas—ranging from estimating traffic and crowds to industrial manufacturing, surveillance, and autonomous navigation—has been hampered by some problems still present in the optimization and integration of these techniques. The current focus, and the focus of our contribution, is on reusability, real-time operation, and reliability in real-world environments [1,2].

Along this course, the first step is to choose the network that best suits the target application in terms of resource requirements and detection precision. Then, common practices adopted for reducing computational complexity involve model pruning, quantization, and model distillation. Structured or unstructured pruning can be used to remove units with low utilization, hence resulting in a more manageable model [3]. Data quantization from float-32 model weights to float-16, and even int-8 if necessary, can also reduce the computation load to a great extent [3]. Distillation techniques help transfer the learning of a large network (referred to as a teacher model) to a smaller network (student model); they can be grouped into response-based, feature-based, and relation-based categories [4]. While response-based techniques use the final output of the network for the distillation process, feature-based and relation-based techniques exploit the intermediate features and their flows between the layers in the network, providing precise knowledge transfer from the teacher to the student [4].
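To make the quantization step concrete, the following is a minimal sketch of symmetric per-tensor int-8 quantization. It is an illustration under stated assumptions, not the scheme used in this work: the function names `quantize_int8` and `dequantize`, the per-tensor scale, and the random weight tensor are all hypothetical.

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to int8.

    Returns the int8 tensor and the scale needed to map it back to floats.
    """
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * np.float32(scale)


# Hypothetical layer weights, used only to exercise the functions.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(64, 64)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Rounding to the nearest level bounds the error by about half a step.
max_err = float(np.max(np.abs(w - w_hat)))
```

The int-8 tensor occupies a quarter of the memory of the float-32 original, and the reconstruction error per weight stays within roughly `scale / 2`.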

In the current work, we explore the case of an object detection task for autonomous driving in poor visibility conditions using limited computational resources. The end implementation focuses on an FPGA unit; therefore, one of the main requirements is to obtain sufficient model compression. Vision-based systems for surveillance, assisted driving, or fully autonomous vehicles must work with a robust object detection algorithm that functions satisfactorily in all environmental conditions. The criticality of the problem is increased by the fact that real-time object detection systems required for low-powered devices are implemented with versions of the original base networks that often provide lower detection precision compared to their larger base version. For example, the performance of YOLO26-S (9.5 M) [5] on the COCO dataset drops by 9% compared to its base model YOLO26-X (55.7 M).

In our contribution, we first focus on the following research question:

Given the performance of current benchmark object detection techniques for poor-visibility images, which is the most appropriate choice of a teacher architecture for adaptation on FPGA?

Object detection models often struggle in cases of degraded images where object boundaries and texture are blurred due to haze or low light [6]. We infer via several guided experiments that the scale-dependent fusion techniques [5,7] used in object detection networks suffer due to these poor contrast and obscured features. At the same time, permuting scale inputs at various levels helps increase robustness even with out-of-distribution data.

We then tackle a second research question:

When inference speed is a priority, how much weight should be given to the gain obtained by adding an image enhancement step prior to object detection?

Indeed, to provide better object detection performances for poor-quality images, the images are often enhanced using suitable preprocessing techniques. Dehazing techniques often add artificial noise and structural distortions that hamper the performance of the object detectors [6,8]. Also, running image enhancement prior to object detection during real-time applications further increases the computational cost. We systematically demonstrate the results of object detection after the application of different benchmark dehazing techniques along with their computational parameters. It is observed that while most of these image restoration techniques are trained with a huge number of parameters and show prominent image quality metrics on synthetic datasets, their visual quality or object detection precision improvement is not significant when applied to real images, due to domain shift between synthetic and real conditions.

The final research question that we consider is as follows:

Knowledge distillation techniques are capable of transferring the learning from a larger model to a smaller one [9]. Can the same principle be applied for learning object detection features of an enhanced image from a degraded one without involving an explicit enhancement step?

We propose a distillation framework that jointly performs feature-level dehazing and compression for object detection under poor visibility conditions, and we evaluate its hardware impact when deployed on an FPGA unit. We make the following main contributions:

We show that, compared to state-of-the-art (SOA) object detectors that rely on hierarchical fusion with a feature pyramid network or attention-based fusion with scale dependency, a scale-permuted architecture is best suited for degraded images. Such an architecture not only provides more robustness to domain shifts without fine-tuning but is also well suited for knowledge distillation.
We propose a joint distillation technique based on dual-teacher training, in which the compressed model not only distills knowledge from a larger object detection backbone via output mapping but also learns the teacher’s decision-making process through attention transfer, while simultaneously distilling feature maps from enhanced images without requiring an explicit denoising step.
We deploy the compressed network on an FPGA unit. To the best of our knowledge, FPGA implementation of a scale-permuted network for object detection in poor visibility conditions is the first of its kind.

The three items above are the novel contributions of our paper. As a further contribution, we show that our method generalizes well: after the initial training and testing, we apply it to entirely different datasets without any change to the architecture or fine-tuning of the parameters.

The rest of this article is organized in the following sections: Section 2 provides a brief background on real-time detection with FPGAs. Section 3 provides a concise understanding of the architecture choice for the detection task. The distillation network developed in the current work is discussed in Section 4, while Section 5 summarizes the inferences drawn from the various experiments. Section 6 presents the conclusions and future directions of this research.

2. Real-Time Object Detection on FPGAs

Recent research has extensively explored FPGA-based acceleration of YOLO-based object detection architectures. Zhang et al. [10] proposed an end-to-end FPGA implementation for YOLOv2 and YOLOv3 that efficiently manages data transfer of weights and feature maps from external DDR memory. Their framework, evaluated on the KC705 and VC709 platforms, leverages the Roofline model to guide design space exploration and performance optimization. REQ-YOLO [11] introduces a resource-aware weight quantization framework that jointly optimizes software and hardware components to better exploit FPGA resources. In [12], the authors present Sim-YOLOv2, a binary-weight variant of YOLOv2 with 3- to 6-bit activation quantization. A scalable streaming architecture is introduced to minimize off-chip memory access, enabling the entire network to be stored in on-chip memory. Operating on 300 × 300 inputs, the design achieves a latency of 9.15 ms on the VC707 FPGA, at the expense of an accuracy reduction of 10.15%. Communication with the host system is handled via PCI Express. Cai et al. [13] developed a parameterized FPGA architecture for YOLOv3-Tiny, a lightweight variant of YOLOv3, employing DMA-based data transfers between off-chip memory and the accelerator. Their system reports a latency of 532 ms. The authors of [14] proposed an FPGA-based real-time object detection and classification system using YOLOv3-Tiny, tailored for edge computing with a focus on traffic light detection. The model is quantized using INT8 fixed-point arithmetic, significantly reducing memory bandwidth requirements while improving power efficiency and computational throughput. Their implementation achieves 99% detection accuracy while consuming only 3.5 W of power. A low-power, low-latency FPGA accelerator for vehicle detection and tracking was introduced in [15], based on compressed YOLOv3 and YOLOv3-Tiny models. Targeting smart transportation applications, the system maintains robust performance under challenging conditions, achieving parameter reductions of 85.7% and 98.2%, respectively, using INT16 quantization.

More recently, Sha et al. [2] demonstrated FPGA acceleration of YOLOv6, achieving a mean average precision (mAP) of 84.9% on the PASCAL VOC2007 dataset at a resolution of 352 × 352. Their optimization pipeline includes input resolution reduction and quantization-aware training, substantially reducing computational cost with negligible accuracy degradation. Notably, the entire YOLOv6 model fits within on-chip memory, eliminating reliance on energy-intensive DRAM access.

Beyond YOLO architectures, single-stage detectors such as SSD have also been extensively explored on FPGA platforms. Kang et al. [16] presented a real-time SSD-based detection framework using a VGG16 backbone, achieving full on-chip deployment on an XC7VX690T FPGA. By applying accelerator-aware pruning to reach 87.5% sparsity, the system processes 640 × 480 images at 42 FPS with an mAP of 78.13%. Yu et al. [17] proposed a heterogeneous FPGA–CPU acceleration framework for SSD, targeting a system comprising an Intel Xeon Silver 4116 CPU and an Arria 10 FPGA. Two SSD variants (SSD+Inception and SSD+MobileNet) are implemented using reduced-precision floating-point formats (16-bit and 11-bit), with 70–80% of the computation offloaded to the FPGA. Kim et al. [18] further presented a real-time FPGA implementation of SSDLite-MobileNetV2 with partial 8-bit quantization on the ZC706 platform.

Despite the dominance of YOLO-based approaches in FPGA object detection research, comparatively limited attention has been given to detection under poor visibility conditions, such as adverse weather or low-light environments. An exception is the recent work by Vaithianathan et al. [1], which proposes an FPGA-based real-time object detection system for autonomous driving that maintains high accuracy under challenging environmental conditions. FPGA-based acceleration of dehazing and low-light enhancement networks has also been explored in [19,20,21]. Overall, we observe a dominant research trend toward YOLO-based acceleration on FPGA platforms, incorporating specialized pruning and post-quantization strategies [22], as well as FPGA implementations targeting image enhancement tasks such as dehazing and low-light improvement. While end-to-end frameworks for detection-friendly dehazing [8] and haze-aware distillation techniques [9] have been investigated, there remains a clear need for robust and resource-efficient detection frameworks that explicitly address degraded imaging conditions in real-world deployments while leveraging teacher-driven learning trajectories.

Our proposed framework distinguishes itself from existing work in two principal respects: First, it employs a scale-permuted detection backbone, as opposed to conventional scale-dependent architectures. Through extensive experimentation, we demonstrate that decoupling scale representations mitigates the effects of domain shift, thereby improving detector generalization. To the best of our knowledge, this is the first implementation of a scale-permuted detection architecture tailored for FPGA deployment.
Second, we introduce a dual-teacher training paradigm that combines multilevel feature distillation with the learning of an attention-guided teacher trajectory. This design highlights that, beyond matching teacher outputs, it is equally important for the student to capture the underlying reasoning process through which the teacher arrives at its predictions. Experimental results show that this training strategy significantly outperforms training from scratch and is particularly beneficial for lightweight models.
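The idea of matching the teacher's attention rather than only its outputs can be sketched as follows. This is a minimal illustration that assumes the common activation-based form of attention transfer (channel-wise sum of squared activations, normalized per feature level); the exact formulation used in this work is not given in this section, and the function names and tensor shapes below are hypothetical.

```python
import numpy as np


def attention_map(features: np.ndarray) -> np.ndarray:
    """Collapse a (C, H, W) feature tensor into a unit-norm spatial attention vector."""
    # Sum of squared activations over channels gives one value per location.
    energy = np.sum(features ** 2, axis=0).reshape(-1)
    return energy / (np.linalg.norm(energy) + 1e-12)


def attention_transfer_loss(student_feats, teacher_feats) -> float:
    """Sum over feature levels of the squared distance between attention maps."""
    total = 0.0
    for f_s, f_t in zip(student_feats, teacher_feats):
        diff = attention_map(f_s) - attention_map(f_t)
        total += float(np.sum(diff ** 2))
    return total


# Hypothetical two-level features: the student is narrower (64 channels)
# than the teacher (256 channels) but shares the spatial resolution.
rng = np.random.default_rng(0)
teacher_feats = [rng.normal(size=(256, 16, 16)), rng.normal(size=(256, 8, 8))]
student_feats = [rng.normal(size=(64, 16, 16)), rng.normal(size=(64, 8, 8))]

at_loss = attention_transfer_loss(student_feats, teacher_feats)
```

Because the channel dimension is collapsed before comparison, the student and teacher may have different widths, which is what makes this kind of loss convenient for compressing a large backbone into a lightweight one.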

3. Image Restoration and Detection for Degraded Images

In this section, we demonstrate that scale-permuted backbone architectures exhibit stronger generalization across datasets, making them particularly well suited for knowledge distillation. We further show that applying explicit image enhancement prior to object detection provides limited benefits when weighed against the added computational and architectural complexity. To support these claims, we evaluate a range of detection architectures that achieve average precision values in the range of 40–50% on the COCO validation dataset. Specifically, we consider RetinaNet [23], CenterNet [24], EfficientDet-D6 [7], SpineNet [25], YOLOv10 [26], and RT-DETR [27] to assess detection performance under poor visibility conditions. The selection of these models was motivated not only by their competitive inference accuracy but also by the diverse architectural and training innovations that they introduce. RetinaNet serves as a foundational one-stage detector that addresses class imbalance through the use of focal loss. CenterNet employs an hourglass-style feature extraction network composed of sequentially stacked downsampling and upsampling convolutional blocks with skip connections, enabling effective keypoint-based object localization. EfficientDet utilizes ImageNet-pretrained EfficientNet backbones and performs repeated bidirectional feature fusion across multiple spatial resolutions before classification and bounding-box regression. SpineNet introduces a scale-permuted backbone that enables flexible multi-scale feature extraction. YOLOv10 eliminates non-maximum suppression through an end-to-end training formulation to reduce inference latency. Among the class of transformer-based detectors, RT-DETR achieves high detection precision with a relatively low parameter count through end-to-end optimization.

For evaluation, we conducted experiments on multiple datasets containing traffic scenes affected by severe visibility degradation, including RESIDE (RTTS) [28], DawnHaze [29], and Cityscapes [30,31]. These datasets comprise 2591, 300, 500, and 1500 images, respectively, and represent real-world urban environments under heavy fog conditions. In all cases, we evaluated detection performance across five object categories: car, bus, bicycle, person, and motorcycle.

Table 1 reports the average precision achieved by the selected pretrained models without any fine-tuning on the target datasets. While these architectures significantly outperform conventional Faster R-CNN and Mask R-CNN models on the COCO validation benchmark, the results indicate a pronounced degradation in performance when applied to unseen datasets with adverse visibility conditions. This observation highlights the limited cross-domain generalization of standard detection pipelines and motivates the use of scale-permuted distillation strategies to improve robustness under image degradation.

The above observations give rise to several key insights and motivating questions:

High detection accuracy on the COCO benchmark does not necessarily translate to strong cross-dataset generalization or reliable performance in real-world, real-time scenarios.
It is therefore essential to determine whether the observed performance degradation across datasets is primarily caused by poor image quality or by limitations in the generalization capability of the detection architectures.
A natural question that follows is whether the application of appropriate enhancement or dehazing techniques can improve downstream detection performance.

To address these questions, we applied a range of image enhancement algorithms to the hazy datasets and evaluated whether detection performance improves when inference is performed on the enhanced images. A large body of prior work on dehazing focuses on improving the perceptual quality of images, with most methods being trained on synthetically generated hazy data for which paired ground-truth supervision is available. The methods selected for evaluation span a broad spectrum of architectural paradigms, ranging from early CNN-based dehazing networks to attention-driven, transformer-based, and more recent prompt-guided approaches.

Among early learning-based methods, AOD-Net (AOD) [32] introduced an updated atmospheric scattering model for end-to-end dehazing. Other representative approaches include GCANet [33], which employs gated context aggregation to adaptively fuse contextual information, and the Multi-Scale Boosted Dehazing Network with Dense Feature Fusion (MSBDN) [34], which combines multi-scale feature extraction with dense feature fusion. The effectiveness of attention mechanisms in selectively enhancing degraded regions is demonstrated by FFA-Net [35] and ChaIR [36].

Transformer-based architectures have also gained prominence in image restoration tasks, with DEHAZEFORMER [37] and MAXIM [38] representing widely adopted models capable of handling multiple image processing objectives. In addition, we evaluate several prompt-based enhancement methods, including PROMPTIR [39], CAPTNET [40], FDTANet [41], and DIFFUIR [42], which have attracted significant attention in recent years. These approaches generate task-specific prompts conditioned on the input image to regulate the type of restoration applied. As a result, they are typically trained on heterogeneous datasets comprising hazy, rainy, and noisy images, enabling them to adapt to a wide range of degradation scenarios.

Finally, in contrast to the predominantly supervised enhancement models discussed above, we also include unsupervised image restoration techniques such as D4 [43] and H2CGAN [44] to provide a comprehensive comparison across supervision paradigms.

The results reported in Table 2 summarize object detection performance on dehazed and enhanced images. We employed SpineNet-143 for detection, as it achieves the highest precision among the evaluated architectures, as demonstrated in Table 1. Interestingly, the results indicate that most enhancement techniques yield little to no improvement in detection performance. In several cases, dehazing even leads to a substantial degradation in detection accuracy. Among the evaluated methods, ChaIR exhibits only marginal gains of 0.1 and 0.8 mAP on the RESIDE and DawnHaze datasets, respectively. Other techniques that show slight improvements include D4 on DawnHaze, and MAXIM and DIFFUIR on RESIDE.

Despite achieving strong objective image quality scores on dehazing tasks using synthetically generated hazy images, the majority of enhancement techniques fail to sufficiently improve image quality for downstream object detection on real hazy scenes. Overall, we observed that unsupervised enhancement methods such as D4 and H2CGAN tend to be more favorable for detection tasks compared to supervised counterparts. However, even in these cases, the observed gains are marginal. Given the additional computational complexity introduced by an explicit enhancement stage, such limited improvements do not justify its inclusion when targeting low-power or resource-constrained deployment scenarios. To visually illustrate the qualitative outputs of different enhancement techniques, we present sample dehazed images from the Cityscapes dataset [30,31] in Figure 1.

A natural question that arises from these observations is whether the nominal gains achieved by certain enhancement techniques are specific to SpineNet or if similar trends hold across other detection architectures. To this end, Table 3 reports the mAP achieved by selected enhancement methods—D4, MAXIM, and ChaIR—across multiple detectors. The results reveal detector-dependent behavior: RT-DETR benefits from all three enhancement techniques, YOLO exhibits improvement only with MAXIM, and EfficientDet shows no performance gain with any of the evaluated methods. Nevertheless, two consistent conclusions emerge from these experiments:

When performance gains are observed, they remain marginal across all detectors.
The best detection performance obtained using enhanced images, regardless of the detector, remains significantly lower than the precision achieved by SpineNet operating directly on degraded images.

While several studies have shown that end-to-end pipelines, jointly optimizing enhancement and detection, can outperform sequential approaches [6,8], these improvements are often restricted to the domains on which the models are trained. For instance, as reported in [8], detection performance on the RESIDE (RTTS) dataset reaches 47.03 mAP with Faster R-CNN and 43.90 mAP with MSBDN + Faster R-CNN, improving to 53.15 mAP when joint optimization is applied. However, as shown in Table 1, the baseline SpineNet achieves a significantly higher detection performance of 62.66 mAP when applied directly to degraded images. This indicates that the choice of detection architecture plays a crucial role and may be at least as important as the training strategy.

We conducted an additional experiment on the Cityscapes dataset to examine how detection performance varies under different haze levels using pretrained detection models trained on COCO. Cityscapes is a traffic-oriented dataset comprising 500 validation images with synthetically applied haze at attenuation levels α = 0.005, α = 0.01, and α = 0.02, making it well suited for controlled evaluation of haze severity. Rather than reporting mean average precision across all categories, Table 4 presents class-wise precision scores to enable a more fine-grained analysis. As expected, detection performance is highest on clean images and degrades progressively with increasing haze density. Notably, SpineNet consistently outperforms all other evaluated detectors across all haze levels, including the clean setting. Even when all models are pretrained on COCO, SpineNet achieves an average gain of approximately 13% over the competing architectures on clean images. This performance gap persists under increasing haze, indicating that the scale-permuted backbone of SpineNet provides superior generalization not only in the presence of image degradation but also across broader out-of-distribution scenarios. The class-wise results further demonstrate that this improvement is consistent across all object categories, rather than being driven by a particular class. In contrast, the remaining detectors exhibit significantly poorer generalization, with pronounced performance degradation as haze intensity increases.

Based on these findings, we can conclude that the SpineNet backbone is the most suitable architectural choice for the subsequent experiments. Moreover, given the marginal gains observed when incorporating explicit enhancement steps—and the additional computational cost that they introduce—we find that prior image enhancement is not essential when the detector backbone itself exhibits strong robustness and generalization capabilities.

4. Dual-Teacher Framework

Based on the previous observations that the enhancement pipeline does not contribute significantly to the detection gains, we used an enhancement prior in which the student directly learns feature maps generated by the detection teacher $T_{d}$ using images denoised with an enhancement teacher $T_{e}$.

4.1. Enhancement Teacher

To reduce reliance on an explicit dehazing step, which often does not yield significant performance improvement, we formulate the distillation problem to jointly learn dehazing-aware and detection-oriented feature representations. We denote the teacher network features extracted from hazy images $I^{h}$ as $F_{t}^{h}$, and the corresponding student features learned from them as $F_{s}^{h}$. When a dehazed image $I^{c}$ is used for detection, clean-domain features $F_{t}^{c}$ are generated by the teacher and used to supervise the learning of student features $F_{s}^{c}$. Conventional approaches typically aim to train a student network either to learn hazy-domain features $F_{s}^{h}$ from teacher features $F_{t}^{h}$ using $I^{h}$ directly, without any preprocessing, or to learn clean-domain features $F_{s}^{c}$ from $F_{t}^{c}$ by first applying an enhancement network $T_{e}$ to generate a pseudo-clean image $I^{c}$ from $I^{h}$. In contrast, we propose to learn clean-domain student features $F_{s}^{c}$ directly from clean teacher features $F_{t}^{c}$ while using the hazy image $I^{h}$ as input to the student network. As a result, the proposed student architecture is trained to infer clean-image feature representations from degraded inputs under the supervision of clean-domain features produced by a teacher network $T_{d}$ (see Figure 2).
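The data flow of this supervision scheme can be sketched with toy components. The code below is only an illustration under stated assumptions: `toy_backbone`, `toy_enhancer`, and the two-parameter `student` are hypothetical stand-ins for $T_{d}$, $T_{e}$, and the student network, and the mean-squared feature loss is an assumed choice rather than the loss used in this work.

```python
import numpy as np


def pool(img: np.ndarray, stride: int) -> np.ndarray:
    """Average-pool an (H, W) array with a square window of size `stride`."""
    h, w = img.shape
    return img.reshape(h // stride, stride, w // stride, stride).mean(axis=(1, 3))


def toy_backbone(img):
    """Hypothetical stand-in for a detection backbone with two feature levels."""
    return [pool(img, 2), pool(img, 4)]


def toy_enhancer(hazy):
    """Hypothetical stand-in for T_e: inverts the synthetic haze used below."""
    return 2.0 * hazy - 1.0


def student(hazy, gain, bias):
    """Toy student whose two parameters can absorb the dehazing into its features."""
    return toy_backbone(gain * hazy + bias)


def feature_distillation_loss(student_feats, teacher_feats) -> float:
    """Mean squared error between feature maps, averaged over levels."""
    per_level = [np.mean((f_s - f_t) ** 2) for f_s, f_t in zip(student_feats, teacher_feats)]
    return float(np.mean(per_level))


rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, size=(8, 8))
hazy = 0.5 * clean + 0.5                       # I^h, a toy degradation

# Teacher path (training only): I^h -> T_e -> pseudo-clean I^c -> T_d -> F_t^c.
teacher_feats = toy_backbone(toy_enhancer(hazy))

# Student path: the hazy image goes in directly, with no enhancement step.
loss_untrained = feature_distillation_loss(student(hazy, 1.0, 0.0), teacher_feats)
loss_ideal = feature_distillation_loss(student(hazy, 2.0, -1.0), teacher_feats)
```

In this toy setting the loss vanishes once the student's parameters reproduce the effect of the enhancer, which mirrors the intent of the scheme: the enhancement cost is paid only on the teacher side during training, never at student inference.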

An unsupervised degradation removal model $T_{e}$ is employed exclusively at the teacher level, enabling more robust feature supervision while improving generalization to out-of-distribution degradations without introducing additional computational overhead during student inference. To support this framework, we exploit a CycleGAN-based training strategy with separate encoders and decoders for each image domain (i.e., clean, hazy, and rainy). In addition to the clean and hazy domains, we also consider a rainy domain to demonstrate that the proposed approach generalizes naturally to multiple types of image degradation (see Figure 3).

Supervised training on synthetic datasets often leads to overfitting to artificial noise patterns, resulting in poor generalization to real-world hazy or otherwise degraded scenes. In contrast, the adversarial loss employed in our GAN framework promotes better generalization, while geometric consistency and contextual losses further ensure the generation of visually realistic images that preserve scene structure without overfitting to synthetic artifacts.

Hereafter, we denote the generated clean image as $I_{gen}^{c}$. An enhanced image $I_{gen}^{c}$ can thus be obtained from a degraded image $I^{x}$ using

$$I_{gen}^{c} = D_{c}(E_{x}(I^{x})), \qquad \begin{cases} x = h, & \text{if } \mathrm{Class}(I^{x}) \in \text{hazy } (h) \\ x = r, & \text{if } \mathrm{Class}(I^{x}) \in \text{rainy } (r) \end{cases} \tag{1}$$

where the encoder $E_{x}$ encodes a degraded image of domain $x$, and the decoder $D_{c}$ generates a clean image from the encoded features (the decoder $D_{x}$ analogously generates a degraded image). Note that, although we primarily focus on dehazing tasks, this mixed training strategy across multiple domains can also be employed when the same enhancement teacher $T_{e}$ is used for both dehazing and deraining tasks.

In addition to the cycle reconstruction loss $(R_{x}, R_{c})$ and the generator loss $(GL)$, a geometric consistency loss $(GC_{x})$ and a latent loss $(L_{x}, L_{c})$ are employed during training. The geometric consistency loss measures a feature-level discrepancy between $I'$ (i.e., a generated clean or degraded image) and $I''$, where $I''$ denotes the enhanced image obtained from a geometrically rotated version of $I^{x}$. The operator $G(\cdot)$ represents the inverse geometric transformation applied to $I''$, enabling the computation of the loss. This loss constrains the network to prevent the generation of false structures in the output by enforcing consistency between features extracted from a degraded image and those extracted from its geometrically transformed counterpart. The latent loss encourages both encoders to learn consistent feature representations from the original image pairs $(I^{x}, I^{c})$ and their corresponding generated counterparts $(I_{gen}^{c}, I_{gen}^{x})$. Furthermore, cycle consistency losses are considered for the hazy → clean and rainy → clean mappings, as no image translation between the hazy and rainy domains is required. The overall loss function is defined by the equations presented below:

$$L = R_{x} + R_{c} + GC_{x} + L_{x} + L_{c} + GL \tag{2}$$

$$R_{x} = \| I^{x} - I_{rec}^{x} \|_{L1}, \quad I_{rec}^{x} = D_{x}(E_{c}(I_{gen}^{c})) \quad \text{and} \quad R_{c} = \| I^{c} - I_{rec}^{c} \|_{L1}, \quad I_{rec}^{c} = D_{c}(E_{x}(I_{gen}^{x})) \tag{3}$$

$$L_{x} = \| E_{x}(I^{x}) - E_{c}(I_{gen}^{c}) \|_{L1} \quad \text{and} \quad L_{c} = \| E_{c}(I^{c}) - E_{x}(I_{gen}^{x}) \|_{L1} \tag{4}$$

$$GC_{x} = \| I' - G(I'') \|_{L1} + \| V(I') - V(G(I'')) \|_{L2} \tag{5}$$

The geometric consistency loss can be computed for both domains using either an image-based L1 loss or a feature-based L2 loss. However, in the current implementation, it is applied to a single domain. The feature-based loss can be calculated using any suitable feature extraction method; VGG features, denoted above as $V(\cdot)$, are a commonly adopted choice for this purpose. This mixed training framework extends H2CGAN [44] to multiple degraded domains, rather than being restricted to two. The architecture of the encoders and decoders is adapted from [45]. The teacher model is trained using both hazy and rainy datasets and is subsequently used to generate pseudo-clean images for computing the teacher feature maps $F_{t}^{c}$. Since this procedure is performed exclusively during student model training, it does not affect inference at the edge.
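The image-level term of the geometric consistency loss in Equation (5) can be illustrated with a minimal sketch. It assumes a 90-degree rotation as the geometric transformation, a mean absolute difference as the L1-type distance, and two hypothetical toy functions (`pointwise_enhancer`, `streaky_enhancer`) standing in for the enhancement network $D_{c}(E_{x}(\cdot))$; the VGG feature term is omitted. None of these choices is taken from the authors' implementation.

```python
def rotate90(img):
    """Forward rotation r(.): rotate a 2-D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate90_inverse(img):
    """Inverse rotation G(.): undo rotate90 (90 degrees counter-clockwise)."""
    return [list(row) for row in zip(*img)][::-1]


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized grids (an L1-type distance)."""
    total = sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


def geometric_consistency(enhance, img):
    """Image-level term of Eq. (5): distance between I' and G(I'')."""
    i_prime = enhance(img)                 # I'  : enhanced original image
    i_second = enhance(rotate90(img))      # I'' : enhanced rotated image
    return mean_abs_diff(i_prime, rotate90_inverse(i_second))


# Hypothetical stand-ins for the enhancement network (assumptions for illustration).
def pointwise_enhancer(img):
    """Acts on each pixel independently, so it commutes with rotation."""
    return [[0.5 * v + 0.1 for v in row] for row in img]


def streaky_enhancer(img):
    """Adds a left-to-right ramp, i.e. a false structure tied to image orientation."""
    return [[v + 0.1 * col for col, v in enumerate(row)] for row in img]


image = [[0.2, 0.4], [0.6, 0.8]]
gc_pointwise = geometric_consistency(pointwise_enhancer, image)
gc_streaky = geometric_consistency(streaky_enhancer, image)
```

In this toy setting the pointwise enhancer incurs no penalty, while the enhancer that injects an orientation-dependent ramp is penalized, which is the behavior the loss is meant to encourage.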

4.2. Scale-Permuted Distillation Network

In contrast to scale-decreased backbone architectures, where the feature resolution is uniformly reduced with depth, a scale-permuted architecture enables the extraction of multi-scale features, thereby improving spatial localization. In scale-decreased backbones, the progressive reduction in feature resolution leads to insufficient preservation of spatial information, causing deeper layers to become increasingly less localized. The flexibility of the scale-permuted architecture to either increase or decrease the feature resolution at any network depth, together with unconstrained feature fusion across multiple scales, enhances the quality of the generated feature representations and results in improved detection performance.

The scale-permuted backbone, illustrated in Figure 4, consists of two stem layers (the bottom two blue blocks). All remaining blocks are associated with specific feature levels, and blocks belonging to the same level share an identical architecture (i.e., the same feature dimensionality), denoted by the same color. In total, there are six feature levels, ranging from $L_{2}$ to $L_{7}$, comprising 1, 2, 4, 4, 2, and 2 blocks, respectively. For instance, the $L_{2}$ level corresponds to the ash-colored block (third from the bottom), while the $L_{3}$ blocks are shown in yellow (layers 5 and 14). Similarly, blocks at levels $L_{4}$, $L_{5}$, $L_{6}$, and $L_{7}$ are depicted in green, light gray, magenta, and pink, respectively. As observed in the figure, blocks from different levels are interleaved throughout the network depth, enabling an optimized pattern of feature-resolution changes along with effective multi-scale feature fusion. Layers 13–17, corresponding to blocks from $L_{4}$, $L_{3}$, $L_{5}$, $L_{7}$, and $L_{6}$, respectively, constitute the backbone outputs and are forwarded to the detection unit.

The numbers of filters across feature levels for the teacher model are set to 64, 83, 164, 166, 332, and 664, which are reduced to 20, 64, 41, 48, 83, and 96 in the student model. In the teacher network, all diamond-shaped blocks represent residual blocks composed of two convolutional layers, whereas rectangular blocks denote bottleneck blocks with three convolutional layers. Instead of employing bottleneck blocks, the student model adopts transfer blocks to reduce the parameter count and accelerate convergence. These transfer blocks consist of two stacked $1 \times 1$ convolutional layers and serve to learn linear projections across channels, effectively behaving as lightweight multi-layer perceptrons in the channel dimension. This design emphasizes channel-wise knowledge transfer rather than relearning spatial features, preserving the dominant feature subspace while implicitly performing a low-rank factorization in channel space. Further architectural details of the teacher and the two student networks considered in this work are provided in Table 5.
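The block layout described above can be checked with a short bookkeeping sketch. The block counts and output levels are taken from the text; the mapping from a level to a spatial size assumes the usual feature-pyramid convention that level $L_{k}$ has a stride of $2^{k}$, and the 320-pixel input is used only as an example. Both are assumptions rather than details confirmed in this section.

```python
import math

STEM_LAYERS = 2
# Blocks per feature level L2..L7, as listed in the text.
BLOCKS_PER_LEVEL = {2: 1, 3: 2, 4: 4, 5: 4, 6: 2, 7: 2}
# Levels of the five output blocks (layers 13-17), in the order given in the text.
OUTPUT_LEVELS = [4, 3, 5, 7, 6]


def total_layers():
    """Stem layers plus all level blocks."""
    return STEM_LAYERS + sum(BLOCKS_PER_LEVEL.values())


def feature_size(input_size, level):
    """Spatial size at a level, ASSUMING a stride of 2**level (feature-pyramid convention)."""
    return math.ceil(input_size / 2 ** level)


# Spatial sizes of the five backbone outputs for an example 320x320 input.
output_sizes = {f"L{lvl}": feature_size(320, lvl) for lvl in OUTPUT_LEVELS}
```

The total of 17 layers agrees with the statement that layers 13–17 are the last five blocks of the backbone.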

A multilevel, feature-based distillation strategy is employed using all five feature map outputs from layers 13–17. We denote the feature-extraction backbone of the teacher and student networks as $(f_{t}(x), θ_{t})$ and $(f_{s}(x), θ_{s})$, respectively, for an input $x$, where $θ_{t}$ and $θ_{s}$ represent the learned network parameters. A global feature loss $l_{f}$, computed between the teacher and student feature embeddings $F_{t}$ and $F_{s}$, is used to facilitate the distillation of the teacher’s representations. While this approach is effective when applied across multiple feature levels, a single global loss enforces overall similarity and may introduce imbalance across feature scales. In particular, feature maps with higher spatial resolution may dominate the loss, causing contributions from lower-resolution feature maps to be underemphasized. To mitigate this effect, we introduce a weighting factor $λ$ across feature levels to reduce the dominance of any particular level in the overall distillation loss. An alternative strategy is to separately model foreground and background losses, as increased separation between foreground and background features can improve the precision of the detection head. However, since our objective function is formulated to compute losses exclusively over feature representations, without reliance on detection ground-truth annotations, we instead adopt a teacher trajectory loss $l_{t}$. This loss encourages the student to learn the teacher’s reasoning process, resulting in a structured distillation framework in which the student can infer which regions of the feature maps are most salient to the teacher.

l_{f} = \sum_{i=1}^{5} λ_{i} \| F_{t}^{i} - F_{s}^{i} \|

(6)

l_{t} = \sum_{i=1}^{5} \| \hat{A}_{t}^{i} - \hat{A}_{s}^{i} \|, \qquad A_{z} = \sum^{c} (F_{z})^{2}, \qquad \hat{A}_{z} = \frac{A_{z}}{\| A_{z} \|_{2}}, \qquad z \in \{s, t\}

(7)

The activation map $\hat{A}_{z}$ provides a scale-invariant heatmap computed across channels at each spatial location, thereby emphasizing high-magnitude responses from the teacher network. The total loss $L$ is defined as a weighted combination of the global feature loss $l_{f}$ and the teacher trajectory loss $l_{t}$. Experimental results demonstrate that guiding the student to learn discriminative regions (corresponding to high-activation responses) via $l_{t}$, while anchoring this process with the feature loss $l_{f}$, yields superior performance compared to approaches that rely on global feature losses combined with structural similarity losses such as SSIM.

L = l_{f} + ρ l_{t}

(8)
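Equations (6) to (8) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: feature maps are nested lists of shape [C][H][W], teacher and student features are assumed to have matching shapes at each level (how channel counts are aligned is not specified here), and the L2 norm is used for both losses, as in Algorithm 1. The helper names are hypothetical.

```python
import math


def flatten(feat):
    """Flatten a [C][H][W] feature map into a flat list."""
    return [v for channel in feat for row in channel for v in row]


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def attention_map(feat):
    """Eq. (7): sum of squared activations over channels, then L2-normalised."""
    height, width = len(feat[0]), len(feat[0][0])
    a = [sum(ch[y][x] ** 2 for ch in feat) for y in range(height) for x in range(width)]
    norm = math.sqrt(sum(v * v for v in a))
    return [v / norm for v in a] if norm > 0 else a


def distillation_loss(teacher_feats, student_feats, lambdas, rho):
    """Return (L, l_f, l_t) following Eqs. (6)-(8) over the feature levels."""
    l_f = sum(lam * l2_distance(flatten(ft), flatten(fs))
              for lam, ft, fs in zip(lambdas, teacher_feats, student_feats))
    l_t = sum(l2_distance(attention_map(ft), attention_map(fs))
              for ft, fs in zip(teacher_feats, student_feats))
    return l_f + rho * l_t, l_f, l_t


# Toy example: one level, 2 channels, 2x2 spatial grid (made-up values).
teacher = [[[[1.0, 0.0], [0.5, 2.0]], [[0.0, 1.0], [0.5, 0.0]]]]
student = [[[[2.0 * v for v in row] for row in ch] for ch in teacher[0]]]  # teacher scaled by 2
total, l_f, l_t = distillation_loss(teacher, student, lambdas=[1.0], rho=0.1)
```

In the toy example the student features are a scaled copy of the teacher features, so the normalised attention maps coincide and $l_{t}$ vanishes while $l_{f}$ does not. This shows why the two terms carry complementary information.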

In this setting, the student network is trained to fulfill two objectives: learning point-to-point feature differences, guided by the feature loss $l_{f}$; and learning the most discriminative regions in the image, guided by the teacher trajectory loss $l_{t}$. To avoid exploding gradients, we schedule the weighting parameter $ρ$ such that, during the initial training epochs, the network places greater emphasis on the feature loss. As training progresses and the feature loss begins to stabilize, the value of $ρ$ is gradually increased. Training is initiated with a learning rate of 0.001 and $ρ = 0.1$. The value of $ρ$ is subsequently increased to 0.5 and then reduced to 0.3 toward the final epochs. The feature-level weighting coefficients $λ$ are set to $\{1.2, 1.5, 1.0, 1.5, 2.0\}$, ordered from the highest- to the lowest-resolution feature maps. These values are determined empirically. The overall procedure for training and inference is summarized in Algorithm 1.

We observed that the coefficients $λ$ corresponding to feature levels L3–L6 must be maintained within the range of 1 to 1.5. In particular, increasing $λ$ for higher-resolution feature maps (i.e., L3 and L4) beyond the range of 1.2–1.5 leads to a reduction in average precision, especially for larger objects such as cars and buses. This degradation is primarily attributable to the dominance of high-resolution feature losses during optimization. Furthermore, initializing training with a higher value of $ρ$ results in slower convergence. A lower initial value of $ρ$ ensures that the student network prioritizes stabilization of the overall loss via $l_{f}$ before progressively focusing on discriminative regions through $l_{t}$. Additionally, assigning higher level-wise weighting coefficients to $l_{t}$ significantly destabilizes training, leading to a substantial drop in detection precision across all classes. Finally, variations in batch size were found to have negligible impact on detection performance. Specifically, batch sizes of 2, 4, and 8 exhibited similar convergence behavior.
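One possible way to realize the schedule for the weighting parameter is sketched below. The three values (0.1, 0.5, and 0.3) are taken from the text, but the epoch boundaries are not stated there, so the `warmup_frac` and `decay_frac` parameters and the linear ramp are purely hypothetical choices.

```python
def rho_schedule(epoch, total_epochs, warmup_frac=0.2, decay_frac=0.9):
    """Hold 0.1 early, ramp linearly towards 0.5, then drop to 0.3 near the end.

    warmup_frac and decay_frac are hypothetical; the text gives no epoch boundaries.
    """
    progress = epoch / max(total_epochs - 1, 1)
    if progress < warmup_frac:
        return 0.1          # early epochs: emphasis on the feature loss l_f
    if progress < decay_frac:
        # gradual increase once the feature loss has stabilised
        return 0.1 + 0.4 * (progress - warmup_frac) / (decay_frac - warmup_frac)
    return 0.3              # reduced value for the final epochs


# Example: the value of rho for each epoch of a 100-epoch run.
schedule = [rho_schedule(e, 100) for e in range(100)]
```

A stepwise schedule would satisfy the description equally well; the linear ramp is used here only because the text says the value is increased gradually.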

We adopt a weight-compression-based initialization strategy [46] to initialize the student model. Specifically, a two-step singular value decomposition (SVD) is applied to the weight matrices of layers in which the channel dimensionality is reduced. Let $W_{s}^{l} \in \mathbb{R}^{h \times w \times c_{in} \times c_{out}}$ denote the factorized weight tensor of the $l$-th layer of the student network. This tensor is derived from the corresponding teacher weight tensor $W_{t}^{l} \in \mathbb{R}^{h \times w \times in \times out}$ of Teacher A (Spine49S). The decomposition proceeds by first reshaping $W_{t}^{l}$ into a two-dimensional matrix in $\mathbb{R}^{in \times (w \cdot h \cdot out)}$ and extracting the top $c_{in}$ principal components. The resulting intermediate weight matrix $W_{i}$ is then reshaped into $\mathbb{R}^{(w \cdot h \cdot c_{in}) \times out}$, from which the top $c_{out}$ principal components are selected. This initialization procedure is performed once for each eligible layer before training begins.

Using this approach, the feature extractor is reduced from 9.53 M parameters to 2.1 M parameters, corresponding to an approximate reduction of 77.9%. The resulting network, denoted as Student 1A, is initially trained for 30 epochs. Subsequently, the trained model is fine-tuned using target features from a second teacher, Teacher B (Spine96), which comprises 36.2 M parameters. Relative to Teacher B, this corresponds to an overall parameter reduction of 94.2%, and the resulting model is referred to as Student 1B. It should be noted that the feature extractors of both student models contain the same number of parameters. However, the detection head differs in size, with 2.07 M parameters for Student 1A and 6.48 M parameters for Student 1B. When combined with their respective detection heads, these models are referred to as Student 1Ad and Student 1Bd, with total parameter counts of 4.17 M and 8.58 M, respectively. In addition, we designed a smaller backbone architecture, denoted as Student 2, which yields a detection model referred to as Student 2Ad with a total parameter count of 3.67 M.
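The reshaping steps of the two-step reduction and the parameter figures quoted above can be traced with a small bookkeeping sketch. It only tracks matrix shapes and parameter counts and performs no SVD; the example layer (a 3×3 convolution reduced from 64→128 to 16→32 channels) is hypothetical and does not correspond to a layer of the actual networks.

```python
def two_step_shapes(h, w, in_t, out_t, c_in, c_out):
    """Trace matrix shapes through the two-step reduction (shapes only, no numerics)."""
    return {
        "teacher_2d": (in_t, w * h * out_t),      # teacher tensor reshaped to 2-D
        "after_step1": (c_in, w * h * out_t),     # top c_in components kept
        "intermediate": (w * h * c_in, out_t),    # intermediate matrix W_i reshaped
        "after_step2": (w * h * c_in, c_out),     # top c_out components kept
        "student_4d": (h, w, c_in, c_out),        # final student weight tensor
    }


def conv_params(h, w, c_in, c_out):
    """Number of weights in an h x w convolution (bias ignored)."""
    return h * w * c_in * c_out


def reduction_percent(before_m, after_m):
    """Relative parameter reduction in percent."""
    return 100.0 * (1.0 - after_m / before_m)


# Hypothetical layer: 3x3 convolution, 64->128 channels reduced to 16->32.
shapes = two_step_shapes(3, 3, 64, 128, 16, 32)
layer_ratio = conv_params(3, 3, 64, 128) / conv_params(3, 3, 16, 32)

# Figures from the text (in millions of parameters).
vs_teacher_a = reduction_percent(9.53, 2.1)
vs_teacher_b = reduction_percent(36.2, 2.1)
student_1ad_total = round(2.1 + 2.07, 2)
student_1bd_total = round(2.1 + 6.48, 2)
```

The computed reductions are close to the reported 77.9% and 94.2%; small deviations are expected because the parameter counts in the text are rounded.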

Algorithm 1 Dehazing + Distillation Training

1: Input: Hazy domain $I^{h}$, clear domain $I^{c}$, hyperparameters $λ_{cycle}, ρ, λ_{i}$
2: Networks: Encoders $E_{x}, E_{c}$; decoders $D_{x}, D_{c}$; teacher $T_{d}(f_{t}, head_{t})$; student $S_{d}(f_{s}, head_{t})$; pretrained VGG layer 5 $V$
3: Output: Generated clean $I_{gen}^{c}$ (i.e., $I^{'}$); generated hazy $I_{gen}^{h}$; features $F_{s}$ from $S_{d}$
4: for each $\langle I^{h}, I^{c} \rangle$ do
5:     $z_{h} \leftarrow E_{x}(I^{h})$
6:     $I_{gen}^{c} \leftarrow D_{c}(z_{h})$
7:     $I^{''} \leftarrow D_{c}(E_{x}(r(I^{h})))$, where $r(\cdot)$ is the forward rotation function
8:     $I^{'} \leftarrow I_{gen}^{c}$
9:     $z_{c}^{'} \leftarrow E_{c}(I_{gen}^{c})$
10:     $I_{rec}^{h} \leftarrow D_{x}(z_{c}^{'})$
11:     $L_{x} \leftarrow \| z_{c}^{'} - z_{h} \|_{1}$
12:     $R_{x} \leftarrow \| I_{rec}^{h} - I^{h} \|_{1}$
13:     $GC_{x} \leftarrow \| I^{'} - G(I^{''}) \|_{1} + \| V(I^{'}) - V(G(I^{''})) \|_{2}$, where $G(\cdot)$ is the inverse rotation function
14:     $z_{c} \leftarrow E_{c}(I^{c})$
15:     $I_{gen}^{h} \leftarrow D_{x}(z_{c})$
16:     $z_{h}^{'} \leftarrow E_{x}(I_{gen}^{h})$
17:     $I_{rec}^{c} \leftarrow D_{c}(z_{h}^{'})$
18:     $L_{c} \leftarrow \| z_{h}^{'} - z_{c} \|_{1}$
19:     $R_{c} \leftarrow \| I_{rec}^{c} - I^{c} \|_{1}$
20: end for
21: $L_{cycle} \leftarrow λ_{cycle} \cdot (R_{x} + R_{c})$
22: $GL \leftarrow Adv(I_{gen}^{c}, I^{c}) + Adv(I_{gen}^{h}, I^{h})$
23: $L \leftarrow L_{cycle} + GL + L_{x} + L_{c} + GC_{x}$
24: Update $θ_{E_{x}}, θ_{D_{x}}, θ_{E_{c}}, θ_{D_{c}}$ using $\nabla_{θ} L$
25: for each $\langle I_{gen}^{c}, I^{h} \rangle$ do
26:     $F_{t} \leftarrow f_{t}(I_{gen}^{c})$
27:     $F_{s} \leftarrow f_{s}(I^{h})$
28:     $l_{f} \leftarrow \sum_{i=1}^{5} λ_{i} \| F_{s}^{i} - F_{t}^{i} \|_{2}$, where $i$ denotes the feature levels
29:     $l_{t} \leftarrow \sum_{i=1}^{5} \| \hat{A}_{s}^{i} - \hat{A}_{t}^{i} \|_{2}$, with $\hat{A}_{z} \leftarrow \frac{\sum^{c} F_{z}^{2}}{\| \sum^{c} F_{z}^{2} \|_{2}}$, $z \in \{s, t\}$
30:     $L \leftarrow l_{f} + ρ l_{t}$
31:     Update $θ_{f_{s}}$ using $\nabla_{θ_{f_{s}}} L$
32: end for
33: Detection for input $I^{h}$: $head_{t}(f_{s}(I^{h}))$

4.3. The FPGA Implementation

To evaluate the impact of the distilled architecture on FPGA hardware, we translated the student model into register-transfer level (RTL) code using the hls4ml framework. This tool converts a trained software model into a high-level synthesis (HLS) project, from which the corresponding hardware design can be generated. Table 6 summarizes the FPGA resource utilization for both 8-bit and 16-bit fixed-point implementations on a Xilinx Zynq UltraScale+ MPSoC ZCU102 (XCZU9EG-FFVB1156-2-E). The 16-bit implementation exhibits a substantial increase in resource utilization compared to the 8-bit design, particularly in DSP blocks and BRAM, with utilizations of 65% and 63%, respectively. In contrast, the 8-bit configuration uses no DSP resources and only 34% of available BRAM. Flip-flop utilization increases from 40% in the 8-bit design to 60% in the 16-bit design, while LUT usage remains constant at 4% for both configurations. Overall, these results indicate that the proposed compression strategy yields a compact model that does not impose excessive demands on FPGA resources. The increased DSP utilization in the 16-bit implementation reflects higher parallelism relative to the 8-bit design, which relies exclusively on the general-purpose logic fabric. This architectural choice enables higher throughput at the expense of resource efficiency. Furthermore, the reuse factor in the 16-bit design is lower than that of the 8-bit design, resulting in reduced latency. The absence of DSP usage in the 8-bit configuration indicates a higher degree of resource reuse within the logic fabric. This behavior is due to operator mapping performed by hls4ml under aggressive reuse settings.
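The difference between the 8-bit and 16-bit fixed-point representations can be illustrated with a small quantization sketch. The split between integer and fractional bits is not given in the text, so the formats used below (8 bits with 3 integer bits, 16 bits with 6 integer bits) are hypothetical, as are the example weights. The sketch also assumes round-to-nearest with saturation, which may differ from the rounding and overflow modes configured in the actual HLS design.

```python
def quantize_fixed(x, total_bits, int_bits):
    """Quantize x to a signed fixed-point grid with `total_bits` bits.

    `int_bits` counts the integer bits including the sign bit. Round-to-nearest
    and saturation are simplifying assumptions made for this sketch.
    """
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits                       # smallest representable increment
    lowest = -(2.0 ** (int_bits - 1))              # most negative representable value
    highest = 2.0 ** (int_bits - 1) - step         # most positive representable value
    return min(max(round(x / step) * step, lowest), highest)


weights = [0.7312, -0.0421, 0.0057]                # made-up values for illustration
q8 = [quantize_fixed(w, 8, 3) for w in weights]    # hypothetical 8-bit format
q16 = [quantize_fixed(w, 16, 6) for w in weights]  # hypothetical 16-bit format
err8 = max(abs(w - q) for w, q in zip(weights, q8))
err16 = max(abs(w - q) for w, q in zip(weights, q16))
```

Within the representable range, the worst-case rounding error is half a quantization step, so the wider format reduces the error at the cost of wider multipliers and storage.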

The observed 20-percentage-point increase in flip-flop utilization (from 40% to 60%) for the 16-bit design indicates deeper pipelining, which improves latency and throughput by overlapping operations. The identical LUT utilization across both configurations suggests that the increased pipeline depth does not introduce additional combinational logic complexity. Moreover, the 16-bit design relies more heavily on on-chip memory buffers, whereas the 8-bit design primarily streams data from off-chip memory, thereby conserving on-chip resources. These results highlight a trade-off on the Zynq UltraScale+ MPSoC: the 16-bit design favors higher throughput through increased local storage, while the 8-bit configuration prioritizes resource efficiency and external memory usage.

Regarding performance, the HLS kernel processes a $320 \times 320$ input frame in 4.2 ms at a clock frequency of 100 MHz on the ZCU102 platform, corresponding to a theoretical throughput of approximately 238 FPS. When accounting for DMA transfers of the full-resolution $2048 \times 1024$ input frame (approximately 6 MB, with a transfer latency of ∼6 ms), the estimated end-to-end latency increases to approximately 10 ms, yielding an effective throughput of around 100 FPS. Power consumption was estimated using the Xilinx Power Estimator (XPE) v2023.2, targeting the XCZU9EG-FFVB1156-2-E device operating at 100 MHz with an assumed toggle rate of 5%. The 16-bit fixed-point implementation exhibited an estimated total power consumption of 1.8 W (including both programmable logic and processing system), while the 8-bit implementation reduced the estimated power consumption to 1.1 W, a reduction by a factor of about 1.6. Based on the kernel inference latency of 4.2 ms, the resulting energy per inference is 7.6 mJ and 4.6 mJ for the 16-bit and 8-bit designs, respectively. Both implementations maintain a junction temperature below 27 °C, with a thermal margin exceeding 73 °C, confirming safe operation on the ZCU102 evaluation board. It should be noted that the reported throughput figures are derived from HLS-level synthesis estimates and do not yet account for post-route timing closure or additional overhead introduced by the complete software stack.
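The throughput and energy figures above follow from simple arithmetic, reproduced in the sketch below. The latency and power values are taken from the text; the frame size calculation assumes 3 bytes per pixel (8-bit RGB), which is an assumption and is not stated in the text.

```python
KERNEL_LATENCY_MS = 4.2      # HLS kernel latency per 320x320 frame
DMA_LATENCY_MS = 6.0         # approximate transfer time of the full-resolution frame
POWER_W = {"16-bit": 1.8, "8-bit": 1.1}   # estimated total power


def fps(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


def energy_mj(power_w, latency_ms):
    """Energy per inference: watts times milliseconds gives millijoules."""
    return power_w * latency_ms


kernel_fps = fps(KERNEL_LATENCY_MS)
end_to_end_fps = fps(KERNEL_LATENCY_MS + DMA_LATENCY_MS)
frame_megabytes = 2048 * 1024 * 3 / 1e6    # ASSUMES 3 bytes per pixel (8-bit RGB)
energy = {name: energy_mj(p, KERNEL_LATENCY_MS) for name, p in POWER_W.items()}
power_ratio = POWER_W["16-bit"] / POWER_W["8-bit"]
```

With the 6 ms transfer included, the end-to-end rate comes out slightly below 100 FPS, consistent with the rounded figure in the text.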

5. Results and Discussion

Experimental results on hazy, rainy, and low-light images are reported in the subsequent subsections. We further examine the influence of the loss functions and architectural design choices adopted during training.

5.1. Detection on Hazy Scenes

Table 7 reports the performance of the distilled network under two haze levels (0.01 and 0.02) on the Cityscapes validation dataset. The results are presented both with and without the proposed preprocessing-based distillation strategy. Here, $F_{t}^{h}$ denotes feature representations extracted as $(f_{t}(I^{h}), θ_{t})$ from hazy images, while $F_{t}^{c}$ denotes clean-domain features obtained as $(f_{t}(D_{c}(E_{x}(I^{h}))), θ_{t})$. The enhancement teacher $T_{e}$, comprising the encoder–decoder pair $(E_{x}, D_{c})$, is first trained independently using unpaired hazy images (Synthetic Foggy Cityscapes), rainy images (RID), and clean images (original Cityscapes).

The student network was subsequently trained for 100 epochs under multiple configurations, including training with and without mixed-precision (FP16) arithmetic, as well as with and without pruning (denoted as P). From these experiments, we derived the following key observations:

For both haze levels, the distilled student consistently achieves slightly better performance than the teacher network.
Our hypothesis that using clean target feature maps during detection (i.e., $\langle I^{h}, F_{t}^{c} \rangle$) yields improved performance compared to using hazy feature maps (i.e., $\langle I^{h}, F_{t}^{h} \rangle$) is validated. This demonstrates that the explicit image enhancement step can be omitted, while still achieving improved detection performance through joint learning of clean-domain features from hazy inputs.
For the lower haze level (0.01), mixed-precision (FP16) training combined with pruning results in notably improved detection precision. At higher haze levels, however, configurations employing pruning and FP16 weights may require additional training epochs to achieve stable convergence.

Finally, in Table 8, we compare the performance of Student 1Bd against other state-of-the-art detection architectures. The Student 1Bd network achieves performance comparable to that of SpineNet-49S and YOLOv10-M while employing a significantly reduced number of parameters. Furthermore, the Student 2Ad network has a parameter count comparable to that of YOLOv10-N, yet it delivers substantially better performance and approaches that of YOLOv10-S, which uses approximately twice as many parameters.

5.2. Detection on Rainy Scenes

We further evaluated the performance of the distilled network on rainy scenes, as reported in Table 9. For this evaluation, we used the RIS dataset [47], which consists of 2285 images and encompasses a diverse set of real-world rain conditions with corresponding object annotations. The RIS dataset includes a mixture of clean images captured after rainfall, foggy rain scenes, and images affected by visible rain streaks, making it a suitable benchmark for assessing robustness under various rain-induced degradations.

Table 9 confirms that the proposed distilled models demonstrate strong robustness even on real rainy traffic scenes. The Student 1Bd model achieves the highest mAP of 21.40, outperforming both YOLO26-S and SpineNet-49S despite using fewer parameters. The more compact Student 2Ad model (3.67 M parameters) gives a comparable performance of 21.2 mAP, with a significant gain of 3% with respect to YOLO26-S, which has almost 3× the parameter count.

These results highlight the effectiveness of the proposed distillation strategy in learning degradation-robust features that generalize well to real-world rain conditions, including fog–rain and rain-streak artifacts. The strong performance of Student 2Ad further underscores the suitability of the proposed approach for resource-constrained deployment scenarios, where maintaining competitive detection accuracy with minimal model complexity is critical.

5.3. Detection on Low-Light Scenes

For low-light scenarios, we evaluated object detection performance using night-time images from the BDD100K dataset. A total of 5735 night images were selected for this evaluation. As reported in Table 10, the proposed 8.58 M-parameter model achieves an improvement of more than 4% mAP compared to YOLO26-S, although it does not reach the mAP of SpineNet-49S. This result further demonstrates the robustness of the distilled architecture under challenging illumination conditions, highlighting its ability to generalize effectively across diverse degradation types without increasing model complexity.

5.4. Ablation

In this subsection, we analyze the impact of different distillation strategies, loss functions, and training configurations. In Table 11, models denoted as T143-X correspond to detection performance obtained using SpineNet-143 (66.9 M parameters) on the Cityscapes validation set, where $X \in \{\text{Hazy (H)}, \text{Processed (P)}, \text{Clean (C)}\}$. All student variants V1–V5 consist of 28.7 M parameters. Student V1 is trained using only a mean squared error (MSE) feature loss $(l_{f})$. Student V2 employs a feature-based SSIM loss, while Student V3 utilizes a combination of both MSE and SSIM losses. These three variants correspond to single-teacher distillation, where training is performed using hazy images and hazy-domain feature maps extracted from a detection teacher $T_{d}$. Student V4 represents a dual-teacher setting $(T_{e}, T_{d})$, in which both an enhancement teacher and a detection teacher are used. Finally, Student V5 corresponds to our proposed dual-teacher framework incorporating the teacher-reasoning-guided loss.

These experiments are designed to address the following questions:

Does training on hazy images limit inference on clean inputs? Comparisons between S-V1-H-H and S-V1-H-P indicate that the student network is not biased toward hazy inputs and, in fact, achieves improved performance when evaluated on cleaner images.
Can dual distillation from a large teacher outperform training from scratch? We observed that both S-V4 and S-V5 achieve nearly a 9% improvement in mAP compared to a similarly sized model, T49-H (28.31 M parameters), trained from scratch. This demonstrates the effectiveness of dual-teacher distillation using a very large teacher (66.9 M parameters).

Among all configurations, S-V5-H2P-H represents the most practical scenario, where the student learns to detect objects from hazy inputs as if they originated from dehazed images. In contrast, S-V4-P2C-P reflects an idealized case in which both training and inference are performed on clean images. These findings motivate a hierarchical and iterative distillation scheme rather than direct distillation from a large teacher to a compact student. Specifically, a large teacher T143 (66.9 M parameters) first transfers knowledge to an intermediate student S-V4 (28.7 M parameters), which subsequently serves as a teacher for students with fewer than 10 M parameters.

In addition to reporting object detection performance, we provide feature similarity visualizations between different network pairs in Figure 5. Feature representations were extracted from multiple networks to analyze their similarity. To quantify the similarity between two feature sets A and B, we computed the number of spatial locations across the feature maps where A and B differ, and we visualized the distribution using histograms. Feature maps from levels L3–L7 were used in the analysis, and all spatial locations per image were considered. We utilized features from T49, T143, S-V1-H, and S-V4-P2C to examine various scenarios. All features were computed using the Cityscapes validation set. “Hazy 49” and “Hazy 143” denote features extracted from hazy images using T49 and T143, respectively, while “Processed 49” and “Processed 143” correspond to features obtained from enhanced images. “V1 Hazy” and “V1 Processed” denote features from S-V1-H when applied to hazy and processed images, respectively. “V4P2C” represents features from S-V4-P2C. All processed images were generated using $T_{e}$, while “Clean 143” denotes features extracted from clean Cityscapes images using T143. The results indicate a substantial dissimilarity between T49 and T143 in both hazy and processed feature spaces. However, when knowledge is distilled from T143 into the student S-V1-H, the resulting representations exhibit strong similarity to the teacher in both domains. A similar trend is observed for S-V4-P2C, where the student aligns well with the clean feature space of T143, despite starting from hazy images that are subsequently processed via $T_{e}$. Nevertheless, it can be inferred that aligning features between teacher and student in the same input domain (as in S-V1-H) is comparatively easier than the more challenging cross-domain alignment required when matching to clean representations.
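The counting procedure behind the feature similarity histograms can be sketched as follows. The text does not specify when two feature values are considered different, so the tolerance `tol`, the rule that a location differs if any channel differs, the bin width, and the assumption that both feature maps have the same shape are all hypothetical choices made for illustration. The toy data are made up.

```python
from collections import Counter


def differing_locations(feat_a, feat_b, tol=1e-3):
    """Count spatial positions where two [C][H][W] feature maps differ by more than tol."""
    height, width = len(feat_a[0]), len(feat_a[0][0])
    return sum(
        1
        for y in range(height)
        for x in range(width)
        if any(abs(ca[y][x] - cb[y][x]) > tol for ca, cb in zip(feat_a, feat_b))
    )


def histogram(counts, bin_width):
    """Group per-image counts into bins of width bin_width (keyed by bin start)."""
    return Counter((c // bin_width) * bin_width for c in counts)


# Toy data: three "images", each a 1-channel 2x2 feature map from two networks.
net_a = [[[[0.0, 1.0], [2.0, 3.0]]], [[[1.0, 1.0], [1.0, 1.0]]], [[[0.5, 0.5], [0.5, 0.5]]]]
net_b = [[[[0.0, 1.0], [2.0, 3.5]]], [[[1.0, 1.0], [1.0, 1.0]]], [[[0.0, 0.0], [0.0, 0.0]]]]
per_image = [differing_locations(a, b) for a, b in zip(net_a, net_b)]
bins = histogram(per_image, bin_width=2)
```

A distribution concentrated in the lowest bins would indicate that two networks produce similar features for most images, which is how the histograms are read in the text.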

To evaluate the generalization capability of the intermediate student, we conducted cross-dataset experiments. While Cityscapes, which comprises traffic scenes from multiple German cities, was used for training, we evaluated performance on the DriveIndia dataset [48], which contains traffic imagery collected from various cities in India, as shown in Table 12. The observed 1% improvement in mAP over the T49 baseline indicates that the gains achieved through distillation are not biased towards the training dataset.

Finally, to investigate whether the proposed distillation strategy generalizes across detection families, we applied the same framework to the YOLOv10 architecture, using YOLOv10-S as the baseline and YOLOv10-X as the teacher. To align the feature map dimensions between the teacher and student, additional convolutional layers were introduced at the detection head. We distilled the backbone using the same loss function $L$ for 50 epochs. As reported in Table 13, no improvement was observed on BDD night images. Although a 3% mAP gain was achieved on Cityscapes at a haze level of 0.02, the distilled model underperformed compared to YOLOv10-M (15.4 M parameters), which achieved an mAP of 28.75. These results suggest that the effectiveness of the proposed distillation strategy is influenced by architectural compatibility between the teacher and student models.

6. Conclusions and Future Work

In this work, we have demonstrated that a scale-permuted architecture is an effective alternative for improving object detection performance, compared to multi-scale fusion with full-pyramid-network- or attention-based strategies. When scale is decoupled from rigid pyramid structures, the permutable architecture exhibits increased robustness to domain shifts without fine-tuning, making it particularly well suited for object detection in hazy, rainy, and low-light environments.

Building on this architectural choice, we introduced a joint dual-teacher distillation framework that enables the student model to learn dehazed feature representations directly from degraded inputs. The proposed structured distillation allows the student not only to match the teacher outputs but also to capture the teacher’s decision-making process through attention transfer, while implicitly distilling enhancement knowledge without requiring an explicit dehazing stage. The experimental results show that this approach consistently improves generalization across multiple poor-visibility conditions and remains effective even when combined with reduced-precision training and pruning strategies.

We developed two student model variants to support deployment across a range of target platforms and evaluated them on hazy, rainy, and low-light scenes. We have shown that a structured distillation that learns the reasoning path of the teacher performs better across different poor-visibility conditions than distillation based on global feature losses combined with structural similarity losses.

Finally, we demonstrated the practical deployability of the proposed framework by implementing the compressed student models on an FPGA platform. To the best of our knowledge, this represents the first FPGA deployment of a scale-permuted object detection architecture tailored for poor-visibility scenarios. The synthesis and power estimates for the ZCU102 platform indicate that the distilled models can operate with modest computational resources, supporting their suitability for resource-constrained edge devices and the end-to-end design choices of the proposed approach.

A key direction for future work is to further compress the feature extractor to enable deployment on low-cost FPGA devices. Another promising direction is the investigation of continuous refinement and task adaptation in distilled networks, particularly with respect to the phenomenon of loss of plasticity—the degradation of a model’s ability to learn new tasks without compromising previously acquired knowledge. Addressing this challenge may require the development of dynamic reinitialization strategies for underutilized network components, potentially in conjunction with domain-transfer mechanisms, to support sustained learning without performance degradation.

Dehazing and denoising networks generally lead to a loss of feature information, which can adversely affect the operation of downstream detectors. It will be important to investigate how such networks can be designed and tuned to yield more significant detection improvements in the specific context of our dual-teacher setting.

Author Contributions

Conceptualization, J.B.; methodology, J.B., R.M., M.L.C., A.C., S.M. and G.R.; software, J.B. and R.M.; formal analysis, J.B. and R.M.; investigation, J.B. and R.M.; writing—original draft preparation, J.B., R.M., M.L.C., A.C., S.M. and G.R.; funding acquisition, M.L.C. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge partial funding support from DST, India & MAE Italy (INT/Italy/P-33/2022(ER)).

Data Availability Statement

All data and code related to this research can be found at https://github.com/jhilikb/Distillation (accessed on 28 May 2026).

Acknowledgments

A.C., S.M., and G.R. gratefully acknowledge the support of the University of Trieste. R.M. and M.L.C. gratefully acknowledge the support of ICTP for funding the research. The authors also acknowledge the contribution of Aditya Roy Choudhury to this project.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Vaithianathan, M. Real-Time Object Detection and Recognition in FPGA-Based Autonomous Driving Systems. Int. J. Comput. Trends Technol. 2024, 72, 145–152.
2. Sha, X.; Yanagisawa, M.; Shi, Y. An FPGA-Based YOLOv6 Accelerator for High-Throughput and Energy-Efficient Object Detection. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2025, 108, 473–481.
3. Yvinec, E. Efficient Neural Networks: Post Training Pruning and Quantization. Ph.D. Thesis, Sorbonne Université, Paris, France, 2023.
4. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
5. Jocher, G.; Qiu, J. Ultralytics YOLO26, version 26; Ultralytics: Frederick, MD, USA, 2026.
6. Kumar, A.; Chadha, A. From Fog to Failure: The Unintended Consequences of Dehazing on Object Detection in Clear Images. In Proceedings of the ICLR 2025 Workshops: I Can’t Believe It’s Not Better (ICBINB), Singapore, 28 April 2025.
7. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. arXiv 2020, arXiv:1911.09070.
8. Li, C.; Zhou, H.; Liu, Y.; Yang, C.; Xie, Y.; Li, Z.; Zhu, L. Detection-Friendly Dehazing: Object Detection in Real-World Hazy Scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8284–8295.
9. Qin, H.; Lu, X.; Wang, L.; Wang, Y. Foggy-Aware Teacher: An Unsupervised Domain Adaptive Learning Framework for Object Detection in Foggy Scenes. IEEE Robot. Autom. Lett. 2025, 10, 7508–7515.
10. Zhang, D.; Wang, A.; Mo, R.; Wang, D. End-to-end acceleration of the YOLO object detection framework on FPGA-only devices. Neural Comput. Appl. 2024, 36, 1067–1089.
11. Ding, C.; Wang, S.; Liu, N.; Xu, K.; Wang, Y.; Liang, Y. REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 24–26 February 2019; pp. 33–42.
12. Nguyen, D.T.; Nguyen, T.N.; Kim, H.; Lee, H.J. A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 1861–1873.
13. Cai, L.; Dong, F.; Chen, K.; Yu, K.; Qu, W.; Jiang, J. An FPGA based heterogeneous accelerator for single shot multibox detector (SSD). In Proceedings of the 2020 IEEE 15th International Conference on Solid-State & Integrated Circuit Technology (ICSICT), Kunming, China, 3–6 November 2020; pp. 1–3.
14. Amin, R.A.; Hasan, M.; Wiese, V.; Obermaisser, R. FPGA-Based Real-Time Object Detection and Classification System Using YOLO for Edge Computing. IEEE Access 2024, 12, 73268–73278.
15. Zhai, J.; Li, B.; Lv, S.; Zhou, Q. FPGA-based vehicle detection and tracking accelerator. Sensors 2023, 23, 2208.
16. Kang, H.J. Real-time object detection on 640x480 image with vgg16+ ssd. In Proceedings of the 2019 International Conference on Field-Programmable Technology (ICFPT), Tianjin, China, 9–13 December 2019; pp. 419–422.
17. Yu, Z.; Bouganis, C.S. A parameterisable FPGA-tailored architecture for YOLOv3-tiny. In Proceedings of the Applied Reconfigurable Computing. Architectures, Tools, and Applications: 16th International Symposium, ARC 2020, Toledo, Spain, 1–3 April 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 330–344.
18. Kim, S.; Na, S.; Kong, B.Y.; Choi, J.; Park, I.C. Real-time SSDLite object detection on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 1192–1205.
19. Ngo, D.; Son, J.; Kang, B. VBI-Accelerated FPGA Implementation of Autonomous Image Dehazing: Leveraging the Vertical Blanking Interval for Haze-Aware Local Image Blending. Remote Sens. 2025, 17, 919.
20. Li, X.; Li, W.; Yan, X.; Wang, W.; Bu, F. Object Detection Method Based on Polarimetric Features and PFOD-Net Under Adverse Weather Conditions. Appl. Sci. 2026, 16, 1698.
21. Xie, F.; Jing, H.; Lu, Z.; Ju, S.; Peng, B.; Xie, T.; Yang, L.; Han, W.; Wang, Z.; Sai, G. FPGA-Based Front-End Low-Light Enhancement for Deterministic Vision-Only Driving Perception. Electronics 2026, 15, 1224.
22. Fata, J.S.; Elmannai, W.M. Low-Cost FPGA-Enhanced CNN Accelerator for Real-Time YOLO Object Detection and Classification. IEEE Access 2026, 14, 34614–34642.
23. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2018, arXiv:1708.02002.
24. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
25. Du, X.; Lin, T.Y.; Jin, P.; Ghiasi, G.; Tan, M.; Cui, Y.; Le, Q.V.; Song, X. SpineNet: Learning scale-permuted backbone for recognition and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11592–11601.
26. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
27. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. arXiv 2024, arXiv:2407.17140.
28. Li, B.; Ren, W.; Fu, D.; Tao, D.; Feng, D.; Zeng, W.; Wang, Z. Reside: A benchmark for single image dehazing. arXiv 2017, arXiv:1712.04143.
29. Kenk, M.A.; Hassaballah, M. DAWN: Vehicle detection in adverse weather nature dataset. arXiv 2020, arXiv:2008.05402.
30. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
31. Sakaridis, C.; Dai, D.; Hecker, S.; Van Gool, L. Model Adaptation with Synthetic and Real Data for Semantic Dense Foggy Scene Understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 707–724.
32. Li, B.; Peng, X.; Wang, Z.; Xu, J.; Feng, D. Aod-net: All-in-one dehazing network. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4770–4778.
33. Chen, D.; He, M.; Fan, Q.; Liao, J.; Zhang, L.; Hou, D.; Yuan, L.; Hua, G. Gated context aggregation network for image dehazing and deraining. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019; pp. 1375–1383.
34. Dong, H.; Pan, J.; Xiang, L.; Hu, Z.; Zhang, X.; Wang, F.; Yang, M.H. Multi-scale boosted dehazing network with dense feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2157–2167.
35. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11908–11915.
36. Cui, Y.; Knoll, A. Exploring the potential of channel interactions for image restoration. Knowl. Based Syst. 2023, 282, 111156.
37. Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941.
38. Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxim: Multi-axis mlp for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5769–5780.
39. Potlapalli, V.; Zamir, S.; Khan, S.; Khan, F. Promptir: Prompting for all-in-one blind image restoration. arXiv 2023, arXiv:2306.13090.
40. Gao, H.; Yang, J.; Zhang, Y.; Wang, N.; Yang, J.; Dang, D. Prompt-based Ingredient-Oriented All-in-One Image Restoration. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9458–9471.
41. Gao, H.; Ma, B.; Zhang, Y.; Yang, J.; Yang, J.; Dang, D. Frequency domain task-adaptive network for restoring images with combined degradations. Pattern Recognit. 2025, 158, 111057.
42. Zheng, D.; Wu, X.M.; Yang, S.; Zhang, J.; Hu, J.F.; Zheng, W.S. Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 25445–25455.
43. Yang, Y.; Wang, C.; Liu, R.; Zhang, L.; Guo, X.; Tao, D. Self-Augmented Unpaired Image Dehazing via Density and Depth Decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2037–2046.
44. Mishra, P.; Bhattacharya, J.; Sharma, R.K.; Ramponi, G. H2CGAN: Manageable AI for Scene Understanding Tasks in Hazy/Rainy Environment. IEEE Access 2024, 12, 89163–89182.
45. Bhattacharya, J.; Carini, A.; Marsi, S.; Ramponi, G. A Polynomial and Fourier Basis Network for Vision-Based Translation Tasks. Electronics 2026, 15, 52.
46. Bhattacharya, J.; Ramponi, G. Speeded-up Convolution Neural Network for classification tasks using multiscale 2-dimensional decomposition. Neurocomputing 2020, 410, 61–70.
47. Li, S.; Araujo, I.B.; Ren, W.; Wang, Z.; Tokuda, E.K.; Junior, R.H.; Cesar-Junior, R.; Zhang, J.; Guo, X.; Cao, X. Single Image Deraining: A Comprehensive Benchmark Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3833–3842.
48. Kumar, R.; Reddy, D.S.; Rajalakshmi, P. Driveindia: An Object Detection Dataset for Diverse Indian Traffic Scenes. In Proceedings of the 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), Gold Coast, Australia, 18–21 November 2025; pp. 4249–4254.

Figure 1. Dehazing of Cityscapes images with different enhancement methods.

Figure 2. Dual-distillation training structure illustrating the use of the enhancement teacher $T_{e}$ and the detection teacher $T_{d}$ to train the student S. Leveraging teacher features $F_{t}^{c}$, the student S learns to generate detection features $F_{s}^{c}$ from a hazy image $I^{h}$ that match those extracted from a clean image.
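
The feature-matching idea in the Figure 2 caption can be illustrated with a small sketch. It combines a plain feature-matching term with an activation-based attention transfer term of the commonly used form (channel-wise sum of squared activations, L2-normalized over spatial positions). This is only an illustration under stated assumptions: the function names, the weight `beta`, the mean-squared-error term, and the toy feature values are hypothetical and are not taken from the paper, whose exact loss formulation is not given in this section.

```python
import math

def attention_map(features, p=2):
    """Collapse a C x N feature tensor (C channels, N spatial positions)
    into a unit-norm spatial attention vector of length N."""
    n = len(features[0])
    # Sum |activation|^p over channels at every spatial position.
    amap = [sum(abs(ch[i]) ** p for ch in features) for i in range(n)]
    norm = math.sqrt(sum(a * a for a in amap)) or 1.0
    return [a / norm for a in amap]

def attention_transfer_loss(f_student, f_teacher, p=2):
    """L2 distance between normalized student and teacher attention maps."""
    a_s = attention_map(f_student, p)
    a_t = attention_map(f_teacher, p)
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a_s, a_t)))

def feature_mse(f_student, f_teacher):
    """Mean squared error between student and teacher feature values."""
    diffs = [(s - t) ** 2
             for cs, ct in zip(f_student, f_teacher)
             for s, t in zip(cs, ct)]
    return sum(diffs) / len(diffs)

def distillation_loss(f_student, f_teacher, beta=1.0):
    """Feature matching plus attention transfer.
    beta is a hypothetical weight, not a value from the paper."""
    return (feature_mse(f_student, f_teacher)
            + beta * attention_transfer_loss(f_student, f_teacher))

# Toy tensors with 2 channels and 4 spatial positions (made-up values).
# f_t_c stands in for teacher features of a clean image,
# f_s_c for student features computed from the hazy counterpart.
f_t_c = [[1.0, 0.0, 2.0, 0.0], [0.5, 0.0, 1.0, 0.0]]
f_s_c = [[0.8, 0.1, 1.9, 0.0], [0.4, 0.0, 1.1, 0.1]]

loss_same = distillation_loss(f_t_c, f_t_c)  # identical features
loss_diff = distillation_loss(f_s_c, f_t_c)  # mismatched features
```

The loss vanishes when the student reproduces the teacher features exactly and grows as either the feature values or the spatial attention pattern diverge, which is the behaviour the caption describes for $F_{s}^{c}$ and $F_{t}^{c}$.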

Figure 3. Enhancement teacher block $T_{e}$. $E_{x}$ and $E_{c}$ denote the encoders, and $D_{x}$ and $D_{c}$ denote the decoders of the degraded (x) and clean (c) domains, respectively. Reconstructed images for $I^{c}$, $I^{r}$, and $I^{h}$ are denoted by $I_{rec}^{c}$, $I_{rec}^{r}$, and $I_{rec}^{h}$, while $I_{gen}^{c}$, $I_{gen}^{r}$, and $I_{gen}^{h}$ represent the generated images in the clean, rainy, and hazy domains, respectively.

Figure 4. Modified SpineNet architecture, adapted from [25].

Figure 5. Feature similarity analysis between teacher and student models. “Hazy 49” and “Hazy 143” denote feature representations of hazy Cityscapes images extracted using T49 and T143, respectively. “Processed 49” and “Processed 143” correspond to features obtained from processed Cityscapes images using T49 and T143. “V1 Hazy” and “V1 Processed” denote features extracted by the S-V1-H student model on hazy and processed Cityscapes inputs, respectively, while “V4-P2C” represents features from the S-V4-P2C student model. All processed images are generated using $T_{e}$. “Clean 143” denotes features extracted from the clean Cityscapes dataset using T143. A significant dissimilarity is observed between the feature spaces of T49 and T143 in both the hazy and processed domains (in red and green). In contrast, the distilled feature maps of S-V1-H, trained with T143, exhibit improved similarity to the teacher representations (in yellow and blue).

Table 1. Object detection mAP results on hazy datasets using different detection architectures. It can be seen that SpineNet shows significant gain (shown in red) for hazy datasets.

Model Name	No. of Parameters	COCO	RESIDE	Dawn Haze
Retina-Net	53.1 M	43.9	35.7	51.0
Centre-Net	53 M	44.5	17.5	23.7
Efficient-Det6	52 M	50.5	44.3	59.4
YOLOv10	29.5 M	54.4	47.4	57.6
RT-DETR	76 M	54.3	49.5	57.4
SpineNet 96/143	42.7 M/66.73 M	47.1/48.1	60.6/62.7	70.6/75.5

Table 2. Object detection mAP on dehazed images using SpineNet. The Total Gain is the summation of the gains over the three datasets. The top score is highlighted in red.

	Cityscapes	RESIDE	DawnHaze	Total Gain
Hazy Image	36.52	62.67	75.47
ChaIR	36.07	62.72	76.21	0.34
D4	38.10	62.06	76.32	1.83
GCANeT	38.94	61.80	73.13	−0.79
MAXIM	36.01	62.71	74.95	−0.98
AOD	36.50	57.87	63.22	−17.07
CAPTNET	33.62	47.28	70.83	−22.93
DEHAZEFORMER	37.43	60.25	63.18	−13.80
DIFFUIR	36.31	62.73	72.87	−2.74
FDTANET	34.98	46.73	71.35	−21.59
FFA	37.46	62.54	75.26	0.61
MSBDN	38.71	60.66	73.55	−1.74
PROMPTIR	36.40	60.84	74.41	−3.01
H2CGAN	39.56	62.16	75.79	2.86
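
As a worked check of the Total Gain column, the short sketch below recomputes it for a few rows of Table 2 as the sum, over the three datasets, of the dehazed mAP minus the hazy-image mAP. The mAP values are copied from the table; the names `HAZY`, `DEHAZED`, and `total_gain` are illustrative only.

```python
# mAP values copied from Table 2, ordered as (Cityscapes, RESIDE, DawnHaze).
HAZY = (36.52, 62.67, 75.47)
DEHAZED = {
    "ChaIR": (36.07, 62.72, 76.21),
    "GCANet": (38.94, 61.80, 73.13),
    "AOD": (36.50, 57.87, 63.22),
    "H2CGAN": (39.56, 62.16, 75.79),
}

def total_gain(dehazed, baseline=HAZY):
    """Sum of per-dataset mAP gains of a dehazing method over the hazy input."""
    return sum(d - b for d, b in zip(dehazed, baseline))

# Total gain per method, rounded to two decimals as in the table.
gains = {name: round(total_gain(vals), 2) for name, vals in DEHAZED.items()}
```

Recomputing from the rounded table entries agrees with the reported column to within 0.01 mAP for these rows; small residual differences presumably come from rounding of the published per-dataset values.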

Table 3. Object detection mAP performance on enhanced images across different detection networks. Although gains with respect to the original image are observed in some cases, SpineNet achieves the highest mAP (shown in red), significantly outperforming the other detectors.

	Effdetv6	YOLOv10	RTDETR	Spinenet
DawnHaze (DH)	59.39	57.59	57.36	75.46
DH_ChaIR	57.18	56.61	59.45	76.20
DH_MAXIM	56.10	58.21	58.46	74.95
DH_D4	57.09	55.83	58.35	76.32
RESIDE (R)	44.34	47.37	49.45	62.66
R_ChaIR	46.24	47.15	47.56	62.71
R_MAXIM	44.79	47.22	47.34	62.71
R_D4	46.33	47.12	47.81	62.06

Table 4. Object detection mAP performance on Cityscapes images across different detection networks and for different synthetic haze values (clean, 0.005, 0.01, 0.02). The top scores are highlighted in red.

	Bicycle	Bus	Car	Motorcycle	Person	Truck	Train	Traffic Light	Avg
EfficientDet
Clean	39.83	53.18	48.13	28.33	28.05	35.72	4.76	18.67	32.08
0.005	38.76	51.38	47.11	27.71	28.4	30.68	4.35	18.81	30.9
0.01	36.14	47.6	46.03	27.29	27.87	29.8	4.35	18.26	29.66
0.02	32.45	37.8	42.63	25.04	26.18	26.19	0	16.15	25.80
YOLOv10
Clean	34.94	52.24	49.85	27.62	31.67	25.74	7.97	18.67	31.08
0.005	33.43	52.25	47.78	28.72	31.11	25.97	9.58	20.7	31.19
0.01	32.73	52.24	45.76	25.48	30.61	25.75	8.7	20.76	30.25
0.02	30.84	47.27	41.69	23.57	28.7	25.49	9.94	19.74	28.40
RT-DETR
Clean	34.96	53.07	49.28	29.27	30.46	28.47	17.39	23.66	33.32
0.005	34.39	48.29	46.3	27.43	30.43	29.38	4.35	25.92	30.81
0.01	32.64	46.81	44.17	24.52	29.54	28.65	8.7	25.02	30.00
0.02	26.85	45.86	38.08	24.83	31.84	32.54	10	24.81	29.35
SpineNet143
Clean	48.43	63.26	67.04	44.46	44.11	30.56	40.58	32.6	46.38
0.005	48.05	59.39	65.28	40.68	42.97	29.78	40.72	33.85	45.09
0.01	46.52	55.72	63.48	36.92	42.44	30.62	35.06	33.25	43.00
0.02	43.65	47.93	60.33	30.52	39.63	30.39	29.81	30.15	39.05

Table 5. Parameter details of teacher and student networks.

	Teacher
Filter Sizes	64, 83, 164, 166, 332, 664
No. of Blocks	stem (2), $L_{2}$ (1), $L_{3}$ (2), $L_{4}$ (4), $L_{5}$ (4), $L_{6}$ (2), $L_{7}$ (2)
Basic Blocks	residual (Conv 3 × 3, Conv 3 × 3),
	bottleneck (Conv 1 × 1, Conv 3 × 3, Conv 1 × 1)
	Student 1
Filter Sizes	64, 83, 164, 166, 332, 664
No. of Blocks	stem (2), $L_{2}$ (1), $L_{3}$ (2), $L_{4}$ (4), $L_{5}$ (4), $L_{6}$ (2), $L_{7}$ (2)
Basic Blocks	residual (Conv 3 × 3, Conv 3 × 3),
	transfer (Conv 1 × 1, Conv 1 × 1)
	Student 2
Filter Sizes	20, 83, 41, 64, 128, 192
No. of Blocks	stem (2), $L_{2}$ (1), $L_{3}$ (2), $L_{4}$ (4), $L_{5}$ (4), $L_{6}$ (2), $L_{7}$ (2)
Basic Blocks	residual (Conv 3 × 3, Conv 3 × 3),
	transfer (Conv 1 × 1, Conv 1 × 1)

Table 6. FPGA resource utilization. Part: XCZU9EG-FFVB1156-2-E.

Metrics	8-Bits Fixed-Point	16-Bits Fixed-Point
BRAM	34%	63%
DSP	0%	65%
LUT	4%	4%
FF	40%	60%
Frequency [MHz]	100	100
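
To make the two precisions in Table 6 concrete, the following sketch shows how a real value can be mapped to a signed fixed-point representation with round-to-nearest and saturation. It is an illustration only: the split between integer and fractional bits (6 fractional bits for the 8-bit case, 14 for the 16-bit case), the function name `to_fixed_point`, and the sample weights are assumptions, since the formats actually used in the deployment are not stated in this section.

```python
def to_fixed_point(x, total_bits, frac_bits):
    """Quantize x to a signed fixed-point value with saturation.

    Returns the representable real value, i.e. integer code / 2**frac_bits.
    """
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = round(x * scale)           # nearest integer code (Python rounds ties to even)
    q = max(q_min, min(q_max, q))  # saturate instead of wrapping around
    return q / scale

# Hypothetical weights; the last two fall outside the representable range.
weights = [0.3, -0.7512, 1.999, -3.2]

w8 = [to_fixed_point(w, 8, 6) for w in weights]     # assumed 8-bit format
w16 = [to_fixed_point(w, 16, 14) for w in weights]  # assumed 16-bit format

# Worst-case quantization error on the two in-range weights.
err8 = max(abs(a - b) for a, b in zip(weights[:2], w8[:2]))
err16 = max(abs(a - b) for a, b in zip(weights[:2], w16[:2]))
```

With the same integer range, the 16-bit format has a much smaller quantization step than the 8-bit one, which is the accuracy side of the trade-off against the higher BRAM, DSP, and FF utilization reported in Table 6.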

Table 7. Performance (mAP) of distilled student network (Student 1Bd) on Cityscapes hazy with haze levels of 0.01 and 0.02. Top scores are highlighted in red.

	$< I^{h}, F_{t}^{h} >$	$< I^{h}, F_{t}^{c} >$	FP32	FP16	P	Haze 0.01	Haze 0.02
Student		✓		✓	✓	33.85	30.23
Student		✓		✓		33.28	29.97
Student	✓			✓	✓	32.77	28.90
Student	✓			✓		33.50	28.97
Student		✓	✓		✓	33.81	30.32
Student		✓	✓			33.82	30.51
Student	✓		✓		✓	32.70	29.01
Student	✓		✓			33.40	28.81
Teacher						32.86	29.01

Table 8. Detection results (mAP) for different YOLO and Spine variants under haze levels of 0.01 and 0.02 on the Cityscapes dataset. The 8.58 M-parameter student model (highlighted in red) achieves the highest mAP among both similarly sized and larger competitors. The 3.67 M-parameter student model (highlighted in green) demonstrates a significant improvement of 6–8 mAP points compared to a similarly sized 2.30 M-parameter competitor (highlighted in orange).

	YOLOv10-M	YOLOv10-S	YOLO26-S	SPINE 49s	Student 1Bd (Ours)	YOLOv10-N	Student 2Ad (Ours)
Haze level 0.01	32.90	22.98	28.04	32.86	33.85	15.9	23.54
Haze level 0.02	28.75	21.13	25.54	29.01	30.50	14.4	20.82
Pars (M)	15.4	7.20	9.50	11.15	8.58	2.30	3.67

Table 9. Object detection performance (mAP) on the RIS dataset. Results are reported for the proposed Student 1Bd and Student 2Ad models in comparison with existing architectures. The 8.58 M-parameter student model (highlighted in red) achieves the highest mAP among both similarly sized and larger competitors. The 3.67 M-parameter student model (highlighted in orange) attains comparable performance to the 8.58 M model despite a 57% reduction in the number of parameters.

	YOLO26-S	SPINE 49s	Student 1Bd (Ours)	Student 2Ad (Ours)
mAP	15.01	18.47	21.40	21.2
Pars (M)	9.5	11.15	8.58	3.67

Table 10. Object detection performance (mAP) on the BDD dataset. Results are reported for the proposed Student 1Bd and Student 2Ad models in comparison with competing architectures. Although the 8.58 M-parameter student model (shown in green) does not surpass the 11.15 M-parameter baseline (shown in red), it outperforms the 9.5 M-parameter competitor. The drop of the 3.67 M student (shown in orange) with respect to the larger student is more pronounced than under the other degradations.

	YOLO26-S	SPINE 49s	Student 1Bd (Ours)	Student 2Ad (Ours)
mAP	20.41	29.8	24.38	15.07
Pars (M)	9.5	11.15	8.58	3.67

Table 11. Object detection performance (mAP) on the Cityscapes validation dataset for different student variants (V1–V5). The notation S-I-Y-Z denotes a student model of version I, trained on domain Y and evaluated on domain Z. In the case of dual-teacher training, Y is expressed as A2B, where A represents the input domain and B the target domain. Teachers of different capacities are denoted as T49 and T143. The teacher trained on clean images (highlighted in red) achieves only a 5 mAP point improvement compared to the best student model trained on hazy inputs (highlighted in green), despite the student having approximately 57% fewer parameters. Furthermore, the student significantly outperforms the T49-H hazy baseline (highlighted in orange) within a comparable parameter range.

	Bicycle	Bus	Car	Motorcycle	Person	Traffic Light	Train	Truck	Avg.	Pars (M)
S-V1-H-H	40.71	48.09	62.25	29.76	40.57	29.36	35.29	26.60	39.07	28.7
S-V1-H-P	41.55	54.40	62.77	33.09	42.11	28.44	32.01	29.72	40.51	28.7
S-V2-H-P	40.42	52.25	60.91	34.59	41.07	27.88	33.33	25.38	39.47	28.7
S-V3-H-P	40.33	53.43	63.97	32.62	42.37	29.30	36.25	26.83	40.63	28.7
S-V4-H2P-H	41.01	50.50	62.22	31.23	41.96	29.74	34.96	26.04	39.70	28.7
S-V4-H2C-H	44.24	49.58	62.18	34.40	42.90	28.04	29.32	26.90	39.69	28.7
S-V4-P2C-P	42.97	53.96	65.32	36.71	42.29	28.92	32.54	25.44	41.01	28.7
S-V5-H2C-H	42.82	53.44	64.52	36.99	42.92	28.43	33.24	26.85	41.15	28.7
T49-H	34.32	46.17	55.31	23.33	34.63	21.20	16.44	25.41	32.10	28.3
T143-H	45.99	54.10	62.97	35.75	41.59	32.21	34.16	30.00	42.09	66.9
T143-C	48.43	63.26	67.04	44.46	44.11	32.60	40.58	30.56	46.38	66.9
T143-P	45.86	56.87	62.02	37.98	42.39	31.38	44.57	29.92	43.87	66.9
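
The S-I-Y-Z naming scheme from the Table 11 caption can be unpacked mechanically, as in the sketch below. The parser and its names (`ModelTag`, `parse_tag`) are hypothetical helpers written for illustration. Reading the letters H, P, and C as hazy, processed, and clean is an assumption based on the Figure 5 caption, and for teacher labels such as T143-C the sketch simply records the domain suffix without deciding whether it refers to training or evaluation.

```python
from typing import NamedTuple, Optional

# Assumed meaning of the domain letters (inferred from the Figure 5 caption).
DOMAINS = {"H": "hazy", "P": "processed", "C": "clean"}

class ModelTag(NamedTuple):
    role: str                     # "student" or "teacher"
    variant: str                  # e.g. "V4" for a student, "143" for a teacher
    train_input: Optional[str]    # A in S-I-A2B-Z, or Y in S-I-Y-Z
    train_target: Optional[str]   # B in S-I-A2B-Z (dual-teacher training only)
    domain: str                   # Z for students; the domain suffix for teachers

def parse_tag(tag: str) -> ModelTag:
    """Split a Table 11 row label into its components."""
    parts = tag.split("-")
    if parts[0] == "S" and len(parts) == 4:
        _, variant, train, domain = parts
        if "2" in train:
            # Dual-teacher training: "A2B" means input domain A, target domain B.
            train_input, train_target = train.split("2", 1)
        else:
            train_input, train_target = train, None
        return ModelTag("student", variant, train_input, train_target, domain)
    if parts[0].startswith("T") and len(parts) == 2:
        return ModelTag("teacher", parts[0][1:], None, None, parts[1])
    raise ValueError(f"unrecognized model tag: {tag!r}")

best_student = parse_tag("S-V5-H2C-H")
single_teacher_student = parse_tag("S-V1-H-P")
clean_teacher = parse_tag("T143-C")
```

For example, `S-V5-H2C-H` reads as a version-5 student with dual-teacher training from input domain H to target domain C, evaluated on domain H.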

Table 12. Generalization performance on the DriveIndia dataset in terms of object detection accuracy (mAP). The distilled S-V4-H2C-H student model (highlighted in green), trained using the T143 teacher model (highlighted in red), achieves performance comparable to similarly sized competitors, including the T49 baseline and YOLO26-L (highlighted in orange).

	Bicycle	Bus	Car	Motorcycle	Person	Truck	Avg.	Pars (M)
S-V4-H2C-H	11.44	44.56	89.76	80.60	33.10	37.15	49.43	28.7
T143	17.56	39.89	82.80	78.29	75.52	43.81	56.31	66.9
T49	18.76	37.29	86.31	81.58	29.16	37.01	48.35	28.3
YOLO26-L	18.83	37.86	85.92	81.80	22.93	40.84	48.03	24.8

Table 13. Distillation results on the YOLO architecture using the YOLOv10-X teacher model (highlighted in red). Object detection performance is reported in terms of mAP. The 19.6 M-parameter distilled model exhibits a 3 mAP point decrease on the BDD dataset compared to the 7 M-parameter model. The observed 3 mAP point improvement on the Cityscapes dataset is not substantial when considering the increase in model size from YOLOv10-S to the distilled s2x model.

	BDD Night	Cityscapes (0.02)
YOLOv10-S (7 M)	20.41	21.12
YOLOv10-X (28 M)	32.90	30.62
Distilled s2x (19.6 M)	17.49	24.72

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
