Article

Estimation of Fractal Dimension and Semantic Segmentation of Motion-Blurred Images by Knowledge Distillation in Autonomous Vehicle

Division of Electronics and Electrical Engineering, Dongguk University, 30 Pildong-ro, 1-gil, Jung-gu, Seoul 04620, Republic of Korea
*
Author to whom correspondence should be addressed.
Fractal Fract. 2025, 9(7), 460; https://doi.org/10.3390/fractalfract9070460
Submission received: 19 June 2025 / Revised: 8 July 2025 / Accepted: 11 July 2025 / Published: 15 July 2025

Abstract

Research on semantic segmentation for remote sensing road scenes has advanced significantly, driven by autonomous driving technology. However, motion blur from camera or subject movements hampers segmentation performance. To address this issue, we propose a knowledge distillation-based semantic segmentation network (KDS-Net) that is robust to motion blur, eliminating the need for image restoration networks. KDS-Net leverages innovative knowledge distillation techniques and edge-enhanced segmentation loss to refine edge regions and improve segmentation precision across various receptive fields. To enhance the interpretability of segmentation quality under motion blur, we incorporate fractal dimension estimation to quantify the geometric complexity of class-specific regions, allowing for a structural assessment of predictions generated by the proposed knowledge distillation framework for autonomous driving. Experiments on well-known motion-blurred remote sensing road scene datasets (CamVid and KITTI) demonstrate mean IoU scores of 72.42% and 59.29%, respectively, surpassing state-of-the-art methods. Additionally, the lightweight KDS-Net (21.44 M parameters) enables real-time edge computing, mitigating data privacy concerns and communication overheads in internet of vehicles scenarios.

1. Introduction

The emergence of deep learning technology has significantly expedited the progress of autonomous driving, thereby catalyzing research on semantic segmentation using road scene data and leading to remarkable advancements [1,2]. However, most of these studies rely on clear images devoid of motion blur and neglect the challenges posed by motion-blurred environments. In general, high-level knowledge needs to be extracted by combining low-level features such as points and lines when analyzing images. Motion-blurred images resulting from camera or subject movements often feature indistinct points and lines, which hinder the extraction of high-level information and diminish semantic segmentation accuracy. Consequently, segmentation techniques that perform well on clear images suffer a decline in performance in motion-blurred environments. This tendency implies a risk of dangerous accidents during autonomous driving when objects within images are not accurately recognized.
Various approaches have been explored to mitigate the decline in semantic segmentation accuracy in motion-blurred environments [3,4,5,6,7,8,9,10]. These approaches can be categorized into those that first restore motion-blurred images before segmentation and those that perform semantic segmentation directly on images acquired from a front-viewing vehicle camera without a restoration step. Methods that perform semantic segmentation after image restoration achieve high segmentation accuracy, but the additional restoration process inevitably increases the number of trainable parameters of the entire network and requires a very long inference time.
To address these drawbacks, this study presents a method for improving semantic segmentation performance in motion-blurred remote sensing images obtained from a vehicle’s front-viewing camera without using a blur restoration network. In contrast to prior research, this study makes the following five distinct contributions:
  • The knowledge distillation (KD) technique is adopted to improve semantic segmentation performance at a low computational cost, without the additional restoration step for motion-blurred images captured from a vehicle’s front-viewing camera. For this purpose, a KD-based semantic segmentation network (KDS-Net) is proposed.
  • A new type of edge-enhanced segmentation loss (EESL) is proposed to utilize edge features that disappear because of motion blur. To achieve this, an edge-mask generator (EMG) is designed to generate an edge mask from ground truth segmentation data, ensuring the KDS-Net’s robustness in edge regions.
  • To enhance segmentation performance on objects of various sizes in a motion-blurred environment, multiscale inputs are used for the backbone encoder and decoder architecture, and the shallow convolution module (SCM) is applied to reduce the resultant increase in computational load. Additionally, a feature attention fusion module (FAFM) is proposed to utilize multiscale features.
  • Our lightweight KDS-Net keeps the number of parameters to 21.44 million, demonstrating that it can run at real-time speed on embedded systems, such as edge-computing devices, for real-world vehicle applications. This confirms that KDS-Net can be applied to the edge intelligence-empowered internet of vehicles, removing the data privacy concerns and communication overheads that arise when the large volume of images from a vehicle’s frontal-viewing camera is transmitted to, and the segmentation results are received from, a high-performance cloud over 5G.
  • To analyze the shape-level consistency of the segmentation results under motion blur, we incorporate fractal dimension estimation to evaluate the complexity and irregularity of class-specific regions, thereby providing a structural measure of segmentation correctness within the proposed KD framework. In addition, the proposed model and code, along with the experimental databases, are disclosed via GitHub (https://github.com/JSI5668/KDS-Net.git (accessed on 13 July 2025)) to facilitate fair evaluation by other researchers.
The remainder of this paper is organized as follows: Section 2 presents different semantic segmentation methods that involve using images obtained from a vehicle’s front-viewing camera; Section 3 details the proposed method; Section 4 presents an analysis of the experimental results; Section 5 provides relevant discussions; and Section 6 concludes the paper.

2. Related Work

Semantic segmentation methods that involve using images acquired from front-viewing cameras of vehicles are categorized into those that consider motion blur and those that do not. The methods in which motion blur is not considered are further classified into those that perform real-time semantic segmentation and those that achieve high accuracy without performing in real time. Semantic segmentation methods in which motion blur is considered are categorized based on whether they utilize a restoration network.

2.1. Not Considering Motion Blur

2.1.1. Semantic Segmentation for Real-Time Processing

Zhao et al. [11] emphasized the importance of not compromising image quality while increasing the speed of the semantic segmentation algorithm for practical tasks. They employed a method for lowering the resolution of input images to reduce computational load. A multi-branch approach was implemented to mitigate the decrease in segmentation accuracy. Detailed information loss occurs when the resolution of an input image is reduced. Zhao et al. applied a shallow convolutional neural network (CNN) for a bottom branch that uses original resolution images and a deep CNN for a top branch that uses 1/4 resolution images. Detailed information missed in the top branch owing to the low resolution was restored in the bottom branch, thus lowering the overall computational cost while maintaining accuracy. Li et al. [12] discovered that previous methods involving real-time processing utilized either dilated convolution, which maintains the number of parameters while generating a large receptive field, or depth-wise separable convolution, which can reduce the number of parameters. However, they suggested that simply replacing the standard convolution with depth-wise convolution is not advisable owing to significant performance degradation. They proposed a well-designed combination of dilated convolution and depth-wise separable convolution for real-time semantic segmentation.
Wu et al. [13] argued that spatial dependency and contextual information play a crucial role in enhancing segmentation accuracy. They designed a context-guided block (CG block) that maximizes the utilization of local features, surrounding context, and global context. They created a context-guided network (CGNet) based on CG blocks to reduce the number of parameters and efficiently use the memory space. These aforementioned techniques have high processing speed but limited semantic segmentation accuracy, leading to the exploration of the following methods.

2.1.2. Semantic Segmentation with High Accuracy

Chen et al. [14] proposed four enhanced versions, from DeeplabV1 to DeeplabV3-Plus, to improve the accuracy of semantic segmentation. They recommended actively using atrous convolution in DeeplabV1, atrous spatial pyramid pooling to capture multi-scale context in DeeplabV2, and atrous convolution in the existing residual network (ResNet) in DeeplabV3 to achieve a dense feature map. Finally, in DeeplabV3-Plus, semantic segmentation performance was enhanced by combining separable convolution with atrous convolution. Fu et al. [15] aimed to obtain rich contextual dependency in scene segmentation tasks using a self-attention mechanism. Unlike previous methods in which context information is acquired through multi-scale feature fusion, a dual-attention network (DANet) that adaptively integrates local features with global dependencies was used. For this purpose, two attention modules (position and channel attention modules) were added to the DANet. In the position attention module, the features of each position are selectively aggregated by using the weighted sum of the features of all positions. In the channel attention module, mutually dependent channel maps are selectively highlighted by integrating related features between all channel maps. The outputs of these two attention modules are then summed to improve feature representation, resulting in accurate semantic segmentation.
Wang et al. [16] emphasized that high-resolution representation is essential for semantic segmentation and proposed a high-resolution network (HRNet). Most previous methods involve restoring high-resolution representation from low-resolution representation. However, the HRNet is designed to maintain high-resolution representation throughout the process. Initially, the high-resolution subnetwork is used as the first stage. High-to-low resolution subnetworks are sequentially added, and each multi-resolution subnetwork is connected in parallel. This approach leads to rich semantic information and demonstrates excellence in various applications, including human pose estimation, semantic segmentation, and object detection. In the aforementioned methods, motion blur was not considered, resulting in degraded segmentation performance in motion-blurred images. Thus, methods in which motion blur is considered were explored.

2.2. Considering Motion Blur

Motion blur, which occurs when a camera shakes or a subject moves, degrades semantic segmentation performance owing to the resultant ambiguous edges and loss of texture information in images. To overcome this drawback, several studies have been conducted on semantic segmentation considering motion blur in images obtained from a vehicle’s front-viewing camera. These studies can be categorized into those in which image restoration was performed before segmentation and those in which image restoration was not included.

2.2.1. Semantic Segmentation with Motion Blur Restoration

Jeong et al. [3] initially performed image restoration to enhance semantic segmentation performance in motion-blurred images captured by a vehicle’s front-viewing camera and subsequently proposed a supervised dual-attention network for motion deblurring (SDAN-MD). The SDAN-MD is a multi-stage model that incorporates the UNet as a subnetwork. Spatial and channel attentions are used in each stage, with a supervised dual-attention module designed to provide a supervisory signal from the ground truth. Additionally, Jeong et al. utilized a feature map extracted from a pretrained segmentation network using a clear blur-free image as a perceptual loss to further improve semantic segmentation performance. Effectively restoring motion blur prior to semantic segmentation led to notable performance improvements. However, these methods involve extended processing times due to the additional blur restoration process. Thus, methods that do not involve blur restoration have also been explored.

2.2.2. Semantic Segmentation Without Motion Blur Restoration

Data Augmentation-Based Method
Kamann et al. [5] asserted that CNNs need to be robust to degraded images with noise and blur to ensure the safety of critical applications such as autonomous driving. They addressed this issue by enhancing bias towards object shapes. They used fake images where colors randomly chosen for each class label were alpha-blended into RGB training images and introduced a novel data augmentation technique known as painting by numbers. Consequently, their proposed network demonstrated superior performance over other networks in road scene databases. Franchi et al. [4] introduced a data augmentation technique that generates new unlabeled images from superpixels to enhance robustness, a fundamental requirement in semantic segmentation. They established components of an optimal deep CNN (DCNN) learning system for semantic segmentation by combining the system’s mixing technology with a teacher–student framework. Despite the minor performance improvements of these methods, further explorations have been undertaken to identify alternative approaches.
KD-Based Method
Guo et al. [6] observed that semantic segmentation performance is degraded because of the discrepancy between the feature distribution learned from clear, blur-free images and that learned from degraded images. They introduced a novel network leveraging teacher–student networks to bridge this discrepancy. The source network shares the same architecture as the target network, and only the target network is used during testing.
Fan et al. [17] proposed augmentation-free dense contrastive knowledge distillation (Af-DCD) which transfers structural knowledge via masked feature mimicking and contrastive learning. This approach showed strong performance on frontal-view driving datasets such as CamVid and Cityscapes. Mansourian et al. [18] introduced attention-guided feature distillation (AttnFD), which uses attention priors to guide the student toward spatially informative regions of the teacher’s feature map. This spatial guidance improves boundary localization and overall segmentation accuracy. Liu et al. [19] proposed boundary-privileged knowledge distillation (BPKD), which separates edge and body regions, and applies region-specific distillation to enhance semantic consistency and structural awareness in the student model.
However, these methods, which utilize both teacher and student networks, present challenges owing to their high learning difficulty and potential performance discrepancies between the two networks. Thus, alternative methods have been proposed.
Enhanced Segmentation Model-Based Method
Vertens et al. [7] proposed a new CNN architecture that predicts the label and motion status of each pixel in images. They devised an architecture to learn the process of generating semantic motion labels by integrating optical flow maps and segmentation kernels through a pair of consecutive images. Yu et al. [8] introduced a technique in which segmentation and edge detection are synergistically combined for mutual benefit. They claimed that segmentation can be easily converted to contour edges for edge learning. Thus, they proposed a multi-task learning method for combined edge and segmentation learning and experimentally proved that their method can result in performance improvement in corrupted images having blur.
Zhang et al. [9] provided the DADA-seg database containing various elements such as motion blur and object occlusions. Furthermore, they proposed a multi-model segmentation architecture based on event-aware fusion (EF) and event-aware domain adaptation (EDA). Rahman et al. [10] claimed that segmentation models exhibiting strong performance remain vulnerable to diverse external factors, leading to diminished segmentation accuracy. To address this shortcoming, they proposed a failure detection framework for finding pixel-level errors. They designed the framework to learn a failure detection network while utilizing the features of the segmentation model. Accordingly, they devised a segmentation network robust to various conditions and experimentally proved its exceptional performance. However, given the escalating complexity of these methods, this study presents the following approach as a resolution.
Fusion of KD-Based and Enhanced Segmentation Model-Based Method
To address the drawbacks of the aforementioned methods, this study proposes the KDS-Net, a semantic segmentation network robust to motion blur. The KDS-Net eliminates the need for image restoration during inference by leveraging the advantages of the KD method and an enhanced segmentation model-based approach.

3. Proposed Method

3.1. Overview of Proposed Method

Figure 1 shows the overall procedure of our proposed method. In the training step (step (1)), motion-blurred images obtained from a vehicle’s front-viewing camera are captured. In step (2), these motion-blurred images are restored to clean images by applying a restoration network. The open and pretrained SDAN-MD model [3] was utilized as a restoration framework because it outperforms other restoration models in terms of semantic segmentation performance when applied to motion-blurred images. In step (3), the proposed KDS-Net is trained for semantic segmentation on both the restored images from step (2) and the motion-blurred images not restored in step (1).
Once step (3) is completed, the model trained for segmentation of both restored and unrestored images is employed as a teacher model to implement knowledge distillation (KD) onto a student model. In step (4), a student model is trained for testing, where the input is a motion-blurred image that has not undergone restoration. Here, the student model incorporates KD from the teacher model, which has been trained on a larger database and is trained to mimic the representations of the teacher model. Both our teacher and student models utilize the KDS-Net proposed in this study.
In the subsequent testing step, after a motion-blurred image is captured in step (5), semantic segmentation is directly performed using the student network of the KDS-Net without restoring the motion-blurred images. In other words, a restoration model is not required during testing because only the student model of the trained KDS-Net is used. The proposed KD method is explained in further detail in the following subsection.
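For clarity, the sketch below condenses the flow of Figure 1 into PyTorch-like form. The tiny stand-in network, tensor shapes, and single-step loss computations are placeholders introduced only for illustration; the actual KDS-Net architecture and losses are described in Section 3.2 and Section 3.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-in for the segmentation network; the real KDS-Net encoder/decoder
# (Tables 1 and 2) is far larger. Introduced here only to illustrate the flow.
class TinySegNet(nn.Module):
    def __init__(self, num_classes=12):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        feats = self.encoder(x)
        return self.head(feats), feats  # logits and encoder features

teacher, student = TinySegNet(), TinySegNet()
ce = nn.CrossEntropyLoss()

blurred = torch.rand(2, 3, 224, 224)          # step (1): motion-blurred input
restored = torch.rand(2, 3, 224, 224)         # step (2): output of the restoration network
labels = torch.randint(0, 12, (2, 224, 224))  # ground truth segmentation masks

# Step (3): the teacher is trained on both restored and blurred images (optimizer steps omitted).
t_logits, _ = teacher(torch.cat([blurred, restored]))
teacher_loss = ce(t_logits, torch.cat([labels, labels]))

# Step (4): the student sees only blurred images and mimics the frozen teacher's features.
with torch.no_grad():
    _, t_feats = teacher(blurred)
s_logits, s_feats = student(blurred)
student_loss = ce(s_logits, labels) + F.mse_loss(s_feats, t_feats)

# Step (5) onward: at test time only the student is run on the blurred image.
pred = student(blurred)[0].argmax(dim=1)
```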

3.2. KDS-Net for the Segmentation of Motion-Blurred Images

In general, employing a vast database and network facilitates feature extraction, thereby enhancing the performance of deep learning technology. However, limitations in terms of capacity and processing speeds exist. KD is used to address these challenges by alleviating constraints such as memory space, learning time, and practical limitations, enabling improved performance across various tasks dependent on factors such as model complexity and database size [20]. The core idea of KD involves training a high-performance teacher model on a large database and transferring its knowledge to a student model for inference, allowing the student model to achieve comparable performance with a smaller database and simpler architecture.
This study proposes the KDS-Net, which enhances semantic segmentation performance in motion-blurred images without requiring image restoration during inference by utilizing KD. A large database was constructed by combining motion-deblurred images obtained using a pretrained restoration network with motion-blurred images. A segmentation network trained on this database served as the teacher model, while a network trained only on motion-blurred images was designated as the student model. The teacher and student models were designed with identical architectures to ensure compatibility and efficient knowledge transfer, with the teacher model leveraging the additional restored images to extract richer features during training.
The knowledge transferred through KD encompassed responses, features, and relations, and was passed to the student model via offline-KD, online-KD, or self-KD methods [21]. Offline-KD, where the student model is trained using a pretrained teacher model, was employed in this study. Features from the encoders of both models were extracted, and the mean squared error (MSE) between them was incorporated into the loss function. Instead of directly using the features, vector quantization was applied to map feature vectors to discrete codewords in a codebook, enabling efficient knowledge transfer. This approach retained essential data features while helping the student model learn complex representations from the teacher model. Once trained, only the student model is utilized for inference.

3.2.1. Detailed Description of KDS-Net

Figure 2 displays the overall architecture of the KDS-Net, which was used as both the teacher and student models in this study. Table 1 and Table 2 present the detailed architectures of the encoder and decoder of the KDS-Net. In the proposed KDS-Net, four differently scaled images are used as the input, with magnification ratios of ×1, ×1/2, ×1/4, and ×1/8, as shown in stages 1–4 in Figure 2. Once the original-resolution image passes through the first encoder block, the image downsampled to a ×1/2 magnification ratio becomes an input that passes through the SCM. The top image in Figure 3 illustrates the detailed architecture of the SCM. In the SCM, two sets of 3 × 3 and 1 × 1 convolutional layers are used for efficiency, and their output is concatenated with the SCM input. The concatenated features are then refined through a 1 × 1 layer and passed through a 3 × 3 layer.
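A possible PyTorch rendering of this SCM description is sketched below; the channel widths are illustrative assumptions rather than the values in Table 1 and Figure 3.

```python
import torch
import torch.nn as nn

class SCM(nn.Module):
    """Shallow convolution module: two 3x3 + 1x1 pairs, concatenation with the
    input, then a 1x1 refinement and a final 3x3 layer (channel widths assumed)."""
    def __init__(self, in_ch=3, mid_ch=16, out_ch=32):
        super().__init__()
        self.pairs = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
        )
        self.refine = nn.Conv2d(mid_ch + in_ch, out_ch, 1)   # refine concatenated features
        self.out = nn.Conv2d(out_ch, out_ch, 3, padding=1)   # final 3x3 layer

    def forward(self, x):
        y = self.pairs(x)
        y = torch.cat([y, x], dim=1)   # concatenate with the SCM input
        return self.out(self.refine(y))

# e.g., the x1/2-scale image fed to stage 2
feat = SCM()(torch.rand(1, 3, 112, 112))   # -> (1, 32, 112, 112)
```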
The SCM output is then passed through the FAFM and merged with the feature produced by the first encoder block. The bottom image in Figure 3 depicts the detailed architecture of the FAFM. We fused the features in the FAFM by using channel attention. Channel attention was employed because, as the encoder layers deepen, the spatial resolution of a feature map decreases, leading to information accumulation along the channel axis. In this case, we used a combination of global average pooling (GAP) and global max pooling (GMP) because these two pooling methods provide complementary information.
GAP calculates the average value of each channel, offering global contextual information that helps us understand the overall distribution of features. Conversely, GMP calculates the maximum value of each channel, highlighting the most important features and aiding in the identification of crucial information in specific regions. By combining these two pooling methods, we maximize the effectiveness of channel attention, thereby enabling more effective learning of inter-channel interactions and their relative importance. The mathematical formulations for the channel attention and FAFM we employed are as follows:
$Z = \mathrm{Conv}(X)$ (1)
$f_c^{GAP} = \mathrm{Conv}\left(\frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} Z_{ijc}\right)$ (2)
$f_c^{GMP} = \mathrm{Conv}\left(\max\{Z_{ijc} \mid 1 \le i \le H,\ 1 \le j \le W\}\right)$ (3)
$S_c = \sigma\left(f_c^{GAP} + f_c^{GMP}\right)$ (4)
$X^{\prime}_{ijc} = X_{ijc} + S_c \cdot X_{ijc}$ (5)
The input $X$ is the concatenation of the output from the SCM and the output from the dense block of the previous stage, which is then transformed into $Z$ through a convolutional layer, as shown in Equation (1). Subsequently, as shown in Equations (2)–(4), the channel information is summarized using GAP and GMP, and the importance of each channel $S_c$ is generated through the sigmoid function $\sigma$. Finally, this importance is reflected to produce the output $X^{\prime}_{ijc}$ of the FAFM with channel attention, as shown in Equation (5).
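The channel attention of Equations (1)–(5) can be sketched as follows; the channel count and the use of a single shared 1 × 1 convolution for the GAP and GMP branches are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttentionFAFM(nn.Module):
    """Channel attention of the FAFM following Eqs. (1)-(5). The input X is
    assumed to be the concatenation of the SCM output and the previous stage's
    dense-block output; the channel count is a placeholder."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # Z = Conv(X), Eq. (1)
        self.fc = nn.Conv2d(channels, channels, 1)               # shared Conv on pooled descriptors

    def forward(self, x):
        z = self.conv(x)                                          # Eq. (1)
        f_gap = self.fc(torch.mean(z, dim=(2, 3), keepdim=True))  # Eq. (2): GAP branch
        f_gmp = self.fc(torch.amax(z, dim=(2, 3), keepdim=True))  # Eq. (3): GMP branch
        s = torch.sigmoid(f_gap + f_gmp)                          # Eq. (4): channel importance S_c
        return x + s * x                                          # Eq. (5): re-weighted output

x_scm = torch.rand(1, 32, 112, 112)       # output of the SCM
x_prev = torch.rand(1, 32, 112, 112)      # output of the previous dense block
out = ChannelAttentionFAFM(64)(torch.cat([x_scm, x_prev], dim=1))
```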
Through the FAFM in each encoder block, we facilitated the fusion of features from the image with a magnification ratio of ×1/2. This process was repeated until the feature of the image having a ×1/8 magnification ratio was fused and subsequently fed into the decoder. The decoder architecture of the KDS-Net is identical to that of the Nested U-Net [22], with the segmentation output of the final decoder block used to calculate the losses. In the KDS-Net, EESL, which incorporates the proposed EMG, is used alongside the categorical cross-entropy loss typically used in semantic segmentation. Detailed explanations of the EMG and EESL are provided in Section 3.2.2 and Section 3.3.2, respectively.

3.2.2. EMG

Recently, perceptual loss [23] has been widely employed along with per-pixel loss in image generation tasks such as style transfer or image synthesis. In this method, fake and real image pixels are not directly compared; instead, perceptual similarity is examined by performing comparisons in a feature space. A visual geometry group (VGG) network [24] pretrained with ImageNet is used. This approach aids in identifying high-level features, thereby facilitating the recognition of more intricate details. However, applying this method to semantic segmentation tasks has limitations. As image generation models such as generative adversarial networks (GANs) [25] typically produce three-channel outputs, a generated fake image is input into the VGG network, along with a real image, to extract a feature map. However, in semantic segmentation tasks, the number of output channels differs from the number of input channels, and the image pixel distributions vary significantly. Given these circumstances, this study introduces the EMG, which leverages the benefits of perceptual loss in semantic segmentation and alleviates the challenge of accurately detecting edges in the presence of motion blur.
We designed the EMG, a shallow CNN that generates an edge mask from the ground truth of semantic segmentation. The EMG consists of four convolution layers, as shown in Figure 4, with batch normalization and the rectified linear unit (ReLU) between each layer. An edge mask, which serves as the EMG label, is required for training the EMG. In the first stage, this training label is created: before the EMG is trained, the Laplacian filter is applied to the ground truth segmentation mask, S*, to generate the ground truth edge mask, E*, which is then used as the label for training the EMG. After training, the EMG is frozen and generates the predicted edge mask, Ê*, from S*. The loss used during EMG training is a cross-entropy loss, termed “edge loss” to distinguish it from the cross-entropy loss employed in the second stage. Subsequently, this frozen EMG plays the role that a pretrained VGG network commonly plays for perceptual loss.
In the second stage, the segmentation network’s outputs, Ŝ and S*, are input into the EMG, and feature maps are extracted from each layer, where the difference between the two is reflected as L2 loss. This approach is used to reduce the difference between the edge of Ŝ and the edge of S*, resulting in more refined segmentation outputs. This can be expressed mathematically in Equations (6)–(9), and the detailed structure of the EMG is presented in Table 3.
$E^{*} = \mathrm{Laplacian}(S^{*})$ (6)
$\hat{E}^{*} = \mathrm{EMG}(S^{*})$ (7)
$\hat{E} = \mathrm{EMG}(\hat{S})$ (8)
$\mathrm{Edge\ loss} = \mathrm{CrossEntropy}(\hat{E}^{*}, E^{*})$ (9)
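A minimal sketch of the EMG and its two-stage use (Equations (6)–(9) and (11)) is given below. The layer widths, the one-channel edge output, the fixed 3 × 3 Laplacian kernel, and the binary cross-entropy form of the edge loss are assumptions made for illustration rather than the exact configuration of Table 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMG(nn.Module):
    """Edge-mask generator: four conv layers with BN + ReLU in between
    (channel widths are assumptions, not the values in Table 3)."""
    def __init__(self, num_classes=12, width=32):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(num_classes, width, 3, padding=1),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1),
                          nn.BatchNorm2d(width), nn.ReLU(inplace=True)),
            nn.Conv2d(width, 1, 3, padding=1),     # predicted edge mask (one channel assumed)
        ])

    def forward(self, seg_mask, return_features=False):
        feats, x = [], seg_mask
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return (x, feats) if return_features else x

# Stage 1: train the EMG so that EMG(S*) reproduces the Laplacian edge mask E*.
emg = EMG(num_classes=12)
s_star_idx = torch.randint(0, 12, (1, 224, 224))                       # ground truth label map
s_star = F.one_hot(s_star_idx, 12).permute(0, 3, 1, 2).float()         # one-hot S*
lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
e_star = (F.conv2d(s_star_idx.unsqueeze(1).float(), lap, padding=1).abs() > 0).float()
edge_loss = F.binary_cross_entropy_with_logits(emg(s_star), e_star)    # "edge loss", Eq. (9)

# Stage 2 (EMG frozen): EESL compares layer-wise features for S* and the prediction.
for p in emg.parameters():
    p.requires_grad_(False)
s_hat = torch.softmax(torch.rand(1, 12, 224, 224), dim=1)              # segmentation output
_, f_gt = emg(s_star, return_features=True)
_, f_pred = emg(s_hat, return_features=True)
eesl = sum(F.mse_loss(a, b) for a, b in zip(f_gt[:3], f_pred[:3]))     # cf. Eq. (11)
```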

3.3. Loss Functions of KDS-Net

The KDS-Net proposed in this study utilizes cross-entropy loss, commonly employed in semantic segmentation, in addition to distillation loss, which transfers knowledge from a teacher model to a student model, and EESL. This EESL, designed to minimize edge discrepancies between the ground truth and model output, utilizes the EMG method introduced here. Further explanations for the two additional losses are provided in the following subsection.

3.3.1. Distillation Loss with Vector Quantization

In this study, the MSE was computed and applied as the distillation loss, ensuring that the features extracted from the student model closely resemble those extracted from the teacher model. The distillation equation used in this study is as follows:
$L_{distillation} = \sum_{i=1}^{4} \left\| VQ\left(\phi_t^{i}(I_B)\right) - VQ\left(\phi_s^{i}(I_B)\right) \right\|^{2}$ (10)
where $I_B$ is the motion-blurred image used as an input. $\phi_t^{i}(I_B)$ represents the feature map of the teacher model extracted from the dense block in the i-th stage presented in Table 1, whereas $\phi_s^{i}(I_B)$ represents the corresponding feature map of the student model. Moreover, $VQ$ denotes vector quantization. We adopted VQ as the basis for our knowledge distillation framework because it offers an effective balance between compactness and representational fidelity. Unlike other compression techniques such as pruning or low-rank decomposition, which primarily reduce redundancy in parameter space, VQ discretizes feature-space representations, enabling the student network to capture high-level semantics in a compressed yet structured form. This quantized guidance helps align the latent space of the student with that of the teacher without introducing additional architectural complexity. Furthermore, the codebook-based representation in VQ acts as an implicit regularizer, encouraging stable and interpretable feature transfer during distillation. When VQ is used, v codewords are aggregated to form one codebook, resulting in a total of g codebooks, and each continuous feature value is assigned to the nearest discrete codeword. In experiments with the training data, we set the parameter v to match the dimensionality of the feature map extracted from each encoder block, selecting the value that yielded the best segmentation accuracy; we set g to 8000 and the number of encoder blocks used for feature extraction to 3, with features ultimately extracted from the second encoder block. Finally, this loss was minimized so that the features extracted from the student model closely match those extracted from the teacher model.
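The vector-quantized feature matching of Equation (10) can be sketched as follows. A single random codebook and small feature maps are used purely for illustration; in practice the codebooks are configured as described above, and a straight-through estimator (or similar) would be needed to pass gradients through the nearest-codeword assignment.

```python
import torch
import torch.nn.functional as F

def vector_quantize(feat, codebook):
    """Replace every spatial feature vector by its nearest codeword.
    feat: (B, C, H, W); codebook: (K, C). A single codebook is used here for
    simplicity, whereas the paper aggregates g = 8000 codebooks."""
    b, c, h, w = feat.shape
    flat = feat.permute(0, 2, 3, 1).reshape(-1, c)            # (B*H*W, C)
    dist = torch.cdist(flat, codebook)                        # pairwise L2 distances
    idx = dist.argmin(dim=1)                                  # nearest-codeword index
    return codebook[idx].reshape(b, h, w, c).permute(0, 3, 1, 2)

# Distillation loss of Eq. (10): MSE between quantized teacher and student features.
codebook = torch.randn(512, 64)                               # K codewords of dimension C (illustrative)
phi_t = [torch.rand(1, 64, 56, 56) for _ in range(4)]         # teacher dense-block features
phi_s = [torch.rand(1, 64, 56, 56) for _ in range(4)]         # student dense-block features
l_distill = sum(F.mse_loss(vector_quantize(t, codebook),
                           vector_quantize(s, codebook)) for t, s in zip(phi_t, phi_s))
```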

3.3.2. EESL with EMG

Motion-blurred images generally reduce segmentation accuracy because the edges in the semantic segmentation results are compromised owing to the loss of detailed information. Therefore, this study introduces the EMG, which aims to preserve edge information in the damaged semantic segmentation output as much as possible, as detailed in Section 3.2.2. The equation for EESL utilizing the EMG is as follows:
$L_{EESL} = \sum_{i=1}^{3} \left( \mathrm{EMG}_i(gt) - \mathrm{EMG}_i(pred) \right)^{2}$ (11)
Here, $\mathrm{EMG}_i(gt)$ and $\mathrm{EMG}_i(pred)$ denote the feature maps extracted from the i-th conv layer presented in Table 3 when the ground truth and the predicted segmentation mask, respectively, are given as the EMG input. We used the MSE to minimize the difference between the two and thereby trained a segmentation network that produces more precise outputs, even for motion-blurred images whose edge information has been damaged by the blur. Finally, three losses are combined in the proposed KDS-Net: cross-entropy loss and the two aforementioned losses, as shown in Equation (12).
$L_{total\ loss} = L_{cross\ entropy} + L_{distillation} + \lambda L_{EESL}$ (12)
where λ is a balancing parameter that ensures the losses have a favorable impact on learning; we set λ to 0.05 based on experiments with the training data, selecting the value that yielded the highest segmentation accuracy.

3.4. Fractal Dimension Estimation (FDE)

Fractal structures exhibit self-similarity and often deviate from standard geometric representations [26]. The fractal dimension (FD) offers a quantitative measure of structural complexity, indicating whether a shape is more spatially concentrated or dispersed. In this study, we estimate the FD of a specific semantic class (e.g., car, sidewalk, and bicyclist) from semantic segmentation masks. For each selected class, we extract both the ground truth binary mask and the predicted mask produced by our proposed KDS-Net, and perform fractal analysis on each. The FD reflects the level of geometric detail in these binary shapes. A higher FD implies greater contour complexity or spatial irregularity in the segmented region. The FD for each class-specific region is estimated using the box-counting method [27], where N denotes the number of square boxes that intersect the foreground region, and λ is the box scaling factor. In our setting, FD is computed separately for the ground truth mask and the prediction result produced by KDS-Net, both of which are converted to a binary form for a specific semantic class. The FD is determined by the following Equation (13):
$FD = \lim_{\lambda \to 0} \frac{\log(N_{\lambda})}{\log(1/\lambda)}$ (13)
where $1 \le FD \le 2$, and for all $\lambda > 0$ there exists an $N_{\lambda}$. The pseudocode for estimating the FD of the activated part of the image using the box-counting method is provided in Algorithm 1.
Algorithm 1 Pseudocode for FDE
Input: BM: Binary masks of a specific semantic class extracted from the ground truth and KDS-Net predictions
Output: FD: Fractal dimension
1:  Determine the maximum dimension of the box and round it to the nearest power of 2 Max_dimension = max(size(BM))
   λ = 2^⌈log2(Max_dimension)⌉
2:  If the image size is smaller than λ , pad the image to match the dimension of λ
   if size(BM) < size( λ )
     Pad_width = ((0, λ − BM.shape[0]), (0, λ − BM.shape[1]))
     Pad_BM = pad(BM, Pad_width, mode = ‘constant’, constant_values = 0)
   else
     Pad_BM = BM
3:  Initialize an array storing the number of boxes for each dimension size
n = zeros(1, λ + 1)
4:  Compute the number of boxes, N( λ ) containing at least one pixel of the positive region n[ λ + 1] = sum(BM [:])
5:  While λ > 1:
     a. Reduce the size of λ by a factor of 2
     b. Update the number of boxes N( λ )
6:  Compute log(N( λ ) ) and log( 1 / λ ) for each λ
7:  Fit a line to the points [(log( 1 / λ ), log(N( λ ) )] using the least squares method
8:  FD is determined by the slope of the fitted line
Return FD
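A runnable NumPy counterpart of Algorithm 1 is sketched below; coarse-graining by reshaping the padded mask into scale × scale blocks is one common way to realize the box counting described above.

```python
import numpy as np

def fractal_dimension(bm: np.ndarray) -> float:
    """Estimate the box-counting fractal dimension of a binary mask (cf. Algorithm 1).
    bm: 2-D array where positive pixels belong to the chosen semantic class."""
    bm = bm > 0
    # Step 1: pad the mask to the next power-of-two square.
    size = 1 << int(np.ceil(np.log2(max(bm.shape))))
    padded = np.zeros((size, size), dtype=bool)
    padded[:bm.shape[0], :bm.shape[1]] = bm

    box_sizes, counts = [], []
    scale = 1
    while scale <= size:
        # Number of scale x scale boxes containing at least one positive pixel.
        blocks = padded.reshape(size // scale, scale, size // scale, scale)
        n_boxes = np.count_nonzero(blocks.any(axis=(1, 3)))
        if n_boxes > 0:
            box_sizes.append(scale)
            counts.append(n_boxes)
        scale *= 2
    # Steps 6-8: FD is the slope of log N(lambda) versus log(1/lambda).
    slope, _ = np.polyfit(np.log(1.0 / np.array(box_sizes)), np.log(counts), 1)
    return float(slope)

# Example: estimate the FD of a filled rectangular region in a CamVid-sized mask.
mask = np.zeros((240, 320), dtype=np.uint8)
mask[60:180, 80:240] = 1
print(round(fractal_dimension(mask), 3))
```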

4. Experimental Results and Analysis

4.1. Experimental Databases and Setup

In this study, we used the two most commonly utilized open road scene semantic segmentation databases, Cambridge driving labeled video (CamVid) [28] and Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) [29], for the experiment. These databases collectively encompass 12 classes, such as buildings, roads, vehicles, and unlabeled classes, and provide ground truth data for each class. The CamVid database has 701 images, all having the same resolution of 960 × 720 pixels, whereas the KITTI database contains 445 images having different resolutions. To facilitate a comparison with the SDAN-MD [3], which enhanced semantic segmentation performance in a motion-blurred environment using the CamVid and KITTI databases, we resized the CamVid images to 320 × 240 pixels and the KITTI images to 512 × 176 pixels in this study. For consistency, we implemented a two-fold cross-validation [30] during the experiment. The 701 images in the CamVid database were split into subsets A (351 images) and B (350 images), whereas the 445 images in the KITTI database were divided into subsets A (223 images) and B (222 images). During the first fold validation, we used subset A of each database as the training set and subset B as the testing set. For the second fold validation, we swapped subsets A and B and computed the average performance. Furthermore, we randomly selected 1/10 of the training set as the validation set in this study. Several databases, such as GoPro [31] and Hide [32], generate artificially blurred images, whereas the RealBlur database [33] captures blurred images in real-world environments. However, none of these databases provide ground truth segmentation labels. Because no databases containing motion-blurred images with ground truth segmentation labels exist, we created motion-blurred images by applying Kupyn et al.’s method [34] to the CamVid and KITTI databases. The primary goal was to enhance semantic segmentation performance in motion-blurred images captured by using a vehicle’s front-viewing camera. Thus, we constructed a motion-blurred road scene image database using the CamVid and KITTI road scene databases, both including ground truth segmentation labels, as depicted in Figure 5.
For training, we employed online data augmentation techniques, including random crop, color jitter, and horizontal flip [35]. Random crop involves selecting a random portion of an image at a specified patch size for cropping during each training iteration. Random color jitter involves randomly altering the lightness, hue, and saturation of an image. Random horizontal flip entails flipping an image horizontally with a random probability. The random color jitter technique was used owing to the mixed nature of the dataset, containing both dark and bright images. Horizontal flip was preferred over vertical flip because, in road scene databases, vehicles, pedestrians, and bicycles are unlikely to appear in the upper part of an image or in the sky. Detailed information about the databases used in this study is presented in Table 4.
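For segmentation, the crop and flip must be applied identically to the image and its label mask, while color jitter affects the image only. One hedged way to express this with torchvision's functional API (the patch size and jitter strengths here are illustrative assumptions) is shown below.

```python
import random
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def augment(image, mask, patch=(224, 224)):
    """Joint online augmentation: random crop, color jitter (image only),
    and random horizontal flip. Parameter values are illustrative assumptions."""
    # Random crop applied to the same window of image and mask.
    i, j, h, w = T.RandomCrop.get_params(image, output_size=patch)
    image, mask = TF.crop(image, i, j, h, w), TF.crop(mask, i, j, h, w)
    # Random color jitter on the image only (labels are unaffected by photometry).
    image = T.ColorJitter(brightness=0.2, saturation=0.2, hue=0.05)(image)
    # Random horizontal flip with probability 0.5, applied to both.
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    return image, mask

img = torch.rand(3, 240, 320)                    # CamVid-sized RGB tensor
lbl = torch.randint(0, 12, (1, 240, 320))        # per-pixel class labels
img_aug, lbl_aug = augment(img, lbl)
```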
Our experiments were conducted on a desktop computer equipped with an Intel® Core™ i7-6700 (Intel Corp., Santa Clara, CA, USA) central processing unit (CPU) and 16 GB of RAM (Samsung Electronics Co., Ltd., Suwon City, Gyeonggi-do, Republic of Korea), alongside an NVIDIA GeForce RTX 3060 (NVIDIA Corp., Santa Clara, CA, USA) graphics processing unit (GPU) with 12 GB of graphics memory [36]. The computer ran on an Ubuntu 20.04.3 operating system. The training and testing algorithms of our network were implemented using the PyTorch framework (version 1.12.1) [37].

4.2. Training of KDS-Net

For training the proposed KDS-Net, we employed the adaptive moment estimation (Adam) optimizer [38] as a method for optimizing weight parameters. The initial learning rate was 0.0001, with beta 1 and beta 2 set to 0.9 and 0.999, respectively, and epsilon set to 10^−8. Additionally, we ensured that the initial learning rate decreased by a factor of 0.1 at each step. The batch size was set to 4, and the number of epochs was set to 400. The CamVid database was trained on images resized to 224 × 224 pixels, whereas the KITTI database was trained on a patch size of 160 × 160 pixels. As summarized in Table 5, the teacher model and student model demonstrated different training times per epoch for both databases. As shown in Figure 1, the teacher model required significantly more time due to the training process involving both motion-blurred images and their corresponding restored versions, effectively doubling the amount of data processed. On the other hand, while the student model incurred additional time for knowledge distillation from the teacher model, it only processed motion-blurred images directly, resulting in comparatively reduced training time. Figure 6 reveals the training and validation losses and their corresponding accuracies across the epochs during the training of the proposed KDS-Net. As the number of epochs increased, the training loss approached zero and the training accuracy converged to a sufficiently high value, indicating that our network was stably trained on the training dataset. Furthermore, as the number of epochs increased, the validation loss converged to a small value and the corresponding accuracy to a large value, indicating that our network did not overfit the training data.
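The optimizer configuration described above can be reproduced as follows; the learning-rate step interval is not specified in the text, so the step_size below is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 12, 3, padding=1)   # stand-in for the KDS-Net parameters

# Adam with the stated hyperparameters: lr = 1e-4, betas = (0.9, 0.999), eps = 1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)
# Learning rate decayed by a factor of 0.1; the step interval is a placeholder.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

for epoch in range(400):                 # 400 epochs, batch size 4 in the paper
    # ... forward/backward passes over the training set would go here ...
    optimizer.step()
    scheduler.step()
```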

4.3. Testing of Proposed Method

4.3.1. Evaluation Metrics

In this study, to compare our proposed network with other networks, we utilized five commonly used semantic segmentation evaluation metrics from existing research, as shown in Equations (14)–(18): pixel accuracy (PA), mean pixel accuracy (mPA), class intersection over union (Class IoU), frequency-weighted intersection over union (FW IoU), and mean intersection over union (mIoU) [39,40].
$PA = \frac{\sum_{i=0}^{K} n_{ii}}{\sum_{i=0}^{K} c_{i}}$ (14)
$mPA = \frac{1}{K+1} \sum_{i=0}^{K} \frac{n_{ii}}{c_{i}}$ (15)
$\mathrm{Class\ IoU}_i = \frac{n_{ii}}{c_{i} + \sum_{j=0}^{K} n_{ji} - n_{ii}}$ (16)
$FW\ IoU = \left(\sum_{k=0}^{K} c_{k}\right)^{-1} \sum_{i=0}^{K} c_{i} \cdot \mathrm{Class\ IoU}_i$ (17)
$mIoU = \frac{1}{K+1} \sum_{i=0}^{K} \frac{n_{ii}}{c_{i} + \sum_{j=0}^{K} n_{ji} - n_{ii}}$ (18)
In each equation, $K + 1$ represents the number of classes, and $c_i$ denotes the total number of pixels in the i-th class. $n_{ii}$ represents the number of pixels correctly classified as the i-th class, while $n_{ji}$ indicates the number of pixels of the i-th class misclassified as the j-th class. In Equation (14), PA represents the ratio of correctly classified pixels to the total number of pixels. In Equation (15), mPA denotes the average of the per-class accuracies, obtained by dividing their sum by the total number of classes.
In Equation (16), Class IoU_i represents the ratio of the number of pixels common to the ground truth and the prediction for the i-th class to the number of pixels in their union. While PA and mPA do not penalize misclassified pixels, Class IoU_i decreases as the number of misclassified pixels increases because the denominator becomes larger. If there is a significant difference in the number of pixels per class and greater emphasis on classes with more pixels is desired, Equation (17), FW IoU, is used; it weights the Class IoU_i of each class by its pixel frequency. Because of the class imbalance in the CamVid and KITTI databases used in this study, we additionally employed this metric. Lastly, Equation (18), mIoU, represents the average of Class IoU_i across all classes, calculated by summing the Class IoU_i values and dividing by the number of classes.
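Equations (14)–(18) can all be computed from a single (K + 1) × (K + 1) confusion matrix; the sketch below assumes rows correspond to ground truth classes and columns to predicted classes.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Compute PA, mPA, per-class IoU, FW IoU, and mIoU from a confusion matrix
    whose rows are ground-truth classes and columns are predicted classes."""
    tp = np.diag(conf).astype(float)                      # n_ii
    gt_pixels = conf.sum(axis=1).astype(float)            # c_i
    pred_pixels = conf.sum(axis=0).astype(float)          # pixels predicted as class i
    union = gt_pixels + pred_pixels - tp

    pa = tp.sum() / conf.sum()                            # Eq. (14)
    mpa = np.mean(tp / np.maximum(gt_pixels, 1))          # Eq. (15)
    class_iou = tp / np.maximum(union, 1)                 # Eq. (16)
    fw_iou = (gt_pixels / conf.sum() * class_iou).sum()   # Eq. (17)
    miou = class_iou.mean()                               # Eq. (18)
    return pa, mpa, class_iou, fw_iou, miou

# Toy example with 3 classes.
conf = np.array([[50, 2, 1],
                 [3, 40, 5],
                 [0, 4, 20]])
pa, mpa, class_iou, fw_iou, miou = segmentation_metrics(conf)
print(f"PA={pa:.3f}, mPA={mpa:.3f}, FW IoU={fw_iou:.3f}, mIoU={miou:.3f}")
```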

4.3.2. Testing on CamVid Database

Ablation Studies
In this study, we conducted a total of three types of ablation studies. Table 6 presents the setup for the first ablation study, while Table 7 shows the performance results for each case described in Table 6. In Table 6, VQ represents vector quantization. The results, as seen in Table 6 and Table 7, indicate that Case 6, which incorporates all the methods proposed in this study, achieved the highest semantic segmentation performance. This confirms that the methods proposed in this study enhance semantic segmentation performance for road scene images in environments with motion blur.
We conducted a second ablation experiment on EMG using the method from Case 6 in Table 6 (proposed method). Each case in Table 8 indicates from which of the three layers in Figure 4 and Table 3 the feature map was extracted in EMG. Table 9 presents the semantic segmentation performance corresponding to each case. As seen in Table 8 and Table 9, the EESL, which is the loss using EMG, achieved the best performance when feature maps from all layers of the EMG were used.
Lastly, we conducted an ablation study on the dense block of Figure 2 and Table 1, where features were extracted from the encoder of both the teacher model and the student model during the KD process. Table 10 presents the various cases for the third ablation study, and Table 11 shows the semantic segmentation performance for each case.
As seen in Table 10 and Table 11, we observed that performance improves as deeper encoder blocks of both the teacher model and the student model are used during KD. Additionally, using all encoder blocks yielded better performance compared to using only a single encoder block.
Comparative Experiments Between the Proposed and State-of-the-Art (SOTA) Methods
In this subsection, we compared the proposed KDS-Net with state-of-the-art (SOTA) methods. The SOTA methods were categorized into those that include an image blur restoration process and those that do not, with the proposed method not including a restoration process. In all experiments, for methods that include an image blur restoration process without semantic segmentation, the DeepLabV3-Plus method was used as the semantic segmentation network for performance comparison. Table 12 compares PA, mPA, FW IoU, and mIoU for each method, while Table 13 compares Class IoU for each class within the images. Note that the student in Table 12 corresponds to Case 4 from Table 6 and Table 7. The teacher, on the other hand, is the model responsible for providing knowledge distillation, as explained in Figure 1 and Table 5.
As shown in Table 12, the proposed KDS-Net achieves the highest performance across all metrics, surpassing both the methods that include an image restoration process and those that do not. Additionally, as observed in Table 13, the proposed KDS-Net outperforms in all classes except for the car class. This can be attributed to the fact that SDAN-MD [3] includes a restoration process as part of its segmentation pipeline, which may enhance its performance specifically for the car class compared to our proposed method.
Figure 7 visually presents the semantic segmentation results of the proposed method compared to the SOTA methods. While the segmentation results using other SOTA methods show reasonable performance for classes such as road, car, and sidewalk, the segmentation performance for smaller objects is significantly lower. In contrast, the proposed KDS-Net demonstrates high segmentation performance even for smaller objects such as poles and sign symbols.
Analysis of Feature Maps by Gradient-Weighted Class Activation Mapping (Grad-CAM)
In this subsection, we used Grad-CAM [57] to visualize and compare the feature maps extracted by the proposed KDS-Net and state-of-the-art (SOTA) methods. In Grad-CAM, important features are typically displayed in reddish and yellowish colors. However, this saliency information is represented as a continuous heatmap rather than a discrete mask. As such, it is inherently ambiguous to define a strict threshold that distinguishes important feature regions from less relevant ones. This ambiguity makes it difficult to compute quantitative metrics such as the intersection over union (IoU) between the Grad-CAM activation map and the ground truth. Therefore, we adopt a qualitative interpretation strategy, focusing on whether the model’s attention aligns with semantically meaningful regions rather than enforcing a direct spatial correspondence with the ground truth.
In this study, we compared the feature map extracted from the final output layer of each segmentation network. For restoration models, the restored images were used as input to the segmentation network, and the feature maps were similarly extracted from the final layer for comparison. As shown in Table 13, the performance for the pole and pedestrian classes was the lowest when measuring the performance for each class. Therefore, we extracted and compared the feature maps for the pole and pedestrian classes.
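A hedged sketch of how such class-specific Grad-CAM maps can be obtained from a segmentation network is shown below; the toy network, the hooked layer, and the chosen class index are placeholders for the actual models and classes compared in Figure 8.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy segmentation network standing in for the models compared in Figure 8.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 12, 3, padding=1))
target_layer = model[0]                       # layer whose activations are visualized (assumed)

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

image = torch.rand(1, 3, 240, 320)
logits = model(image)                         # (1, num_classes, H, W)
class_id = 5                                  # hypothetical index of, e.g., the pole class
logits[:, class_id].sum().backward()          # gradient of the class score map

weights = grads["v"].mean(dim=(2, 3), keepdim=True)           # GAP of gradients per channel
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))  # weighted activation sum
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalized heatmap in [0, 1]
```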
Figure 8 shows the feature maps extracted for the pole class (a) and the pedestrian class (b), with comparisons across the different methods. The results, as illustrated in Figure 8, demonstrate that the feature maps extracted by the proposed KDS-Net are more similar to the ground truth than those extracted by other networks. This confirms that the KDS-Net improves semantic segmentation accuracy in motion-blurred environments compared to the state-of-the-art (SOTA) methods.

4.3.3. Testing on KITTI Database

Comparative Experiments Between the Proposed and SOTA Methods
In this subsection, we compared the semantic segmentation performance of the proposed KDS-Net with other state-of-the-art (SOTA) methods using another road scene database, KITTI. Table 14 compares PA, mPA, FWIoU, and mIoU, while Table 15 compares the Class IoU for all classes. From Table 14, it can be seen that the proposed KDS-Net achieves the highest semantic segmentation accuracies compared to the SOTA methods. Additionally, as indicated in Table 15, although our proposed method does not achieve the best performance for all classes, it demonstrates higher accuracy for most classes on the KITTI database. Figure 9 visually presents the semantic segmentation results of the proposed method and the SOTA methods. As observed in Figure 9, the proposed KDS-Net produces results that are most similar to the ground truth when tested on the KITTI database, outperforming the other methods.
Analysis of Feature Maps by Grad-CAM
We used the KITTI database to extract and analyze the feature maps of the proposed KDS-Net and state-of-the-art (SOTA) methods using Grad-CAM. As previously mentioned, important features in Grad-CAM are typically displayed in reddish and yellowish colors. We conducted experiments on the two classes with the lowest performance as indicated in Table 15, and extracted feature maps from the final layer of the semantic segmentation networks, as described in the section Analysis of Feature Maps by Gradient-Weighted Class Activation Mapping (Grad-CAM). Figure 10a shows the Grad-CAM image for the pedestrian class, and Figure 10b shows the Grad-CAM image for the bicyclist class. As shown in Figure 10, when comparing the Grad-CAM images of the proposed KDS-Net with other SOTA methods, it is evident that the other methods do not properly focus on the relevant class regions. In contrast, the KDS-Net effectively focuses on the relevant class regions, producing results that most closely resemble the ground truth. This confirms that the proposed KDS-Net improves semantic segmentation performance for motion-blurred images more effectively than the SOTA methods.
Comparisons of Inference Time and Computational Cost
In this subsection, we compared the inference time per image of the proposed KDS-Net with state-of-the-art (SOTA) methods. The inference time was measured on both a desktop computer and a Jetson TX2 embedded system. As shown in Figure 11, the Jetson TX2 features an NVIDIA Pascal™ family GPU (256 CUDA cores), 8 GB of memory shared between the CPU and GPU, and a memory bandwidth of 59.7 GB/s. Additionally, it uses less than 7.5 watts of power. The specifications of the desktop computer are described in Section 4.1. The reason for measuring on the Jetson TX2 embedded system is that road scene segmentation is often performed using a camera mounted on a vehicle. In such cases, the algorithm operates as edge computing within the vehicle’s embedded system for edge intelligence-empowered internet of vehicles. Therefore, to verify whether the proposed system can operate on an embedded system, we conducted experiments on the Jetson TX2 embedded system and compared the results with SOTA methods, as shown in Table 16. For the methods using restoration models, the inference time was measured by summing the inference time of the restoration process and the segmentation process.
The inference time per image measured with the proposed method was 6.29 ms (milliseconds) on the desktop computer and 37.18 ms on the Jetson embedded system, corresponding to processing speeds of approximately 159 frames per second (fps) and 31.1 fps, respectively. This confirms that the proposed method can operate on an embedded system with limited computing resources at real-time speed. Additionally, as shown in Table 16, the processing time of the proposed method was the lowest on the desktop computer and the third lowest on the Jetson embedded system compared to the SOTA methods.
In Table 17, we compare the number of parameters, GPU memory requirements, and floating-point operations (FLOPs) of the proposed KDS-Net and the state-of-the-art (SOTA) methods. Although the proposed KDS-Net does not have the lowest computational cost, it achieves the highest semantic segmentation accuracy, as demonstrated in Table 12, Table 13, Table 14 and Table 15 and Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11, aligning with the primary objective of this study.

5. Discussions

5.1. Statistical Analysis

We conducted a t-test to examine the statistical significance of the performance difference between the proposed KDS-Net and the second-best method (SDAN-MD), as shown in Table 12. Additionally, we evaluated the practical significance using Cohen’s d-value [58] as the effect size measure. A Cohen’s d-value close to 0.2 indicates a small effect size, close to 0.5 indicates a medium effect size, and close to 0.8 indicates a large effect size. A large effect size suggests that the difference between the two groups is practically significant.
As shown in Figure 12, the p-values for PA, mPA, FW IoU, and mIoU are 0.14 × 10^−1, 0.09 × 10^−1, 0.05 × 10^−1, and 0.005 × 10^−1, respectively. This indicates that the performance difference between our proposed KDS-Net and the second-best method is statistically significant at a 95% confidence level for PA and at a 99% confidence level for all the remaining evaluation metrics. Additionally, the Cohen’s d-values measured between the proposed method and the second-best method are 16.06, 24.83, 41.15, and 400.92, respectively, all indicating a large effect size. These results confirm that our proposed method demonstrates significantly higher accuracy compared to the second-best method.
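The significance test and effect size reported above can be reproduced with SciPy and NumPy as sketched below; the two score arrays are placeholders for the per-trial metric values of KDS-Net and the second-best method, not the actual experimental data.

```python
import numpy as np
from scipy import stats

# Placeholder per-trial mIoU scores for the two methods being compared.
kds_net = np.array([72.40, 72.45, 72.41])
sdan_md = np.array([70.10, 70.15, 70.12])

# Two-sample t-test for the difference in means.
t_stat, p_value = stats.ttest_ind(kds_net, sdan_md)

# Cohen's d with the pooled standard deviation as the effect-size measure.
n1, n2 = len(kds_net), len(sdan_md)
pooled_std = np.sqrt(((n1 - 1) * kds_net.var(ddof=1) +
                      (n2 - 1) * sdan_md.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (kds_net.mean() - sdan_md.mean()) / pooled_std
print(f"p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")
```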

5.2. Analysis of Error Cases by Proposed Method

In this subsection, we analyzed the error cases of the proposed method. One of the error cases identified in our proposed method is the issue of class imbalance, which is particularly caused by the objects that occupy a small portion of the image, such as poles and pedestrians. These classes play a crucial role in various applications, including autonomous driving, but they appear less frequently in the training dataset compared to larger classes such as roads and buildings.
Due to this class imbalance, as shown in Figure 13, our proposed KDS-Net sometimes fails to accurately segment small objects such as poles and pedestrians. Semantic segmentation involves pixel-level classification, and IoU measures how well the predicted pixels overlap with the ground truth pixels. Therefore, for classes with a small number of pixels, the network finds it challenging to learn fine details and achieve high segmentation accuracy. Conversely, classes with a larger number of pixels are easier for the network to learn and segment accurately.
This issue, caused by class imbalance within the dataset, can lead to significant safety concerns in real-world applications such as with autonomous vehicles. Consequently, future research should focus on exploring various approaches to address this problem, thereby improving model accuracy and ensuring safety in real-world environments.

5.3. FD Analysis for Class-Wise Segmentation Quality

The FD analysis was performed on the binary masks corresponding to a specific semantic class (e.g., car, bicyclist), extracted from both the ground truth and the KDS-Net predictions. To compute the FD score, the box-counting method described in Algorithm 1 of Section 3.4 was applied.
Figure 14 illustrates the FD analysis for four different semantic classes. For each class, we extract binary masks from both the ground truth annotations and the predictions made by KDS-Net, and present the corresponding FD estimation plots for each case. These plots report the FD score, the correlation coefficient (C) between log(1/λ) and log(N(λ)), and the coefficient of determination (R²) for the fitted regression line. The FD scores serve as a quantitative indicator of the structural complexity of each class-specific shape within the segmentation mask. In the case of the car class, as shown in Figure 14a, the FD value for the ground truth is 1.41403, while the value estimated from the mask predicted by KDS-Net is 1.41657. Both cases show strong C values, 0.99109 for the ground truth and 0.99102 for the prediction, and the corresponding R² values are 0.98226 and 0.98211, respectively. These results indicate that the predicted shape closely mirrors the geometric characteristics of the ground truth annotation. In Figure 14b, the bicyclist class, typically associated with more irregular and narrow structures, shows a ground truth FD of 1.32727 and a predicted FD of 1.33675. Despite the potential complexity of this category, the C values remain strong, 0.98632 for the ground truth and 0.98695 for the prediction. Similarly, the R² values are 0.97282 and 0.97406, demonstrating a reliable and well-fitting log-log regression in both cases. For the sidewalk class depicted in Figure 14c, the ground truth FD is 1.40099 and the predicted FD is 1.39811. The C values are 0.99118 for the ground truth and 0.99133 for the prediction, with corresponding R² values of 0.98244 and 0.98273. This indicates a strong alignment between the structural properties of the two masks. In Figure 14d, the sign symbol class yields a ground truth FD of 1.47055 and a predicted FD of 1.47309. The C values are 0.99438 and 0.99556 for the ground truth and prediction, respectively, while the R² values are 0.98880 and 0.99115, respectively. These consistently high metrics demonstrate that KDS-Net accurately preserves fine-grained details even in small, intricate object categories. Taken together, these results confirm that the proposed method exhibits high structural fidelity across diverse semantic categories. The minimal differences in FD values, along with the strong correlation and regression performance, support the method’s effectiveness in preserving the geometric traits of segmented objects.
Although FD estimation could, in principle, be integrated into the training objective as a regularizer, such integration entails several practical limitations. To compute the FD during training, class-wise binary masks must be generated for each semantic category. Given that our datasets contain 12 classes, the FD would have to be estimated individually for each class at every training iteration, which substantially increases computational cost and implementation complexity.
Furthermore, adding FD-based loss terms on a per-class basis can lead to optimization imbalance, because the network may become biased toward classes with more geometrically complex or spatially dominant structures; this diverges from conventional loss formulations that treat all classes jointly and uniformly. Based on these considerations, we use FD estimation only in a post-hoc manner, as a structural similarity metric that evaluates how well the predicted mask matches the ground truth mask in terms of complexity and geometric detail, as shown in Figure 14 and Table 18.
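For reference, the snippet below sketches how per-class FD, C, and R2 values of the kind reported in Figure 14 and Table 18 can be obtained from a binary class mask using box counting. It follows the general idea of Algorithm 1 in Section 3.4, but the box-size schedule and helper names are assumptions of this illustration rather than the exact implementation used in our experiments.

```python
import numpy as np

def box_counting_fd(mask, box_sizes=(2, 4, 8, 16, 32, 64)):
    """Estimate the fractal dimension of a 2-D binary mask by box counting.

    Returns (fd, corr, r2): fd is the slope of the regression of log N(lambda)
    on log(1/lambda), corr is the Pearson correlation coefficient (C) of the two
    log series, and r2 is the coefficient of determination of the fitted line.
    """
    h, w = mask.shape
    log_inv_size, log_counts = [], []
    for s in box_sizes:
        # Pad so the mask tiles evenly into s x s boxes.
        ph, pw = (-h) % s, (-w) % s
        padded = np.pad(mask, ((0, ph), (0, pw)))
        blocks = padded.reshape(padded.shape[0] // s, s, padded.shape[1] // s, s)
        # A box is counted if it contains at least one foreground pixel.
        n_boxes = np.count_nonzero(blocks.any(axis=(1, 3)))
        if n_boxes > 0:
            log_inv_size.append(np.log(1.0 / s))
            log_counts.append(np.log(n_boxes))
    x, y = np.array(log_inv_size), np.array(log_counts)
    slope, intercept = np.polyfit(x, y, 1)   # FD is the slope of the log-log fit
    corr = np.corrcoef(x, y)[0, 1]           # correlation coefficient C
    y_hat = slope * x + intercept
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, corr, r2

# Usage: compare ground-truth and predicted masks of one class (CAR_ID is a placeholder).
# fd_gt, c_gt, r2_gt = box_counting_fd(gt_mask == CAR_ID)
# fd_pr, c_pr, r2_pr = box_counting_fd(pred_mask == CAR_ID)
```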

5.4. Comparisons of the Proposed Method with SOTA Motion Deblurring Studies

To evaluate the effectiveness of our proposed method under motion blur conditions, we conducted a comprehensive comparison using the CamVid and KITTI datasets. As shown in Table 19 and Table 20, our method (KDS-Net) achieves superior segmentation performance across all metrics (PA, mPA, FW IoU, and mIoU) compared to the state-of-the-art motion deblurring methods. This demonstrates the robustness and generalizability of our approach in motion-degraded scenarios.
Furthermore, Table 21 presents a comparison of model complexity, including the number of parameters, GPU memory consumption, and FLOPs. Although KDS-Net does not record the lowest FLOPs among all methods, it shows a clear advantage in efficiency, having the smallest number of parameters and requiring the least GPU memory. These results confirm that our method achieves a favorable trade-off between performance and computational cost, making it suitable for deployment in resource-constrained environments.
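As a rough guide to how complexity figures of the kind listed in Table 21 can be measured, the sketch below counts trainable parameters and records peak GPU memory for a single forward pass in PyTorch. The model and input size are placeholders, FLOPs require a separate profiler and are omitted, and the exact measurement protocol behind Table 21 is not reproduced here.

```python
import torch

def profile_model(model, input_size=(1, 3, 224, 224), device="cuda"):
    """Count trainable parameters (in millions) and peak GPU memory (in MB)
    for a single forward pass. Assumes a CUDA device is available."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(torch.randn(*input_size, device=device))
    peak_mb = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
    return params_m, peak_mb

# Example with a hypothetical model object:
# print(profile_model(my_segmentation_net))  # e.g., (21.44, ...)
```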

6. Conclusions

In this study, we addressed the performance degradation of road scene semantic segmentation caused by motion blur resulting from camera shake or object movement. The proposed KDS-Net is a semantic segmentation model optimized for motion-blurred road scene images, which uses knowledge distillation (KD) to transfer information from a teacher model to a student model. Additionally, we designed an edge mask generator (EMG) that creates edge masks from the ground truth segmentation masks; these masks were used for the edge-enhanced segmentation loss (EESL) during the training of KDS-Net.
Based on these methods, KDS-Net extracts features with diverse receptive fields through the semantic context module (SCM) and feature aggregation fusion module (FAFM) in the encoder and passes them to the decoder to produce the segmentation output. Ultimately, KDS-Net outputs the segmentation results without requiring a restoration model during inference. The ablation studies demonstrated that the SCM, FAFM, EMG, KD, and variational quantization (VQ) each enhance the semantic segmentation performance on motion-blurred images. Furthermore, comparative experiments on two road scene databases showed that the proposed KDS-Net excels at recovering the detailed features lost to motion blur, outperforming various state-of-the-art (SOTA) models, as also confirmed by Grad-CAM comparisons. Additionally, t-tests and Cohen's d values indicated that the proposed method achieves significantly higher accuracy than the second-best method. To analyze the shape-level consistency of the segmentation results under motion blur, we further incorporated fractal dimension estimation (FDE) to evaluate the complexity and irregularity of class-specific regions, thereby providing a structural measure of segmentation correctness within the proposed KD framework.
Our lightweight KDS-Net reduces the number of parameters to 21.44 million, enabling real-time operation on embedded systems for edge computing in real-world vehicle applications, as shown in the experiments in Section 5.2. From these experiments, we confirm that KDS-Net can be applied to the edge intelligence-empowered internet of vehicles, eliminating the data privacy concerns and communication overheads that arise when large volumes of images from the vehicle's frontal-viewing camera must be transmitted to, and segmentation results received from, a high-performance cloud over 5G.
However, as shown in Figure 13, the proposed method still produces segmentation errors caused by class imbalance. Future research will therefore focus on reducing these errors by exploring new data augmentation techniques and learning strategies that correct class imbalance. Moreover, we plan to study methods that minimize computational cost while limiting performance degradation as much as possible, and to address other types of image degradation in road scene segmentation.

Author Contributions

Methodology, S.I.J.; conceptualization, M.S.J.; supervision, K.R.P.; writing—original draft, S.I.J.; writing—review and editing, K.R.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Ministry of Science and ICT (MSIT), Korea, through the Information Technology Research Center (ITRC) Support Program under Grant IITP-2025-RS-2020-II201789, and in part by the Artificial Intelligence Convergence Innovation Human Resources Development Supervised by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under Grant IITP-2025-RS-2023-00254592.

Data Availability Statement

Data are disclosed via GitHub (https://github.com/JSI5668/KDS-Net.git (accessed on 13 July 2025)).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gong, L.; Zhang, Y.; Zhang, Y.; Yang, Y.; Xu, W. Erroneous pixel prediction for semantic image segmentation. Comput. Vis. Media 2022, 8, 165–175. [Google Scholar] [CrossRef]
  2. Wang, Y.; Li, Y.; Elder, J.H.; Wu, R.; Lu, H. Class-conditional domain adaptation for semantic segmentation. Comput. Vis. Media 2024, 10, 1013–1030. [Google Scholar] [CrossRef]
  3. Jeong, S.I.; Jeong, M.S.; Kang, S.J.; Ryu, K.B.; Park, K.R. SDAN-MD: Supervised dual attention network for multi-stage motion deblurring in frontal-viewing vehicle-camera images. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101556. [Google Scholar] [CrossRef]
  4. Franchi, G.; Belkhir, N.; Ha, M.L.; Hu, Y.; Bursuc, A.; Blanz, V.; Yao, A. Robust semantic segmentation with superpixel-mix. arXiv 2021, arXiv:2108.00968. [Google Scholar]
  5. Kamann, C.; Rother, C. Increasing the robustness of semantic segmentation models with painting-by-numbers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 369–387. [Google Scholar]
  6. Guo, D.; Pei, Y.; Zheng, K.; Yu, H.; Lu, Y.; Wang, S. Degraded image semantic segmentation with dense-gram networks. IEEE Trans. Image Process. 2019, 29, 782–795. [Google Scholar] [CrossRef] [PubMed]
  7. Vertens, J.; Valada, A.; Burgard, W. SMSnet: Semantic motion segmentation using deep convolutional neural networks. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 582–589. [Google Scholar]
  8. Yu, Z.; Huang, R.; Byeon, W.; Liu, S.; Liu, G.; Breuel, T.; Anandkumar, A.; Kautz, J. Coupled segmentation and edge learning via dynamic graph propagation. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 4919–4932. [Google Scholar]
  9. Zhang, J.; Yang, K.; Stiefelhagen, R. ISSAFE: Improving semantic segmentation in accidents by fusing event-based data. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1132–1139. [Google Scholar]
  10. Rahman, Q.M.; Sunderhauf, N.; Corke, P.; Dayoub, F. FSNet: A failure detection framework for semantic segmentation. IEEE Robot. Autom. Lett. 2022, 7, 3030–3037. [Google Scholar] [CrossRef]
  11. Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. ICNet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 405–420. [Google Scholar]
  12. Li, G.; Yun, I.; Kim, J.; Kim, J. DABNet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. arXiv 2019, arXiv:1907.11357. [Google Scholar]
  13. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef] [PubMed]
  14. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  15. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  16. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
  17. Fan, J.; Li, C.; Liu, X.; Song, M.; Yao, A. Augmentation-Free Dense Contrastive Knowledge Distillation for Efficient Semantic Segmentation. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; pp. 1–13. [Google Scholar]
  18. Mansourian, A.M.; Jalali, A.; Ahmadi, R.; Kasaei, S. Attention-guided Feature Distillation for Semantic Segmentation. arXiv 2025, arXiv:2403.05451v3. [Google Scholar]
  19. Liu, L.; Wang, Z.; Phan, M.H.; Zhang, B.; Ge, J.; Liu, Y. BPKD: Boundary Privileged Knowledge Distillation for Semantic Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1062–1072. [Google Scholar]
  20. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  21. Yang, C.; Yu, X.; An, Z.; Xu, Y. Categories of response-based, feature-based, and relation-based knowledge distillation. In Advancements in Knowledge Distillation: Towards New Horizons of Intelligent Systems; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–32. [Google Scholar]
  22. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
  23. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711. [Google Scholar]
  24. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  25. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  26. Brouty, X.; Garcin, M. Fractal properties, information theory, and market efficiency. Chaos Solitons Fractals 2024, 180, 114543. [Google Scholar] [CrossRef]
  27. Yin, J. Dynamical fractal: Theory and case study. Chaos Solitons Fractals 2023, 176, 114190. [Google Scholar] [CrossRef]
  28. Brostow, G.J.; Shotton, J.; Fauqueur, J.; Cipolla, R. Segmentation and recognition using structure from motion point clouds. In Proceedings of the European Conference on Computer Vision, Marseille, France, 12–18 October 2008; pp. 44–57. [Google Scholar]
  29. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  30. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 20–25 August 1995. [Google Scholar]
  31. Nah, S.; Kim, T.H.; Lee, K.M. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3883–3891. [Google Scholar]
  32. Shen, Z.; Wang, W.; Lu, X.; Shen, J.; Ling, H.; Xu, T.; Shao, L. Human aware motion deblurring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5572–5581. [Google Scholar]
  33. Rim, J.; Lee, H.; Won, J.; Cho, S. Real-world blur dataset for learning and benchmarking deblurring algorithms. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 184–201. [Google Scholar]
  34. Kupyn, O.; Budzan, V.; Mykhailych, M.; Mishkin, D.; Matas, J. DeblurGAN: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8183–8192. [Google Scholar]
  35. Plompen, A.J.; Cabellos, O.; De Saint Jean, C.; Fleming, M.; Algora, A.; Angelone, M.; Archier, P.; Bauge, E.; Bersillon, O.; Blokhin, A. The joint evaluated fission and fusion nuclear data library, JEFF-3.3. Eur. Phys. J. A 2020, 56, 181. [Google Scholar] [CrossRef]
  36. NVIDIA. NVIDIA GeForce RTX 30 Series Graphics Cards. 2024. Available online: https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/ (accessed on 13 July 2025).
  37. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  39. Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  40. Mongkhonthanaphon, S.; Limpiyakorn, Y. Classification of titanium microstructure with fully convolutional neural networks. J. Phys. Conf. Ser. 2019, 1195, 012022. [Google Scholar] [CrossRef]
  41. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14821–14831. [Google Scholar]
  42. Chen, L.; Lu, X.; Zhang, J.; Chu, X.; Chen, C. Hinet: Half instance normalization network for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 182–192. [Google Scholar]
  43. Cho, S.J.; Ji, S.W.; Hong, J.P.; Jung, S.W.; Ko, S.J. Rethinking coarse-to-fine approach in single image deblurring. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4641–4650. [Google Scholar]
  44. Mao, X.; Liu, Y.; Liu, F.; Li, Q.; Shen, W.; Wang, Y. Intriguing findings of frequency selection for image deblurring. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1905–1913. [Google Scholar]
  45. Xu, Y.; Zhu, Y.; Quan, Y.; Ji, H. Attentive deep network for blind motion deblurring on dynamic scenes. Comput. Vis. Image Underst. 2021, 205, 103169. [Google Scholar] [CrossRef]
  46. Gao, H.; Zhang, Y.; Yang, J.; Dang, D. Mixed hierarchy network for image restoration. Pattern Recognit. 2024, 161, 111313. [Google Scholar] [CrossRef]
  47. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  48. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
  49. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4003–4012. [Google Scholar]
  50. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  51. Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv 2021, arXiv:2101.06085. [Google Scholar]
  52. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550. [Google Scholar]
  53. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; pp. 1–13. [Google Scholar]
  54. Tung, F.; Mori, G. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1365–1374. [Google Scholar]
  55. Chen, P.; Liu, S.; Zhao, H.; Jia, J. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5008–5017. [Google Scholar]
  56. Chen, D.; Mei, J.P.; Zhang, H.; Wang, C.; Feng, Y.; Chen, C. Knowledge Distillation with the Reused Teacher Classifier. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11933–11942. [Google Scholar]
  57. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  58. Cohen, J. A power primer. In Methodological Issues and Strategies in Clinical Research, 4th ed.; Kazdin, A.E., Ed.; American Psychological Association: Washington, DC, USA, 2016; pp. 279–284. [Google Scholar]
Figure 1. Overall procedure of the proposed method.
Figure 2. Overall architecture of the proposed KDS-Net.
Figure 3. The structure of the SCM and FAFM in the KDS-Net. The top image represents the SCM, and the bottom image represents the FAFM. GAP and GMP represent global average pooling and global max pooling, respectively.
Figure 4. The structure of the proposed EMG.
Figure 5. Examples of original images (left) and corresponding motion-blurred images (right) from (a) CamVid and (b) KITTI databases.
Figure 6. Graphs of training accuracy (red) and loss (purple) and validation accuracy (blue) and loss (orange) of the KDS-Net with (a) CamVid and (b) KITTI databases.
Figure 7. Examples of semantic segmentation result images obtained by the proposed and SOTA methods. (a) Example 1 and (b) Example 2. The input images are motion-blurred images.
Figure 8. Comparisons of Grad-CAM by proposed method with those by SOTA methods in case of (a) pole and (b) pedestrian classes.
Figure 9. Examples of semantic segmentation result images obtained by the proposed and SOTA methods. (a) Example 1 and (b) Example 2.
Figure 10. Comparisons of Grad-CAM by proposed method with those by SOTA methods in case of (a) pedestrian and (b) bicyclist on the KITTI database.
Figure 11. Jetson TX2 embedded system.
Figure 12. T-test results of the segmentation accuracy achieved by our proposed method (KDS-Net) and the second-best method (SDAN-MD). (a) PA, (b) mPA, (c) FW IoU, and (d) mIoU.
Figure 13. Examples of error cases by proposed KDS-Net. (a) CamVid, and (b) KITTI.
Figure 14. FD estimation results for four semantic classes. (ad) represent the classes of car, bicyclist, sidewalk, and sign symbol, respectively. In each subfigure, the leftmost image shows the input RGB image, the first and second columns of middle images present the binary masks of the ground truth and KDS-Net prediction, respectively, and the corresponding log-log plots for each are shown on the right. The input is a motion-blurred image.
Table 1. Description of the KDS-Net encoder. The modules marked with an asterisk (*) in Stages 3 and 4 utilize identical filter sizes (excluding the number of input channels) and stride as Stage 2, with only the number of filters doubled. DB refers to the dense block illustrated in Figure 2.
StageLayerFilter
(Number of Filters, Size)
StridePaddingOutput
1Input---224 × 224 × 3
DB (X0,0)64, 7 × 7 × 323 × 3112 × 112 × 64
DB (X1,0)64, 1 × 1 × 64
64, 3 × 3 × 64
256, 1 × 1 × 64
21 × 156 × 56 × 256
2SCM32, 3 × 3 × 3
64, 1 × 1 × 32
64, 3 × 3 × 64
125, 1 × 1 × 64
128, 1 × 1 × 128
256, 3 × 3 × 128
256, 3 × 3 × 256
11 × 156 × 56 × 256
FAFM256, 1 × 1 × 512
16, 1 × 1 × 16
256, 1 × 1 × 256
11 × 156 × 56 × 256
DB (X2,0)128, 1 × 1 × 512
128, 3 × 3 × 128
512, 1 × 1 × 128
128, 1 × 1 × 512
128, 3 × 3 × 128
11 × 128 × 28 × 512
3SCM *--1 × 128 × 28 × 512
FAFM *---28 × 28 × 512
DB (X3,0)--1 × 114 × 14 × 1024
4SCM *--1 × 114 × 14 × 1024
FAFM *---14 × 14 × 1024
DB (X4,0)--1 × 17 × 7 × 2048
Table 2. Description of the proposed KDS-Net decoder (Filters with stride and padding below all indicate the filters used in DB). DB refers to the dense block illustrated in Figure 2.
LayerFilter
(Number of Filters, Size)
Output
Upsample (X1,0)-112 × 112 × 256
Concatenate (X0,0, X1,0)-112 × 112 × 320
DB (X0,1)64, 3 × 3 × 320
64, 3 × 3 × 64
112 × 112 × 64
Upsample (X2,0)-56 × 56 × 512
Concatenate (X1,0, X2,0)-56 × 56 × 768
DB (X1,1)256, 3 × 3 × 768
256, 3 × 3 × 256
56 × 56 × 256
Upsample (X3,0)-28 × 28 × 1024
Concatenate (X2,0, X3,0)-28 × 28 × 1536
DB (X2,1)512, 3 × 3 × 1536
512, 3 × 3 × 512
28 × 28 × 512
Upsample (X4,0)-14 × 14 × 2048
Concatenate (X3,0, X4,0)-14 × 14 × 3072
DB (X3,1)256, 3 × 3 × 3072
256, 3 × 3 × 256
14 × 14 × 256
Upsample (X1,1)-112 × 112 × 256
Concatenate (X0,0, X0,1, X1,1)-112 × 112 × 384
DB (X0,2)64, 3 × 3 × 384
64, 3 × 3 × 64
112 × 112 × 64
Upsample (X2,1)-56 × 56 × 512
Concatenate (X1,0, X1,1, X2,1)-56 × 56 × 1024
DB (X1,2)256, 3 × 3 × 1024
256, 3 × 3 × 256
56 × 56 × 256
Upsample (X3,1)-28 × 28 × 256
Concatenate (X2,0, X2,1, X3,1)-28 × 28 × 1280
DB (X2,2)128, 3 × 3 × 1280
128, 3 × 3 × 128
28 × 28 × 128
Upsample (X1,2)-112 × 112 × 256
Concatenate (X0,0, X0,1, X0,2, X1,2)-112 × 112 × 448
DB (X0,3)64, 3 × 3 × 448
64, 3 × 3 × 64
112 × 112 × 64
Upsample (X2,2)-56 × 56 × 128
Concatenate (X1,0, X1,1, X1,2, X2,2)-56 × 56 × 896
DB (X1,3)64, 3 × 3 × 896
64, 3 × 3 × 64
56 × 56 × 64
Upsample (X1,3)-112 × 112 × 64
Concatenate (X0,0, X0,1, X0,2, X0,3, X1,3)-56 × 56 × 896
DB (X0,4)32, 3 × 3 × 320
32, 3 × 3 × 32
112 × 112 × 32
DB (X0,5)16, 3 × 3 × 32
16, 3 × 3 × 16
12, 3 × 3 × 16
224 × 224 × 12
Table 3. Description of the proposed EMG.
LayerFilter
(Number of Filter, Size, Stride)
PaddingInput SizeOutput Size
Input layer--224 × 224 × 1224 × 224 × 1
1st Conv layer16, 3 × 3 × 1, 11 × 1224 × 224 × 1224 × 224 × 16
2nd Conv layer32, 3 × 3 × 16, 11 × 1224 × 224 × 16224 × 224 × 32
3rd Conv layer64, 3 × 3 × 32, 11 × 1224 × 224 × 32224 × 224 × 64
4th Conv layer2, 1 × 1 × 64, 11 × 1224 × 224 × 64224 × 224 × 2
Table 4. Summary of experimental databases containing motion-blurred images. # indicates the number of images.
Database | Subset | # Images | Size | Class Names (Ratio %)
CamVid | Subset A | 351 | 320 × 240 | Sky (15.81), Building (24.10), Pole (0.99), Road (29.33), Sidewalk (6.69), Tree (11.19), Sign symbol (1.08), Fence (1.43), Car (4.64), Pedestrian (0.64), Bicyclist (0.53), Unlabeled (3.57)
CamVid | Subset B | 350 | 320 × 240 | Same as above
KITTI | Subset A | 223 | 512 × 176 | Sky (5.89), Building (20.91), Road (17.01), Sidewalk (7.22), Fence (3.14), Tree (34.58), Pole (0.51), Car (7.34), Sign symbol (0.36), Pedestrian (0.07), Bicyclist (0.12), Unlabeled (2.85)
KITTI | Subset B | 222 | 512 × 176 | Same as above
Table 5. Training time per epoch for different datasets and models (unit: seconds).
Database | Model | Training Time per Epoch
CamVid | Teacher model | 82.98
CamVid | Student model | 57.32
KITTI | Teacher model | 49.68
KITTI | Student model | 32.51
Table 6. Cases for the first ablation study according to the usage of proposed modules in KDS-Net. The check mark (✓) indicates that the module is included.
MethodSCMFAFMEMGKDVQ
Case 1
Case 2
Case 3
Case 4
Case 5
Case 6 (proposed)
Table 7. Performance comparisons for the first ablation study (unit: %).
Method | PA | mPA | FW IoU | mIoU
Case 1 | 90.72 | 66.27 | 83.54 | 59.19
Case 2 | 91.10 | 69.93 | 84.24 | 61.25
Case 3 | 91.62 | 72.78 | 85.18 | 63.86
Case 4 | 92.37 | 75.41 | 86.40 | 66.62
Case 5 | 93.56 | 79.13 | 88.33 | 71.66
Case 6 (proposed) | 93.69 | 79.80 | 88.56 | 72.42
Table 8. Cases for the second ablation study according to the number of layers where feature maps are extracted in EMG. The check mark (✓) indicates that the layer is included.
Method1st Layer2nd Layer3rd Layer
Case 1
Case 2
Case 3
Case 4
Case 5
Case 6 (proposed)
Table 9. Performance comparisons for the second ablation study (unit: %).
Method | PA | mPA | FW IoU | mIoU
Case 1 | 92.88 | 77.78 | 87.24 | 69.65
Case 2 | 93.17 | 77.66 | 87.68 | 70.21
Case 3 | 93.18 | 77.18 | 87.66 | 70.03
Case 4 | 93.30 | 77.71 | 87.87 | 70.04
Case 5 | 93.41 | 79.26 | 88.11 | 71.58
Case 6 (proposed) | 93.69 | 79.80 | 88.56 | 72.42
Table 10. Cases for the third ablation study according to the usage of dense block (DB) of the encoder in KDS-Net. The check mark (✓) indicates that the block is included.
MethodDB (X1,0)DB (X2,0)DB (X3,0)DB (X4,0)
Case 1
Case 2
Case 3
Case 4
Case 5 (proposed)
Table 11. Performance comparisons for the third ablation study (unit: %).
Method | PA | mPA | FW IoU | mIoU
Case 1 | 92.73 | 75.22 | 86.91 | 67.74
Case 2 | 92.91 | 75.68 | 87.22 | 68.58
Case 3 | 92.94 | 76.76 | 87.23 | 69.27
Case 4 | 93.05 | 76.45 | 87.43 | 69.41
Case 5 (proposed) | 93.56 | 79.13 | 88.33 | 71.66
Table 12. Comparison of semantic segmentation accuracies using the proposed and SOTA methods on the CamVid database (unit: %). W restoration refers to “with restoration,” W/O restoration refers to “without restoration,” and W KD refers to “with knowledge distillation.”.
Method | PA | mPA | FW IoU | mIoU
W restoration:
DeblurGAN-V2 [34] | 91.54 | 71.75 | 85.04 | 64.69
MPRNet [41] | 91.01 | 72.10 | 83.90 | 64.67
HINet [42] | 91.54 | 72.85 | 84.83 | 65.47
MIMO-UNet [43] | 90.41 | 69.44 | 82.93 | 61.97
MIMO-UNet-Plus [43] | 90.66 | 70.17 | 83.34 | 62.76
DeepRFT [44] | 92.08 | 74.03 | 85.73 | 66.64
DeepRFT-Plus [44] | 92.00 | 75.85 | 86.98 | 68.62
Attentive deep [45] | 90.91 | 70.47 | 83.77 | 63.28
MHNet [46] | 90.58 | 70.01 | 83.26 | 62.67
SDAN-MD [3] | 92.89 | 76.82 | 87.10 | 69.58
W/O restoration:
PSPNet [47] | 88.98 | 64.56 | 80.65 | 56.04
ICNet [11] | 88.29 | 62.39 | 79.60 | 54.27
DeeplabV3-Plus [14] | 89.45 | 67.18 | 81.51 | 59.25
UperNet [48] | 91.01 | 71.03 | 84.00 | 63.15
Alpha blending [15] | 88.93 | 65.68 | 80.72 | 57.95
SPNet [49] | 92.70 | 75.05 | 86.85 | 67.91
HRNet [16] | 92.63 | 76.27 | 86.72 | 68.81
SegFormer [50] | 92.78 | 75.38 | 86.96 | 67.95
DDRNet [51] | 91.65 | 73.73 | 85.96 | 66.21
W/O restoration and W KD:
Teacher | 94.33 | 80.72 | 89.55 | 74.76
Student | 92.37 | 75.41 | 86.40 | 66.62
Fitnet [52] | 92.30 | 72.31 | 86.14 | 65.85
AT [53] | 91.23 | 69.29 | 84.37 | 63.11
SP [54] | 92.20 | 74.03 | 86.10 | 66.80
ReviewKD [55] | 92.40 | 74.34 | 86.34 | 67.57
SimKD [56] | 92.37 | 74.36 | 86.28 | 67.44
KDS-Net (ours) | 93.69 | 79.80 | 88.56 | 72.42
Table 13. Comparison of semantic segmentation accuracy (Class IoU) using the proposed and SOTA methods according to classes on the CamVid database (unit: %).
MethodSkyBuildingPoleRoadSidewalkTreeSignFenceCarPedestrianBicyclist
W
restoration
DeblurGAN-V2 [34]90.8584.8714.3594.7576.0776.4842.7853.9182.5633.6261.34
MPRNet [41]89.2683.1418.0795.6279.2167.9241.5748.8283.7440.1663.91
HINet [42]89.6184.6417.1795.4778.8672.3742.4051.9983.9738.6065.05
MIMO-UNet [43]89.0382.3215.0694.9076.4668.1038.1141.1481.8134.5660.17
MIMO-UNet-Plus [43]89.3682.6515.7095.1377.4468.5238.9642.8182.3835.8361.54
DeepRFT [44]90.4085.7517.3595.5779.3475.0343.4455.3484.8540.4765.60
DeepRFT-Plus [44]91.1687.0818.5195.9880.9778.0345.8560.6085.9443.1367.55
Attentive deep [45]89.2583.4616.2194.9977.1570.4338.2648.3282.6235.2860.12
MHNet [46]89.0682.7615.6494.8176.3269.5637.5946.2981.6834.6361.02
SDAN-MD [3]90.8887.0920.9996.3482.4576.6448.8660.8486.7745.3769.21
W/O
restoration
PSPNet [47]88.2580.402.6292.7568.8771.4623.3842.5371.5624.2650.32
ICNet [11]88.2478.452.3792.1566.7570.3521.3443.2371.2317.3441.51
DeeplabV3-Plus [14]86.5281.5111.5893.5972.7469.7532.9840.6977.9430.2054.20
UperNet [48]89.1383.729.3194.8376.5674.0834.5056.0078.7434.8658.89
Alpha blending [15]87.9580.0412.0392.7269.8869.7928.4941.0975.2629.7550.48
SPNet [49]90.3986.9215.4696.1882.1478.0642.9365.4284.1939.9965.32
HRNet [16]90.9186.5623.2895.9581.4477.7845.1462.4583.0543.6266.74
SegFormer [50]91.1386.9112.5596.0181.4179.2846.4264.0183.4741.3964.94
DDRNet [51]90.3585.7815.3395.8580.6076.8741.0159.2282.0238.9963.12
W/O
Restoration and
W KD
Teacher91.6089.8425.6497.0886.5382.5662.3473.1786.8453.1473.65
Student86.9586.348.7995.5781.3177.8250.9960.0079.7941.7563.51
Fitnet [52]90.9085.6618.4695.9980.8277.9547.7558.6780.1030.2957.81
AT [53]90.1583.0216.0195.5079.2976.5846.7650.6071.5531.4753.29
SP [54]90.7886.0221.4896.1481.7076.0350.2350.7879.5641.3860.70
ReviewKD [55]90.7586.5824.5695.8681.0377.1851.2053.6679.7642.3760.28
SimKD [56]91.0386.4725.0495.7380.7377.3750.5754.7678.8541.9359.34
KDS-Net (ours)91.5788.4927.4296.7784.6380.7754.0067.2886.4248.3270.94
Table 14. Comparison of semantic segmentation accuracies using the proposed and SOTA methods on the KITTI database (unit: %).
Method | PA | mPA | FW IoU | mIoU
W restoration:
DeblurGAN-V2 [34] | 84.48 | 56.53 | 74.06 | 48.73
MPRNet [41] | 84.56 | 55.50 | 73.66 | 48.33
HINet [42] | 85.59 | 60.70 | 75.29 | 52.29
MIMO-UNet [43] | 84.12 | 53.48 | 72.22 | 46.00
MIMO-UNet-Plus [43] | 84.17 | 54.64 | 73.03 | 47.22
DeepRFT [44] | 85.20 | 59.70 | 74.60 | 50.82
DeepRFT-Plus [44] | 86.86 | 62.05 | 77.19 | 53.82
Attentive deep [45] | 82.87 | 52.32 | 71.16 | 44.70
MHNet [46] | 84.02 | 55.91 | 72.88 | 48.03
SDAN-MD [3] | 87.27 | 63.56 | 77.84 | 55.59
W/O restoration:
PSPNet [47] | 82.75 | 52.15 | 71.51 | 43.58
ICNet [11] | 77.98 | 45.69 | 65.30 | 36.54
DeeplabV3-Plus [14] | 83.39 | 53.57 | 71.99 | 45.77
UperNet [48] | 87.66 | 61.16 | 78.45 | 53.69
Alpha blending [15] | 78.97 | 52.11 | 65.57 | 43.02
SPNet [49] | 88.07 | 59.53 | 79.02 | 52.69
HRNet [16] | 88.07 | 62.37 | 79.33 | 54.31
SegFormer [50] | 87.67 | 55.99 | 77.73 | 48.99
DDRNet [51] | 87.18 | 59.64 | 77.71 | 52.14
W/O restoration and W KD:
Teacher | 93.44 | 66.91 | 87.84 | 62.10
Student | 89.29 | 62.67 | 80.98 | 55.60
Fitnet [52] | 89.37 | 64.06 | 81.32 | 55.96
AT [53] | 89.40 | 62.75 | 81.25 | 55.41
SP [54] | 89.62 | 62.95 | 81.62 | 56.05
ReviewKD [55] | 89.60 | 67.19 | 81.71 | 57.74
SimKD [56] | 89.76 | 62.95 | 81.77 | 56.23
KDS-Net (ours) | 90.10 | 66.61 | 82.36 | 59.29
Table 15. Comparison of semantic segmentation accuracy (Class IoU) using the proposed and SOTA methods according to classes on the KITTI database (unit: %).
MethodSkyBuildingRoadSidewalkFenceTreePoleCarSignPedestrianBicyclist
W
restoration
DeblurGAN-V2 [34]87.1573.2479.0351.6235.9680.547.5070.9223.5816.0910.38
MPRNet [41]85.8071.0280.7653.0428.8979.9615.2772.9224.6414.305.05
HINet [42]83.9472.5582.9357.0336.9080.4515.5876.4327.8929.9511.60
MIMO-UNet [43]84.1270.3979.4950.6423.5479.138.9569.2819.0716.714.70
MIMO-UNet-Plus [43]85.2771.2580.0951.6525.2879.6310.0670.8519.8316.399.14
DeepRFT [44]85.9972.6281.5354.7031.7380.0114.2673.4026.3226.8111.35
DeepRFT-Plus [44]87.2375.8283.1357.1039.5482.4315.2774.1529.6930.6912.42
Attentive deep [45]81.7769.1378.2450.1926.1778.659.8765.3615.1014.772.52
MHNet [46]84.1370.2480.3753.1528.7379.2313.3870.6022.3020.825.38
SDAN-MD [3]87.8675.6684.1059.1339.5382.5318.1182.0831.1035.4215.95
W/O
restoration
PSPNet [47]79.6571.9574.7348.6730.5579.437.6967.8015.082.181.68
ICNet [11]81.1269.6256.7638.7814.5878.281.0355.716.010.000.00
DeeplabV3-Plus [14]80.4369.9979.4652.0329.1379.5810.5064.3319.6313.175.18
UperNet [48]87.6178.1986.1658.5342.6084.7414.7173.4428.5724.4014.17
Alpha blending [15]78.8465.7074.5743.5228.1869.1211.7762.3219.6911.468.09
SPNet [49]87.4578.2385.4560.0243.4484.5312.2076.2325.7612.6213.66
HRNet [16]88.4378.2285.4261.8942.1384.2921.3179.0327.1117.2812.16
SegFormer [50]87.9677.1383.6157.8539.2884.318.4972.2919.585.582.80
DDRNet [51]86.6876.6584.3559.6937.4783.2016.8775.8225.0016.5511.27
W/O
Restoration and
W KD
Teacher91.1286.3694.1581.5370.4989.6416.0988.5256.366.772.04
Student86.5178.6486.9364.5050.4886.2514.0382.1339.6913.568.82
Fitnet [52]87.5179.2287.4765.8848.2686.4513.4881.6540.6012.9912.10
AT [53]88.1079.4486.9765.4147.8686.3512.0882.0438.1610.3912.73
SP [54]88.1379.7887.9766.8848.2286.2218.3782.1839.1211.668.03
ReviewKD [55]88.5179.4587.7666.5049.9986.4222.5082.6042.7516.5312.08
SimKD [56]87.9279.6788.3767.5248.4286.2819.0892.7737.1512.858.54
KDS-Net (ours)89.8181.4587.6366.4250.8786.5323.2584.1438.9925.5117.58
Table 16. Comparison of the inference time per one image (unit: ms) and inference speed (unit: frames per second (fps)) by proposed and SOTA methods on the desktop and embedded system.
Method | Inference Time/Inference Speed (Desktop) | Inference Time/Inference Speed (Jetson Embedded System)
W restoration:
DeblurGAN-V2 [34] | 40.94/24.43 | 198.98/5.03
MPRNet [41] | 47.52/21.04 | 551.79/1.81
HINet [42] | 18.55/53.91 | 92.39/10.82
MIMO-UNet [43] | 21.04/47.53 | 101.18/9.88
MIMO-UNet-Plus [43] | 35.63/28.07 | 308.77/3.24
DeepRFT [44] | 37.62/26.58 | 420.53/2.38
DeepRFT-Plus [44] | 77.87/12.84 | 1400.37/0.71
Attentive deep [45] | 66.65/15.00 | 704.97/1.42
MHNet [46] | 56.01/17.85 | 387.12/2.58
SDAN-MD [3] | 48.68/20.54 | 648.44/1.54
W/O restoration:
PSPNet [47] | 8.45/118.34 | 41.63/24.02
ICNet [11] | 13.48/74.18 | 57.39/17.42
DeeplabV3-Plus [14] | 7.47/133.87 | 41.46/24.12
UperNet [48] | 8.24/121.36 | 42.52/23.52
Alpha blending [15] | 7.45/134.23 | 41.46/24.12
SPNet [49] | 17.93/55.77 | 74.89/13.35
HRNet [16] | 40.61/24.62 | 154.00/6.49
SegFormer [50] | 6.29/158.98 | 31.64/31.61
DDRNet [51] | 6.94/144.09 | 27.01/37.02
Proposed (KDS-Net) | 6.29/158.98 | 32.18/31.08
Table 17. Comparison of number of parameters, GPU memory, FLOPs of KDS-Net, and SOTA methods. # indicates the number of parameters.
Method | # Parameters (Unit: Mega) | GPU Memory Requirement (Unit: Mega Byte) | FLOPs (Unit: Giga)
W restoration:
DeblurGAN-V2 [34] | 44.85 | 678.81 | 102.4
MPRNet [41] | 59.89 | 1306.02 | 1982.4
HINet [42] | 128.43 | 678.81 | 464.16
MIMO-UNet [43] | 46.57 | 1149.11 | 205.52
MIMO-UNet-Plus [43] | 55.87 | 2024.62 | 423.82
DeepRFT [44] | 49.31 | 2009.6 | 248.82
DeepRFT-Plus [44] | 62.73 | 4176.3 | 849.70
Attentive deep network [45] | 46.67 | 7497.51 | 542.73
MHNet [46] | 56.79 | 5276.01 | 157.34
SDAN-MD [3] | 60.09 | 7719.86 | 2363.33
W/O restoration:
PSPNet [47] | 48.76 | 548.73 | 115.57
ICNet [11] | 28.30 | 186.28 | 28.30
DeeplabV3-Plus [14] | 39.76 | 333.51 | 37.38
UperNet [48] | 37.28 | 349.12 | 47.98
Alpha blending [15] | 39.76 | 333.51 | 37.38
SPNet [49] | 60.61 | 895.58 | 155.16
HRNet [16] | 65.85 | 622.52 | 58.72
SegFormer [50] | 3.72 | 105.16 | 4.26
DDRNet [51] | 5.73 | 56.59 | 3.06
Proposed (KDS-Net) | 21.44 | 607.74 | 112.42
Table 18. R2, C and FD values from Figure 14.
Results | R2 | C | FD
Figure 14a, Ground truth | 0.98226 | 0.99109 | 1.41403
Figure 14a, Predict | 0.98211 | 0.99102 | 1.41657
Figure 14b, Ground truth | 0.97282 | 0.98632 | 1.32727
Figure 14b, Predict | 0.97406 | 0.98695 | 1.33675
Figure 14c, Ground truth | 0.98244 | 0.99118 | 1.40099
Figure 14c, Predict | 0.98273 | 0.99133 | 1.39811
Figure 14d, Ground truth | 0.98880 | 0.99438 | 1.47055
Figure 14d, Predict | 0.99115 | 0.99556 | 1.47309
Table 19. Comparison of semantic segmentation accuracies using the proposed and SOTA methods on the CamVid database (unit: %).
Method | PA | mPA | FW IoU | mIoU
DeblurGAN-V2 [34] | 91.54 | 71.75 | 85.04 | 64.69
MPRNet [41] | 91.01 | 72.10 | 83.90 | 64.67
HINet [42] | 91.54 | 72.85 | 84.83 | 65.47
MIMO-UNet [43] | 90.41 | 69.44 | 82.93 | 61.97
MIMO-UNet-Plus [43] | 90.66 | 70.17 | 83.34 | 62.76
DeepRFT [44] | 92.08 | 74.03 | 85.73 | 66.64
DeepRFT-Plus [44] | 92.00 | 75.85 | 86.98 | 68.62
Attentive deep [45] | 90.91 | 70.47 | 83.77 | 63.28
MHNet [46] | 90.58 | 70.01 | 83.26 | 62.67
SDAN-MD [3] | 92.89 | 76.82 | 87.10 | 69.58
KDS-Net (ours) | 93.69 | 79.80 | 88.56 | 72.42
Table 20. Comparison of semantic segmentation accuracies using the proposed method and restoration-based SOTA methods on the KITTI database (unit: %).
Method | PA | mPA | FW IoU | mIoU
DeblurGAN-V2 [34] | 84.48 | 56.53 | 74.06 | 48.73
MPRNet [41] | 84.56 | 55.50 | 73.66 | 48.33
HINet [42] | 85.59 | 60.70 | 75.29 | 52.29
MIMO-UNet [43] | 84.12 | 53.48 | 72.22 | 46.00
MIMO-UNet-Plus [43] | 84.17 | 54.64 | 73.03 | 47.22
DeepRFT [44] | 85.20 | 59.70 | 74.60 | 50.82
DeepRFT-Plus [44] | 86.86 | 62.05 | 77.19 | 53.82
Attentive deep [45] | 82.87 | 52.32 | 71.16 | 44.70
MHNet [46] | 84.02 | 55.91 | 72.88 | 48.03
SDAN-MD [3] | 87.27 | 63.56 | 77.84 | 55.59
KDS-Net (ours) | 90.10 | 66.61 | 82.36 | 59.29
Table 21. Comparison of number of parameters, GPU memory, FLOPs of KDS-Net, and restoration-based SOTA methods.
Method | # Parameters (Unit: Mega) | GPU Memory Requirement (Unit: Mega Byte) | FLOPs (Unit: Giga)
DeblurGAN-V2 [34] | 44.85 | 678.81 | 102.4
MPRNet [41] | 59.89 | 1306.02 | 1982.4
HINet [42] | 128.43 | 678.81 | 464.16
MIMO-UNet [43] | 46.57 | 1149.11 | 205.52
MIMO-UNet-Plus [43] | 55.87 | 2024.62 | 423.82
DeepRFT [44] | 49.31 | 2009.6 | 248.82
DeepRFT-Plus [44] | 62.73 | 4176.3 | 849.70
Attentive deep network [45] | 46.67 | 7497.51 | 542.73
MHNet [46] | 56.79 | 5276.01 | 157.34
SDAN-MD [3] | 60.09 | 7719.86 | 2363.33
Proposed (KDS-Net) | 21.44 | 607.74 | 112.42
