1. Introduction
In smart urban traffic management, illegal parking detection provides important technical support for improving road traffic efficiency and ensuring public safety [1,2]. Illegal parking not only readily causes traffic congestion but can also occupy critical areas such as fire lanes, thereby creating potential safety hazards. Meanwhile, license plates serve as identifiers of vehicle identity, and their detection accuracy is directly related to the reliability of violation determination and law-enforcement evidence collection [3]. Therefore, in practical applications, illegal behavior recognition and license plate information acquisition are not isolated tasks; rather, they constitute a complete closed loop of behavior determination and identity confirmation. This also imposes higher requirements on the accuracy and real-time performance of detection systems [4].
Existing studies have explored illegal parking detection, license plate detection and recognition, and lightweight object detection from different technical perspectives. However, these research lines are often developed separately. Illegal parking detection methods usually focus on vehicle localization, parking-state judgment, or trajectory-based behavior analysis, while license plate recognition methods mainly emphasize plate localization and character recognition under complex imaging conditions. In practical traffic enforcement, however, behavior determination and vehicle identity confirmation are two connected components of the same decision-making process. Treating them as isolated tasks may lead to redundant computation and insufficient coordination between violation detection and identity confirmation.
In addition, edge-side deployment imposes strict constraints on model size, computational cost, and inference latency. Directly combining illegal parking detection and license plate recognition may increase model complexity, while excessive lightweight compression can weaken feature representation, especially for small license plate regions. Therefore, an effective model should not only improve detection and recognition performance but also maintain a favorable accuracy–efficiency trade-off under resource-constrained edge deployment conditions.
Based on the above analysis, this study proposes RKF-YOLO, a lightweight YOLO-based dual-task framework for illegal parking detection and license plate recognition on edge devices. The proposed framework is designed from the perspective of an edge-side traffic enforcement pipeline, where illegal parking detection and vehicle identity confirmation are treated as two connected components rather than isolated tasks. By introducing structural optimization, feature compensation, training-strategy enhancement, and task-oriented localization improvement, the proposed method aims to achieve a better balance among detection accuracy, recognition reliability, model complexity, and deployment efficiency.
The main contributions of this study are summarized as follows:
A unified lightweight framework for traffic enforcement tasks is proposed. Unlike existing studies that usually treat illegal parking detection and license plate recognition as independent tasks, this study formulates them as two connected components of an edge-side traffic enforcement pipeline. A shared lightweight backbone is used to extract common traffic-scene features, while task-oriented detection branches are used to support illegal parking detection and license-plate-oriented localization.
A structure-level accuracy–efficiency optimization strategy is designed for edge deployment. To address the contradiction between computational reduction and feature degradation, re-parameterized convolution is integrated into a Rep-CSP structure, and asymmetric channel reduction is applied in the feature fusion stage. Adaptive group convolution and global attention are further introduced to compensate for the loss of local fine-grained and global contextual features.
A training and localization enhancement strategy is developed for lightweight dual-task detection. A knowledge-transfer-inspired training strategy is adopted to improve the convergence stability of the compressed model, while Focal-CIoU is introduced to enhance the learning of low-IoU license plate samples and improve small-target localization.
A complete experimental and edge-deployment validation is conducted. The proposed method is evaluated on a self-constructed illegal parking dataset and the CCPD license plate dataset and is further deployed on the RK3588 platform. The deployment results show that RKF-YOLO achieves 16.8 ms inference time and 59.5 FPS for illegal parking detection while maintaining 95.1% overall recognition accuracy and 98.4% character accuracy in the license plate recognition pipeline.
2. Related Work
2.1. Illegal Parking Detection in Intelligent Transportation
Illegal parking detection is an important task in intelligent traffic management because it provides behavioral evidence for traffic enforcement. Beyond the parking-management and real-time detection studies introduced in the Introduction, deep-learning-based parking violation detection has also been combined with temporal modeling, feature tracking, aerial perception, and distributed sensing. Sharma et al. used YOLOv8 and tracking algorithms to estimate parking time violations, connecting vehicle detection with parking-duration judgment [5]. Alwafi et al. combined YOLO with good-feature-to-track methods to enhance illegally parked vehicle detection in surveillance scenes [6]. Bin et al. proposed a parking-vehicle detection framework for oblique UAV images, extending parking-related detection to aerial visual scenes [7]. Luan et al. developed a data-driven crowdsensing framework for parking violation detection, showing that parking enforcement can also benefit from distributed sensing sources [8].
These studies demonstrate the feasibility of deep-learning-based parking violation detection. However, most of them focus mainly on behavior detection, vehicle localization, or parking-state judgment. Vehicle identity confirmation, which is usually required in real enforcement scenarios, is often treated as a separate downstream process. Therefore, a unified framework that connects illegal parking detection with license-plate-oriented localization and recognition remains necessary for practical edge-side enforcement systems.
2.2. License Plate Detection and Recognition
License plate detection and recognition provide vehicle identity information for traffic enforcement and intelligent transportation applications. Wang et al. combined an improved YOLOv5s detector with LPRNet to improve license plate recognition in complex scenarios [9]. Zhang and Yu proposed a lightweight license plate recognition method based on YOLOv8, showing the potential of lightweight YOLO models for plate-related tasks [10]. Chung et al. introduced YOLO-SLD with an attention mechanism for license plate detection, indicating that attention-based feature enhancement can improve key-region representation [11]. Agarwal and Bansal explored automatic number plate detection and recognition using YOLO-World, extending plate-related detection with open-vocabulary detection capability [12]. Tao et al. developed a real-time license plate detection and recognition model for unconstrained scenarios, emphasizing robustness under complex imaging conditions [13]. Zhu et al. improved YOLOv8n for license plate detection, further demonstrating the importance of lightweight detection design for plate localization [14]. Satya et al. optimized YOLOv8 for automatic license plate recognition on resource-constrained devices, showing that lightweight detection combined with OCR-based recognition remains an important solution for edge-side license plate processing [15].
Although these methods improve license plate detection or recognition performance, they are generally optimized independently of illegal parking detection. In illegal parking enforcement, the quality of license plate localization directly affects subsequent identity confirmation. Therefore, license plate detection should be considered together with the preceding illegal parking detection task, especially when both tasks are deployed on the same edge device.
2.3. Integrated Parking-License Plate Pipelines
Some studies have begun to connect illegal parking detection with automatic license plate recognition. Araneta et al. developed a real-time illegal parking detection system with automatic license plate number recognition, showing that behavior detection and identity recognition can be combined in one enforcement pipeline [16]. This work is closely related to the application scenario of the present study because it treats illegal parking detection and license plate recognition as connected enforcement tasks.
However, such system-level integration usually focuses more on the application workflow than on the model-level accuracy–efficiency trade-off. In particular, the lightweight design of a unified detector, the balance between vehicle-level detection and small license plate localization, and the deployment constraints of edge devices are not sufficiently addressed. In contrast, RKF-YOLO is designed as a lightweight dual-task framework that connects illegal parking detection with license-plate-oriented localization and recognition within an edge-side traffic enforcement pipeline.
2.4. Lightweight Object Detection and Edge Deployment
Lightweight object detection aims to reduce model complexity while maintaining acceptable accuracy, which is important for edge deployment. Yang et al. investigated model compression for real-time object detection using rigorous gradation pruning, showing that pruning strategies can reduce computational cost [17]. Shen et al. proposed LDDFSF-YOLO11 with lightweight multi-scale feature fusion, demonstrating the role of efficient feature fusion in lightweight detection [18]. Liu et al. proposed LFN-YOLO with a lightweight reparameterized design for small object detection, indicating that reparameterization can improve efficiency while maintaining representation ability [19]. Guo et al. proposed an efficient reparameterized small-object detection transformer for thermal infrared images, showing the potential of efficient structural design in challenging perception tasks [20]. Cao et al. developed LKD-YOLOv8 based on knowledge distillation for infrared object detection, suggesting that training strategies can help lightweight models preserve detection performance [21]. Zhao et al. proposed IDD-YOLOv7 for efficient feature extraction in defect detection, further confirming that lightweight YOLO variants are widely used in resource-constrained visual detection tasks [22].
These studies provide useful structural and training ideas for lightweight detection. However, most of them are developed for general object detection, infrared detection, underwater detection, or defect detection rather than for integrated illegal parking and license plate recognition. The proposed RKF-YOLO differs from these works by focusing on an edge-side traffic enforcement pipeline that must simultaneously handle large vehicle targets and small license plate regions.
2.5. Robustness Under Degraded Visual Conditions
Outdoor traffic enforcement scenes are often affected by degraded visual conditions such as rain, fog, low illumination, motion blur, and occlusion. Chen et al. systematically reviewed object detection for autonomous vehicles under adverse weather conditions, emphasizing that degraded visual environments remain a key challenge for perception reliability [23]. Du et al. proposed MLE-YOLO for robust vehicle and pedestrian detection under adverse weather, showing that feature enhancement and lightweight detection heads can improve robustness [24]. Liu et al. proposed MASFNet with multiscale adaptive sampling fusion for object detection in adverse weather [25]. Hu et al. developed a lightweight adverse-weather detection framework based on dual-teacher feature alignment, which improves robustness without adding inference cost [26]. Zhang et al. reduced weather-related spurious correlations through feature decorrelation and independence learning [27].
These studies indicate that adverse-weather robustness is important for real-world traffic perception. However, RKF-YOLO is not designed as a specialized adverse-weather detection or restoration model. In this study, degraded visual conditions are discussed as a limitation and future research direction, and multi-modal sensing or weather-aware feature enhancement is considered as a potential extension.
3. Materials and Methods
3.1. Overall Structure of RKF-YOLO
Although YOLOv11n exhibits favorable accuracy and efficiency in general object detection tasks, it still faces several challenges in illegal parking detection and license plate recognition scenarios, including dense vehicle distributions, limited computational resources at the edge, and the difficulty of accurately localizing small license plate targets. These issues make it difficult for the model to achieve a favorable balance between detection accuracy and real-time performance. To address these problems, this study proposes RKF-YOLO, a lightweight dual-task detection model based on YOLOv11n. The model is collaboratively optimized from three aspects: structural design, training strategy, and loss function, thereby enabling efficient collaboration between illegal parking detection and license plate recognition within a unified detection framework.
At the structural level, a Rep-CSP collaborative optimization architecture is constructed. By introducing re-parameterized convolution, the architecture decouples multi-branch training from single-path inference, improving feature representation without increasing inference cost. Meanwhile, asymmetric channel reduction is performed during the feature fusion stage to reduce redundant computation, while cross-scale feature interaction and the C2PSA attention mechanism are incorporated to enhance responses in key regions, thereby alleviating feature degradation caused by lightweight design. At the training level, a knowledge-transfer-enhanced training strategy (KTET) is proposed. By transferring effective training experience from the teacher model through optimizer selection and learning rate scheduling, the optimization process of the lightweight model is improved, enabling more stable convergence in complex scenarios and helping the model avoid local optima. At the loss-function level, considering the high proportion of small targets and low-IoU samples in license plate detection, the Focal mechanism is introduced into the CIoU loss function to construct an adaptively weighted bounding-box regression loss for hard samples. This strengthens the model's localization ability for license plates under complex conditions such as illumination variation and image blur. The overall structure of the improved network is shown in Figure 1.
3.2. Dual-Task Framework and Inference Pipeline
The proposed RKF-YOLO supports two closely related traffic-enforcement tasks: illegal parking detection and license-plate-oriented recognition. In practical applications, illegal parking detection determines whether a vehicle violates parking rules, while license plate recognition provides vehicle identity information for subsequent enforcement. Therefore, these two tasks form a continuous decision-making pipeline rather than two isolated procedures.
In RKF-YOLO, the two tasks share a lightweight backbone for extracting common visual features from traffic scenes. The shared features are then processed by the feature fusion neck and task-oriented detection branches. The illegal parking detection branch focuses on vehicle-level violation judgment and determines whether a parking violation exists in the target scene. The license plate branch focuses on small-scale plate regions and provides accurate plate localization for subsequent recognition. The cropped license plate regions are then used to compute whole-plate recognition accuracy and character-level accuracy in the recognition evaluation.
In the proposed enforcement pipeline, RKF-YOLO is used as the front-end perception model to detect illegal parking vehicles and localize license plate regions. The localized plate regions are then cropped and fed into a lightweight LPRNet-based recognition module for character-level recognition. During all comparative experiments, the recognition module is kept unchanged, so the differences in whole-plate recognition accuracy and character-level accuracy mainly reflect the influence of different detection models on license plate localization quality.
This design enables the model to reuse shared traffic-scene features while preserving task-specific detection capability. Compared with two completely independent models, the unified framework reduces redundant feature extraction and is more suitable for deployment on edge devices. At the same time, the task-oriented branches allow the model to handle the different scale characteristics of vehicles and license plates.
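The detect, crop, and recognize flow described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Detection` record, the class names `"illegal_parking"` and `"plate"`, the score threshold, and the `detector` and `recognizer` callables are all hypothetical stand-ins for RKF-YOLO and the LPRNet-based module.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box: Box
    label: str    # hypothetical class names: "illegal_parking" or "plate"
    score: float


def crop(image: np.ndarray, box: Box) -> np.ndarray:
    """Crop a box from an H x W x C image, clamping it to the image bounds."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    return image[y1:y2, x1:x2]


def enforce(image, detector, recognizer, min_score=0.5):
    """Run the detector once, then recognize every confident plate crop."""
    detections = [d for d in detector(image) if d.score >= min_score]
    violation = any(d.label == "illegal_parking" for d in detections)
    plates = [recognizer(crop(image, d.box))
              for d in detections if d.label == "plate"]
    return violation, plates
```

Because the recognizer only sees the cropped region, any change in plate localization quality propagates directly to recognition accuracy, which is why the recognition module can be held fixed when comparing detectors. Associating each plate with a specific vehicle is left out of this sketch.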
3.3. Design Rationale of RKF-YOLO
The design of RKF-YOLO is motivated by the accuracy–efficiency contradiction in edge-side traffic perception. Illegal parking detection requires robust vehicle-level semantic representation, whereas license plate localization relies on fine-grained features of small targets. Directly compressing the baseline YOLOv11n model can reduce computational cost, but it may also weaken feature representation and degrade small-target detection performance. Therefore, the proposed framework is designed around three principles: reducing redundant computation, preserving multi-scale feature representation, and compensating for feature degradation caused by lightweight compression.
First, re-parameterized convolution is introduced into the CSP-based feature structure to enhance feature learning during training while maintaining a single-path inference structure after re-parameterization. Second, asymmetric channel reduction is applied mainly in the feature fusion stage, where channel redundancy is relatively high, rather than aggressively compressing the backbone. This strategy reduces FLOPs while preserving fundamental shared features. Third, adaptive group convolution and global attention are retained as feature compensation mechanisms to enhance local fine-grained representation and global contextual modeling. Finally, Focal-CIoU is introduced in the license plate branch to increase the optimization weight of low-IoU samples, which are common in blurred, tilted, and small-scale license plate regions.
Therefore, the proposed improvements are not independent module replacements but are designed to address specific challenges in lightweight dual-task traffic perception.
3.4. Collaborative Optimization Design at the Structural Level
Considering the different feature-scale requirements of illegal parking detection and license plate recognition, as well as the lightweight constraints caused by the limited computational resources of edge devices, this study constructs a collaborative optimization architecture consisting of “re-parameterized feature extraction, asymmetric lightweight compression, and degraded-feature compensation” at the structural level. Without increasing inference cost, the proposed architecture simultaneously strengthens the global semantic features of large-scale vehicle targets and the fine-grained features of small-scale license plate targets, thereby addressing the accuracy–efficiency trade-off in unified dual-task modeling.
3.4.1. Design of the Re-Parameterized Convolution Unit
In application scenarios with limited edge-computing resources, there is often a significant contradiction between model complexity and feature representation capability. On the one hand, illegal parking detection relies on modeling multi-object relationships in complex scenes; on the other hand, license plate detection requires accurate characterization of small-scale fine-grained features. Conventional lightweight designs tend to weaken feature representation, thereby affecting the collaborative performance of the dual tasks. Although existing methods can improve representation capability through deeper networks or multi-branch structures, their computational cost makes it difficult to meet the real-time deployment requirements of edge devices. Therefore, from the perspective of structural re-parameterization, this study introduces a re-parameterized convolution (RepConv) unit.
The RepConv unit employs a structural re-parameterization mechanism. During training, a multi-branch structure is used to enhance feature representation; during inference, this structure is equivalently converted into a single-path convolution, thereby improving network performance without introducing additional inference cost. The 3 × 3 + 1 × 1 multi-branch structure in the training stage can simultaneously model global spatial contextual features, which are suitable for large-target illegal parking detection, and local inter-channel fine-grained features, which are suitable for small-target license plate detection. During inference, the branches are linearly fused into a single 3 × 3 convolution, introducing no additional computational cost and thus meeting the requirements of real-time edge deployment. The structure of the re-parameterized convolution is shown in Figure 2. Compared with a conventional single convolution structure, RepConv collaboratively models local and cross-channel feature information through multi-scale branches during training. Specifically, the 3 × 3 convolution branch is mainly responsible for extracting spatial contextual information, while the 1 × 1 convolution branch strengthens inter-channel feature interaction. Their combination improves the diversity and discriminative capability of feature representations.
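During training, RepConv adopts a multi-branch topology. The feature map $F_i$ at layer $i$ can be expressed as
$$F_i = \mathrm{BN}_{3}\left(W_{3} \otimes F_{i-1}\right) + \mathrm{BN}_{1}\left(W_{1} \otimes F_{i-1}\right),$$
where $F_i$ and $F_{i-1}$ denote the feature maps at layer $i$ and layer $i-1$, respectively; $W_3$ and $W_1$ denote the weights of the $3 \times 3$ and $1 \times 1$ convolution kernels, respectively; $\mathrm{BN}_3$ and $\mathrm{BN}_1$ denote the corresponding batch normalization operations; and ⊗ denotes the convolution operation.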
During inference, to avoid the additional computational cost caused by the multi-branch structure, a kernel fusion strategy is adopted to equivalently convert the multi-branch structure into a single convolution layer. Since convolution and batch normalization are both linear affine transformations and satisfy the superposition principle, fusion can be completed through a mathematically equivalent transformation. First, the convolution layer and its corresponding batch normalization layer are fused. For convolution weight $W$ and the corresponding $\mathrm{BN}$ parameters, the fused convolution kernel $W'$ and bias $b'$ are calculated as
$$W' = \frac{\gamma}{\sqrt{\sigma^{2}+\epsilon}}\,W, \qquad b' = \beta - \frac{\gamma\,\mu}{\sqrt{\sigma^{2}+\epsilon}},$$
where $\gamma$ and $\beta$ are the scaling factor and offset of batch normalization, respectively; $\mu$ and $\sigma^{2}$ are the running mean and variance, respectively; and $\epsilon$ is a numerical stability term.
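The 1 × 1 convolution branch is mapped to the 3 × 3 convolution kernel space through zero padding so that different branches are consistent in the spatial dimension. Cross-branch fusion is then performed to obtain the final equivalent convolution kernel:
$$W_{\mathrm{fused}} = W'_{3} + \mathrm{Pad}\left(W'_{1}\right), \qquad b_{\mathrm{fused}} = b'_{3} + b'_{1},$$
where $W'_{3}$, $b'_{3}$ and $W'_{1}$, $b'_{1}$ are the batch-normalization-fused kernels and biases of the 3 × 3 and 1 × 1 branches, respectively. Here, $\mathrm{Pad}(\cdot)$ denotes zero padding, which pads the 1 × 1 convolution kernel by one row or column on the top, bottom, left, and right sides.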
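The equivalence between the two-branch training structure and the fused single-path convolution can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the convolutions are bias-free and stride 1, batch normalization uses fixed running statistics, and all tensor sizes and helper names (`conv2d`, `fuse_conv_bn`, `fuse_repconv`, `random_bn`) are illustrative choices rather than the authors' code.

```python
import numpy as np


def conv2d(x, w, b=None, pad=0):
    """Naive stride-1 convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = xp.shape[1] - k + 1, xp.shape[2] - k + 1
    y = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]              # (C_in, k, k)
            y[:, i, j] = np.tensordot(w, patch, axes=3)  # sum over C_in, k, k
    if b is not None:
        y += b[:, None, None]
    return y


def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch normalization with fixed running statistics."""
    scale = gamma / np.sqrt(var + eps)
    return (x - mean[:, None, None]) * scale[:, None, None] + beta[:, None, None]


def fuse_conv_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold batch normalization into the preceding bias-free convolution."""
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None, None, None], beta - mean * scale


def fuse_repconv(w3, bn3, w1, bn1):
    """Merge the 3x3 and 1x1 branches into one 3x3 kernel and one bias."""
    w3f, b3f = fuse_conv_bn(w3, *bn3)
    w1f, b1f = fuse_conv_bn(w1, *bn1)
    w1_padded = np.pad(w1f, ((0, 0), (0, 0), (1, 1), (1, 1)))  # 1x1 -> 3x3
    return w3f + w1_padded, b3f + b1f


def random_bn(rng, c):
    """Random (gamma, beta, running mean, running variance) for c channels."""
    return (rng.uniform(0.5, 1.5, c), rng.normal(size=c),
            rng.normal(size=c), rng.uniform(0.5, 1.5, c))


rng = np.random.default_rng(0)
c_in, c_out = 4, 6
x = rng.normal(size=(c_in, 8, 8))
w3 = rng.normal(size=(c_out, c_in, 3, 3))
w1 = rng.normal(size=(c_out, c_in, 1, 1))
bn3, bn1 = random_bn(rng, c_out), random_bn(rng, c_out)

# Training-time structure: two conv+BN branches summed.
two_branch = batchnorm(conv2d(x, w3, pad=1), *bn3) + batchnorm(conv2d(x, w1), *bn1)
# Inference-time structure: one fused 3x3 convolution with bias.
w_fused, b_fused = fuse_repconv(w3, bn3, w1, bn1)
single_path = conv2d(x, w_fused, b_fused, pad=1)
max_abs_diff = float(np.abs(two_branch - single_path).max())
```

The two outputs agree up to floating-point rounding, which is what allows the extra branch to be removed at deployment without changing predictions.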
3.4.2. Cross-Scale Feature Fusion and Lightweight Compensation
There is an inherent feature-scale contradiction between the two tasks of illegal parking detection and license plate recognition. Illegal parking vehicles are large-scale targets that rely on global semantic features with low resolution and large receptive fields, whereas license plates are small-scale targets that depend on high-resolution and fine-grained local texture features. Based on RepConv, this study further performs collaborative optimization from three aspects: feature fusion, channel compression, and representation compensation. The purpose is to improve the unified representation capability of shared features for both the large-target task of illegal parking detection and the small-target task of license plate detection. First, a RepConv_C3k2 unit is designed by incorporating the re-parameterization concept into the cross-scale feature pyramid CSP backbone. Second, asymmetric channel reduction is implemented in the feature fusion layer to reduce redundant computation. Finally, adaptive group convolution and the C2PSA global attention module are introduced to compensate for the degradation of feature representation caused by the lightweight design. The Rep-CSP cross-scale feature fusion structure is shown in Figure 3.
Cross-scale feature pyramid fusion unit. The RepConv_C3k2 unit replaces the standard convolution in the original C3k2 module of YOLOv11n with the RepConv re-parameterized convolution designed in this study. It adopts a dual-branch structure, in which the input feature is divided into two branches through a $1 \times 1$ convolution. One branch passes through $n$ RepConv units for deep feature extraction, while the other branch is directly transmitted. The two branches are then concatenated and output through a fusion convolution. Without changing the module topology, this structure realizes feature enhancement during training and lightweight equivalent transformation during inference. For the input feature $X$, the calculation process is expressed as
$$[X_1, X_2] = \mathrm{Split}\left(\mathrm{Conv}_{1\times1}(X)\right), \qquad Y = \mathrm{Conv}\left(\mathrm{Concat}\left(\mathcal{R}^{n}(X_1),\, X_2\right)\right),$$
where $\mathcal{R}^{n}(\cdot)$ denotes the cascade of $n$ RepConv units, $\mathrm{Concat}(\cdot)$ denotes the channel concatenation operation, and $Y$ denotes the output feature. While enhancing multi-scale information interaction, this structure enables the network to simultaneously consider the overall structural features of vehicle targets and the fine-grained feature representation of license plate regions, thereby providing unified feature support for collaborative dual-task detection.
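The split, cascade, concatenate, and fuse data flow of the dual-branch unit can be traced with a small NumPy sketch. It follows the textual description above rather than any released code: the 1 × 1 convolutions are written as channel-mixing matrix products, the fusion convolution is assumed to be pointwise, the RepConv units are replaced by a hypothetical ReLU stand-in, and all channel sizes are illustrative.

```python
import numpy as np


def conv1x1(x, w):
    """Pointwise convolution. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def rep_csp_forward(x, w_split, w_fuse, rep_units):
    """Split -> n units on one branch -> concatenate -> fusion convolution."""
    y = conv1x1(x, w_split)                   # (2 * C_mid, H, W)
    deep, shortcut = np.split(y, 2, axis=0)   # two branches, C_mid channels each
    for unit in rep_units:                    # cascade of n units on the deep branch
        deep = unit(deep)
    return conv1x1(np.concatenate([deep, shortcut], axis=0), w_fuse)


rng = np.random.default_rng(0)
c_in, c_out, c_mid = 16, 16, 6                # illustrative sizes only
x = rng.normal(size=(c_in, 4, 4))
w_split = rng.normal(size=(2 * c_mid, c_in))
w_fuse = rng.normal(size=(c_out, 2 * c_mid))
relu = lambda t: np.maximum(t, 0.0)           # hypothetical stand-in for a RepConv unit
out = rep_csp_forward(x, w_split, w_fuse, [relu, relu])
```

The shortcut branch carries the split features unchanged to the fusion convolution, so only the deep branch grows in cost as the number of units increases.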
Asymmetric channel reduction strategy. To address the large amount of computational redundancy in illegal parking detection scenarios, this study implements asymmetric channel reduction in the feature fusion layer. Channel reduction is performed only in the feature fusion Neck, while the number of channels in the shared Backbone remains unchanged. The Backbone is responsible for extracting basic features shared by the two tasks, and excessive compression may cause feature degradation. In contrast, the multi-scale fusion process in the Neck contains substantial channel redundancy, and targeted compression can achieve a lightweight design with minimal accuracy loss. The channel scaling coefficient $e$ is reduced from the conventional value of 0.5 to 0.375. The intermediate channel number $C_{\mathrm{mid}}$ is calculated as
$$C_{\mathrm{mid}} = \left\lfloor e \cdot C_{\mathrm{out}} \right\rfloor,$$
where $C_{\mathrm{out}}$ denotes the number of output channels, and $\lfloor \cdot \rfloor$ denotes the floor operation. When $e = 0.375$, the number of intermediate channels is approximately three-eighths of the number of output channels, which can reduce computational cost while limiting accuracy degradation. In this task, this value provides a better trade-off between computational cost and accuracy under resource constraints. While reducing computational redundancy, this strategy helps maintain dual-task inference efficiency under edge deployment conditions and provides computational resource support for integrated detection.
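As a quick worked example of the intermediate-channel rule, the snippet below compares the conventional coefficient 0.5 with the reduced coefficient 0.375 for a few output widths. The function name `mid_channels` and the chosen widths are illustrative assumptions.

```python
import math


def mid_channels(c_out: int, e: float = 0.375) -> int:
    """Intermediate channel count: floor(e * c_out)."""
    return math.floor(e * c_out)


# (c_out, channels at e = 0.5, channels at e = 0.375)
widths = [(c, mid_channels(c, 0.5), mid_channels(c, 0.375)) for c in (64, 128, 256)]
for row in widths:
    print(row)
```

For a convolution whose input and output widths both equal the intermediate channel count, the multiply-accumulate cost scales with the square of that count, so moving from 0.5 to 0.375 reduces the cost of such a layer to about (0.375 / 0.5)^2 = 0.5625 of its previous value. This is arithmetic on the formula only, not a measurement reported in this study.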
Representation compensation: adaptive group convolution and global attention. To alleviate the degradation of feature representation caused by extreme lightweight design, this study introduces an adaptive group convolution mechanism inside RepConv and retains the C2PSA global attention module at the end of the Backbone. Adaptive group convolution can construct diversified feature subspaces under low-channel dimensions, thereby alleviating the loss of fine-grained license plate features caused by channel reduction. The C2PSA global attention module models long-range contextual dependencies, strengthens the global semantic features of illegally parked vehicles, and suppresses complex background interference. These two modules are complementary and jointly adapt to the differentiated feature requirements of the two tasks. The adaptive group convolution dynamically calculates the optimal number of groups $G$ according to the intermediate channel dimension $C_{\mathrm{mid}}$:
$$G = \max\left\{\, g \in \mathcal{G} \;:\; C_{\mathrm{mid}} \bmod g = 0 \,\right\}.$$
The candidate group numbers in $\mathcal{G}$ are arranged in descending order, so that a larger group number is preferentially selected under the condition of divisibility to improve parallelism. When $C_{\mathrm{mid}}$ cannot be divided by a larger group number,
$G$ automatically degenerates to a smaller group number, thereby constructing diversified feature subspaces in a low-dimensional space. The global attention module adopts an unbalanced design of global down-weighting and local enhancement. The attention weight $A$ is calculated as
$$A = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right), \qquad \mathrm{Attention}(Q, K, V) = AV,$$
where $Q$, $K$, $V$, and $A$ denote the query matrix, key matrix, value matrix, and attention-weight matrix, respectively, and $d_k$ denotes the dimension of the key vector. Considering multi-object scenarios in illegal parking detection and low-resolution conditions in license plate detection, retaining C2PSA helps compensate for the loss of global contextual information caused by channel reduction, strengthens responses in key regions, and realizes collaborative optimization for targets at different scales.
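For reference, the standard scaled dot-product attention that the query, key, and value definitions above correspond to can be sketched in NumPy as follows. This shows only the generic formulation; the unbalanced global down-weighting and local enhancement mentioned in the text is not modeled here, and the matrix sizes are arbitrary.

```python
import numpy as np


def attention(q, k, v):
    """Scaled dot-product attention. q: (n, d_k), k: (m, d_k), v: (m, d_v)."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # (n, m) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
k = rng.normal(size=(7, 8))
v = rng.normal(size=(7, 3))
attended, attn_weights = attention(q, k, v)
```

Each row of the weight matrix sums to one, so every output position is a convex combination of the value vectors, which is how long-range context is mixed into each location.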
3.4.3. Placement and Hyperparameter Settings of C2PSA and Adaptive Group Convolution
To improve reproducibility, the exact placement and hyperparameter settings of the C2PSA attention module and adaptive group convolution are clarified as follows. Adaptive group convolution is embedded inside the Rep-CSP structure after asymmetric channel reduction. This position is selected because channel redundancy has been reduced at this stage, and adaptive grouping can construct diversified local feature subspaces under compressed channel dimensions. The candidate group set is
, and the final group number $G$ is selected according to the divisibility rule defined in Equation (10). The convolution kernel size is
, and the stride is 1.
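The divisibility rule for choosing the group number can be sketched in a few lines. The candidate set `(8, 4, 2, 1)` used below is a hypothetical placeholder chosen purely for illustration and should not be read as the set used in this study; the function name `select_groups` is likewise an assumption.

```python
def select_groups(c_mid: int, candidates=(8, 4, 2, 1)) -> int:
    """Return the largest candidate group count that divides c_mid."""
    for g in sorted(candidates, reverse=True):   # try larger group counts first
        if c_mid % g == 0:
            return g
    return 1                                     # ungrouped convolution as fallback


chosen = {c: select_groups(c) for c in (24, 12, 6, 7)}
```

In practice a grouped convolution requires both its input and output channel counts to be divisible by the group number, so an implementation would apply the same check to both widths.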
The C2PSA module is placed at the end of the backbone before the feature fusion neck. This placement allows C2PSA to operate on high-level semantic features and model long-range contextual dependencies before multi-scale feature fusion. Placing C2PSA too early would increase computation on high-resolution feature maps, while placing it in the detection head may interfere with task-specific prediction. Therefore, C2PSA is used as a global semantic compensation module before the neck.
3.5. Knowledge-Transfer-Enhanced Training Strategy
After completing the lightweight collaborative optimization design at the structural level, the number of model parameters and the computational cost are significantly reduced, enabling the model to adapt to the computational constraints of edge devices. However, the decrease in network capacity caused by lightweight design also makes the model prone to insufficient feature learning, large fluctuations during convergence, and local optima under the imbalanced data distribution of complex traffic scenarios. Relying solely on structural optimization makes it difficult to fully release the feature learning potential of the model while maintaining an extremely lightweight structure. Therefore, an adaptive optimization scheme for lightweight models is required at the training-strategy level.
Knowledge distillation is a mainstream technique for enhancing the performance of lightweight models. It transfers effective knowledge from a high-performance teacher model to a lightweight student model, thereby significantly improving the detection accuracy and generalization ability of the student model without increasing computational cost during inference. In this study, the teacher model is the uncompressed YOLOv11n baseline model, while the student model is the lightweight RKF-YOLO model optimized by the Rep-CSP structure and asymmetric channel reduction. Traditional knowledge distillation realizes knowledge transfer by constraining the output of the student model to be close to that of the teacher model. Its typical form is
$$L_{KD} = L_{CE} + L_{KL}(p_s, p_t)$$
where $L_{CE}$ denotes the cross-entropy loss, $L_{KL}$ denotes the KL divergence, and $p_s$ and $p_t$ denote the output distributions of the student model and the teacher model, respectively.
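As a minimal sketch of the conventional output-imitation objective described above, the following pure-Python example combines a cross-entropy term on the hard label with a KL term between the teacher and student distributions. The weighting coefficient `alpha`, the `temperature`, the KL direction, and all function names are illustrative assumptions and are not taken from this study.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy of a predicted distribution against a hard label index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=1.0):
    # alpha and temperature are assumed hyperparameters (not from the paper).
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    hard_term = cross_entropy(softmax(student_logits), label)
    soft_term = kl_divergence(p_t, p_s)  # assumed direction: teacher || student
    return alpha * hard_term + (1.0 - alpha) * soft_term


if __name__ == "__main__":
    print(distillation_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], label=0))
```

The soft term vanishes when the student reproduces the teacher exactly, which is why pure output imitation can also reproduce a teacher's local optimum.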
However, in illegal parking detection, the data distribution is obviously imbalanced and complex. Directly imitating the teacher’s output may cause the student model to inherit the teacher model’s local optimum. Therefore, this study proposes a knowledge-transfer-enhanced training strategy (KTET). Instead of taking output imitation as the objective, KTET transfers effective experience from the training process of the teacher model and reconstructs the training strategy of the student model from two aspects: optimizer selection and learning rate scheduling. The KTET framework is shown in
Figure 4.
For optimizer selection, the teacher model is trained with the SGD optimizer, whereas KTET trains the student model with AdamW. SGD has strong global optimization capability and favorable generalization, making it suitable for fully training large-capacity models. In contrast, the lightweight student model has reduced network capacity, increased gradient noise and variance, and significantly higher sensitivity to hyperparameters. AdamW provides adaptive step sizes and decoupled weight decay, enabling smoother parameter updates and suppressing the overfitting risk of lightweight models. As a result, it brings better detection accuracy and convergence speed in illegal parking detection. The AdamW update process can be written as
$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\,g_k, \qquad v_k = \beta_2 v_{k-1} + (1-\beta_2)\,g_k^2$$
$$\hat{m}_k = \frac{m_k}{1-\beta_1^k}, \qquad \hat{v}_k = \frac{v_k}{1-\beta_2^k}$$
$$\theta_k = \theta_{k-1} - \eta\left(\frac{\hat{m}_k}{\sqrt{\hat{v}_k}+\epsilon} + \lambda\,\theta_{k-1}\right)$$
where $g_k$ is the gradient at step $k$, $\eta$ is the learning rate,
$\beta_1$ = 0.9,
$\beta_2$ = 0.999,
$\epsilon$ is a small constant for numerical stability, and
$\lambda$ is the weight decay coefficient.
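The update rule can be illustrated with a single scalar parameter. The sketch below follows the standard AdamW formulation with decoupled weight decay; the defaults for `eps` and `weight_decay` are placeholder assumptions, since only the two momentum coefficients are restated here, and the function name is hypothetical.

```python
import math


def adamw_step(theta, grad, m, v, step, lr=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=5e-4):
    """One AdamW update for a scalar parameter.

    eps and weight_decay are illustrative defaults (assumptions).
    `step` starts at 1 so that the bias correction is well defined.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** step)             # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** step)
    # Weight decay is applied to the parameter directly, not to the gradient.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v


if __name__ == "__main__":
    # Toy objective f(theta) = theta^2 with gradient 2 * theta.
    theta, m, v = 1.0, 0.0, 0.0
    for step in range(1, 201):
        theta, m, v = adamw_step(theta, 2.0 * theta, m, v, step)
    print(theta)
```

Because the step is normalized by the second-moment estimate, its size is largely independent of the raw gradient scale, which is the property the text relies on for noisy lightweight models.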
For learning rate scheduling, KTET sets the initial learning rate of the student model to one-tenth of that of the teacher model:
$$\eta_0^{S} = 0.1\,\eta_0^{T}$$
where
$\eta_0^{T}$ = 0.01 is the initial learning rate of the teacher model, so that $\eta_0^{S}$ = 0.001. This setting not only prevents oscillation and divergence of the lightweight model caused by an overly large learning rate, but also avoids slow convergence and local optima caused by an excessively small learning rate. It therefore adapts to the reduced capacity and more sensitive training dynamics of the lightweight model, making the optimization process smoother and improving convergence stability. The learning rate is decayed using cosine annealing:
$$\eta(t) = \eta_{\min} + \frac{1}{2}\left(\eta_0^{S} - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right)$$
where
$\eta_{\min} = r_f\,\eta_0^{S}$,
$t$ denotes the current epoch, and
$T$ denotes the total number of epochs. Furthermore, the final decay factor is set to
$r_f$ = 0.01, reducing the learning rate at the end of training to 1% of the initial learning rate to improve convergence precision.
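The schedule can be sketched in a few lines. The example assumes the teacher's initial learning rate of 0.01 reported in the ablation study, a final decay factor of 0.01, and a standard cosine form that ends at 1% of the initial rate; the epoch count used in the demonstration is a hypothetical value, and the names are illustrative.

```python
import math

TEACHER_LR0 = 0.01                 # teacher initial learning rate (from the text)
STUDENT_LR0 = 0.1 * TEACHER_LR0    # one-tenth of the teacher's rate
FINAL_FACTOR = 0.01                # final learning rate = 1% of the initial rate


def cosine_lr(epoch, total_epochs, lr0=STUDENT_LR0, final_factor=FINAL_FACTOR):
    """Cosine-annealed learning rate at a given epoch."""
    lr_min = final_factor * lr0
    cosine = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * cosine


if __name__ == "__main__":
    total = 100  # hypothetical number of epochs, for illustration only
    for epoch in (0, 25, 50, 75, 100):
        print(epoch, cosine_lr(epoch, total))
```

The rate starts at 0.001, falls slowly at first, fastest around the midpoint, and flattens again near the end, where it reaches 0.00001.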
3.6. Improved Complete Intersection over Union Loss Function
In the small-target license plate recognition task, bounding-box regression accuracy directly affects the reliability of subsequent character recognition and law-enforcement evidence collection. Since license plates are small targets and are easily affected by blur, illumination variation, and other factors, large deviations often occur between the predicted boxes and the ground-truth boxes, resulting in a high proportion of low-IoU samples during training. For such samples, traditional regression loss functions provide relatively small gradient contributions during optimization, which causes the model to focus preferentially on easy samples. This leads to insufficient learning of hard samples and slows down overall convergence. In this study, the Focal-CIoU loss is applied only to the bounding-box regression of the license plate detection branch, while the illegal parking detection branch still adopts the CIoU loss. This avoids the negative optimization effect that excessive weighting may cause in large-target detection.
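The CIoU loss function comprehensively considers three terms: overlapping area, center-point distance, and aspect-ratio consistency:
$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$
where
$IoU$ denotes the intersection over union;
$b$ and
$b^{gt}$ denote the center points of the predicted box and the ground-truth box, respectively;
$\rho(\cdot)$ denotes the Euclidean distance; $c$ denotes the diagonal length of the smallest enclosing box; $\alpha = v / ((1 - IoU) + v)$ is the trade-off weight; and $v$ denotes the aspect-ratio consistency metric:
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
where
$w$ and
$h$ denote the width and height of the predicted box, respectively, and
$w^{gt}$ and
$h^{gt}$ denote the width and height of the ground-truth box, respectively.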
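The three CIoU terms can be computed directly for a single pair of boxes. The sketch below assumes boxes in (center x, center y, width, height) format with positive sizes and uses the standard CIoU definition; the function name and the small `eps` stabilizer are illustrative choices.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """Return (CIoU loss, IoU) for two boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = pred
    tx, ty, tw, th = target

    # Convert to corner coordinates.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    t_x1, t_y1, t_x2, t_y2 = tx - tw / 2, ty - th / 2, tx + tw / 2, ty + th / 2

    # Overlap term.
    inter_w = max(0.0, min(p_x2, t_x2) - max(p_x1, t_x1))
    inter_h = max(0.0, min(p_y2, t_y2) - max(p_y1, t_y1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # Center-distance term, normalized by the enclosing-box diagonal.
    enclose_w = max(p_x2, t_x2) - min(p_x1, t_x1)
    enclose_h = max(p_y2, t_y2) - min(p_y1, t_y1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps
    rho2 = (px - tx) ** 2 + (py - ty) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
    alpha = v / ((1.0 - iou) + v + eps)

    return 1.0 - iou + rho2 / c2 + alpha * v, iou


if __name__ == "__main__":
    print(ciou_loss((50.0, 20.0, 40.0, 12.0), (54.0, 22.0, 44.0, 14.0)))
```

For non-overlapping boxes the IoU term saturates at 1, so the center-distance term is what still provides a useful gradient.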
Although CIoU has advantages in geometric constraint modeling, its loss weight is mainly linearly modulated by $IoU$. Therefore, it has limited ability to distinguish hard samples in the low-IoU region and is insufficient for improving the localization ability of the model for blurred and small-scale license plate targets. To address this problem, this study introduces a focal weighting mechanism to strengthen the gradient contribution of hard samples from the perspective of optimization. Existing studies related to Focal-CIoU are mostly designed for general object detection scenarios. In contrast, this study optimizes the modulation form of focal weights according to the characteristics of license plate detection, where small targets account for a high proportion and localization deviations have a substantial impact on subsequent tasks. This makes the loss function more suitable for license plate detection in complex traffic scenarios.
By introducing the focal weight
$(1 - IoU)^{\gamma}$, the improved focal complete intersection over union loss is defined as
$$L_{Focal\text{-}CIoU} = (1 - IoU)^{\gamma}\,L_{CIoU}$$
where
$\gamma$ denotes the focusing parameter, which is usually set to 2. This mechanism dynamically adjusts the loss weight according to
$IoU$ and assigns higher weights to low-IoU samples, namely hard samples, thereby accelerating the learning of difficult scenarios.
From the perspective of the optimization mechanism, this method changes the gradient distribution through nonlinear reweighting of the loss function and accelerates model convergence in complex scenarios. For the license plate detection task, this strategy can effectively improve bounding-box regression accuracy under small-target, low-resolution, and complex illumination conditions, thereby enhancing license plate recognition accuracy.
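The effect of the focal weighting can be shown numerically. The sketch below assumes that the focal weight takes the form (1 - IoU)^gamma, which is one form consistent with the description that low-IoU samples receive higher weights; the exact modulation used in this study may differ, and the function names are hypothetical.

```python
def focal_weight(iou, gamma=2.0):
    """Assumed focal weight: larger for low-IoU (hard) samples."""
    return (1.0 - iou) ** gamma


def focal_ciou(iou, ciou_loss_value, gamma=2.0):
    """Scale a precomputed CIoU loss value by the assumed focal weight."""
    return focal_weight(iou, gamma) * ciou_loss_value


if __name__ == "__main__":
    for iou in (0.2, 0.5, 0.8):
        print(iou, focal_weight(iou))
```

With gamma = 2, a sample at IoU 0.2 is weighted 0.64 while a sample at IoU 0.8 is weighted 0.04, so the hard sample contributes sixteen times more per unit of CIoU loss.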
3.7. Experimental Environment and Hyperparameter Settings
The experiments in this study were conducted on the Ubuntu operating system. The GPU used was an NVIDIA GeForce RTX 4090, with CUDA 11.8, Python 3.10, and PyTorch 2.1.2. YOLOv11n was adopted as the baseline model. During all ablation experiments and comparative experiments, the experimental environment and hyperparameter settings were kept consistent. The detailed experimental parameter settings are shown in
Table 1.
3.8. Datasets and Preprocessing
The experiments were validated on a self-constructed illegal parking detection dataset and the CCPD license plate dataset to support the collaborative dual-task framework for illegal parking detection and license plate recognition proposed in this study. To improve the adaptability of the model in practical complex traffic scenarios, this study screened, cleaned, and relabeled public road images to construct a standardized illegal parking detection dataset for multi-scenario violation behavior recognition. Representative illegal parking scenarios in practical traffic management were specifically supplemented, including illegal parking in fire lanes, parking lots, pedestrian crossings, and roadside areas. Normal parking samples were also included to enhance the completeness and discriminability of the data distribution.
The dataset covers the above scenarios and also contains license plate regions in the images, thereby providing unified scene support for both illegal parking detection and license plate recognition in the dual-task framework. The final dataset contains 11,174 images, with two annotated categories: “right” and “illegal”. All samples were unified in image format and processed with bounding-box annotation and data normalization according to the YOLO detection task standard. All images were randomly divided into training, validation, and test sets at a ratio of 7:2:1. Specifically, the training set contains 7822 images, the validation set contains 2235 images, and the test set contains 1117 images. During training, data augmentation strategies such as random flipping, MixUp, and HSV color-space perturbation were adopted to improve the adaptability of the model in complex environments.
Considering that the license plate detection task requires verification of the model’s capability for low-IoU samples, the CCPD dataset was further introduced as a supplementary data source. It contains 21,352 license plate images covering various sample types, including clear, blurred, tilted, occluded, and unevenly illuminated license plates. The dataset was also divided into training, validation, and test sets at a ratio of 7:2:1.
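The 7:2:1 division can be sketched as a shuffled index split. The rounding rule and the random seed below are assumptions chosen for illustration; with this rule, the image counts reported for the illegal parking dataset are reproduced.

```python
import random


def split_indices(n_images, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle image indices and split them into train/val/test lists.

    The test set takes the remainder so that every image is used exactly once.
    The seed and the rounding rule are illustrative assumptions.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_train = round(n_images * ratios[0])
    n_val = round(n_images * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(11174)
    print(len(train), len(val), len(test))  # 7822 2235 1117
```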
3.9. Evaluation Metrics
Precision ($P$), recall ($R$), average precision ($AP$), mean average precision ($mAP$), number of parameters (Params), floating-point operations (FLOPs), and inference speed (FPS) were used as evaluation metrics. These metrics are calculated as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$AP = \int_0^1 P(R)\,dR$$
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$$
where
$TP$ denotes the number of positive samples correctly predicted as positive,
$FP$ denotes the number of negative samples incorrectly predicted as positive,
$FN$ denotes the number of positive samples incorrectly predicted as negative, and
$n$ denotes the number of categories.
mAP@0.5 represents the mean average precision when the intersection over union between the detected bounding box and the ground-truth box is greater than or equal to 0.5.
mAP@0.5:0.95 represents the mean average precision averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
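These definitions translate into a few short functions. The sketch below approximates the AP integral with all-point interpolation of the precision-recall curve, which is an assumption; evaluation toolkits may use a different interpolation (for example, a fixed number of recall points). The function names and the toy inputs are illustrative.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of ground-truth positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).

    `recalls` must be sorted in ascending order.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    print(precision(8, 2), recall(8, 8))
    print(average_precision([0.5, 1.0], [1.0, 0.5]))
```

mAP@0.5:0.95 is obtained by repeating the AP computation at each IoU threshold from 0.5 to 0.95 and averaging the results.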
4. Results and Discussion
4.1. Fine-Grained Ablation Experiments
To further clarify the contribution of each component, a fine-grained ablation experiment was designed on the illegal parking detection dataset. YOLOv11n was used as the baseline. Rep-CSP, asymmetric channel reduction, adaptive group convolution, C2PSA, and KTET were progressively introduced. Since Focal-CIoU is mainly designed for the license plate detection branch, its influence is further analyzed in the license plate loss comparison experiment. The detailed results are shown in
Table 2.
The results show that Rep-CSP slightly improves detection accuracy, indicating that the re-parameterized structure enhances feature representation. After asymmetric channel reduction, the model complexity is significantly reduced, but the detection accuracy decreases because of feature degradation. Adaptive group convolution and C2PSA partly recover the lost accuracy by enhancing local fine-grained features and global contextual information. After KTET is introduced, the compressed model achieves a stable performance improvement, suggesting that the training strategy is important for releasing the potential of the lightweight structure. Overall, RKF-YOLO achieves a better accuracy–efficiency trade-off than the baseline model.
4.2. Stability Analysis with Different Random Seeds
To evaluate whether the performance improvement is caused by random training fluctuations, the baseline YOLOv11n and RKF-YOLO were trained under three different random seeds, namely 42, 123, and 2026. All runs used the same dataset split, input resolution, training epochs, and basic hyperparameter settings. The mean and standard deviation were calculated for mAP@0.5 and mAP@0.5:0.95.
As shown in
Table 3, RKF-YOLO maintains higher mean mAP values than YOLOv11n under the three random seeds while using fewer parameters and FLOPs. The standard deviations remain within a limited range, indicating that the observed improvement is not solely caused by random initialization or a single favorable training run. These results demonstrate that the proposed lightweight optimization strategy has stable performance under different initialization conditions.
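The per-seed summary can be computed as follows. The metric values in this sketch are placeholders chosen only to make the arithmetic easy to follow and are not results from this study; the use of the sample standard deviation (rather than the population form) is also an assumption.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) across independent runs."""
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Seeds are those named in the text; the scores are PLACEHOLDERS,
    # not measured results.
    placeholder_scores = {42: 80.0, 123: 81.0, 2026: 82.0}
    mean, std = summarize(list(placeholder_scores.values()))
    print(f"{mean:.1f} +/- {std:.1f}")
```

With only three runs, the standard deviation is a coarse estimate, so it is best read as a sanity check on run-to-run variation rather than as a formal significance test.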
4.3. Ablation Study on the Placement of C2PSA and Adaptive Group Convolution
To further evaluate the influence of module placement, a compact placement ablation was designed for C2PSA and adaptive group convolution. The compared settings focus on the most relevant design choices: placing C2PSA in the neck or detection head versus the end of the backbone and placing adaptive group convolution before or after asymmetric channel reduction inside Rep-CSP. This experiment is intended to verify whether the final placement provides a better accuracy–efficiency trade-off.
As shown in
Table 4, the final placement achieves a slightly better accuracy–efficiency trade-off than the alternative settings. Placing C2PSA in the neck provides competitive accuracy but increases FLOPs, while placing it in the detection head provides limited improvement because the prediction branch is more task-specific. For adaptive group convolution, applying it after channel reduction achieves comparable or better accuracy with lower computational cost than applying it before channel reduction. Therefore, RKF-YOLO adopts C2PSA at the end of the backbone and adaptive group convolution after channel reduction inside Rep-CSP.
4.4. Ablation Experiments on the Knowledge-Transfer-Enhanced Training Strategy
To further verify the effectiveness of each component in the KTET training strategy, ablation experiments on optimizer selection and learning rate scheduling were conducted under the condition that the lightweight structure remained fixed. The experimental results are shown in
Table 5.
In
Table 5, F denotes the teacher model, namely YOLOv11n, which was trained using the SGD optimizer and a learning rate of 0.01 as the reference. G denotes the student-model baseline, which directly follows the training configuration of the teacher model. Compared with the teacher model,
mAP@0.5 decreased by 1.4 percentage points to 83.5%, and
mAP@0.5:0.95 decreased by 1.3 percentage points to 64.8%. This reflects the performance loss of the lightweight structure under the same training strategy.
H replaces the optimizer with AdamW on the basis of the student baseline G while keeping the learning rate unchanged. Compared with G, mAP@0.5 increased by 0.7 percentage points to 84.2%, and mAP@0.5:0.95 increased by 0.7 percentage points to 65.5%. Precision and recall increased by 0.7 and 0.5 percentage points, respectively. This indicates that the adaptive step size and decoupled weight decay mechanism of AdamW are more suitable for training lightweight models, enabling smoother parameter updates and more stable convergence under conditions with greater gradient noise.
Here, I represents the complete KTET configuration. On the basis of AdamW, the initial learning rate was adjusted from 0.01 to 0.001, and cosine annealing decay was adopted. Compared with G, mAP@0.5 increased by 2.0 percentage points, and mAP@0.5:0.95 increased by 2.4 percentage points. Compared with H, mAP@0.5 increased by 1.3 percentage points, mAP@0.5:0.95 increased by 1.7 percentage points, and precision and recall increased by 1.9 and 1.6 percentage points, respectively. Under the KTET configuration, both mAP@0.5 and mAP@0.5:0.95 of the student model exceeded those of the teacher model. This demonstrates that the proposed strategy can effectively transfer the optimization experience of the teacher model, enabling the student model to achieve better validation performance than the teacher baseline under the current experimental setting.
Overall, KTET effectively adapts to the reduced capacity and more sensitive training dynamics of lightweight models, significantly improving the convergence quality and final accuracy of the student model.
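The percentage-point differences quoted in this subsection can be cross-checked with simple arithmetic. The sketch below uses only values stated in or derived from the text, under the assumption that the two reported metrics are mAP@0.5 and mAP@0.5:0.95 in that order; the variable names are illustrative.

```python
# (mAP@0.5, mAP@0.5:0.95) in percent.
student_baseline = (83.5, 64.8)              # G: stated directly
teacher = (83.5 + 1.4, 64.8 + 1.3)           # F: G is 1.4 / 1.3 points lower
adamw_only = (84.2, 65.5)                    # H: stated directly
full_ktet = (83.5 + 2.0, 64.8 + 2.4)         # I: G plus 2.0 / 2.4 points


def delta(a, b):
    """Percentage-point difference a - b, rounded to one decimal place."""
    return tuple(round(x - y, 1) for x, y in zip(a, b))


if __name__ == "__main__":
    print(delta(full_ktet, adamw_only))  # (1.3, 1.7)
    print(delta(full_ktet, teacher))     # (0.6, 1.1)
```

The gains of I over H and of I over the teacher match the figures quoted in this subsection and in the conclusions, so the reported deltas are internally consistent.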
4.5. Comparative Experiments with Different Loss Functions
To verify the effectiveness of the Focal-CIoU loss function in the license plate detection stage of the license plate recognition task, Focal-CIoU was compared with CIoU, which served as the baseline, as well as SIoU, PIoU [
28], and EIoU [
29], based on the structurally and training-optimized model. The experimental results are shown in
Table 6.
As shown in
Table 6 and
Figure 5, SIoU improves
mAP@0.5 by 0.2 percentage points and
mAP@0.5:0.95 by 0.3 percentage points compared with the CIoU baseline by introducing an angle penalty mechanism, but the improvement is limited. EIoU separates the width and height loss terms on the basis of CIoU, improving
by 0.1 percentage points; however, its optimization for license plate detection scenarios remains insufficient. PIoU introduces an exponential gradient activation term and a non-monotonic attention mechanism, improving
mAP@0.5 by 0.4 percentage points and
mAP@0.5:0.95 by 0.6 percentage points, thus outperforming SIoU and EIoU.
The proposed Focal-CIoU achieves the best performance across all metrics. Compared with the CIoU baseline, mAP@0.5 and mAP@0.5:0.95 are improved by 1.0 and 1.3 percentage points, respectively, while precision and recall are both improved by 1.0 percentage points. Compared with the second-best PIoU, Focal-CIoU still improves mAP@0.5 by 0.6 percentage points and mAP@0.5:0.95 by 0.7 percentage points. These results indicate that, in scenarios where low-IoU hard samples account for a relatively high proportion due to license plate blur, poor illumination, and other factors, Focal-CIoU dynamically assigns the optimization focus to hard samples through the focal weighting mechanism, thereby effectively improving bounding-box regression quality and overall detection accuracy.
The improvement of Focal-CIoU is more evident in mAP@0.5:0.95 than in mAP@0.5, indicating that the proposed loss mainly improves localization quality rather than merely increasing coarse detection accuracy. This is consistent with the characteristics of license plate detection, where small-scale and low-IoU samples are common. By assigning larger optimization weights to low-IoU samples, Focal-CIoU helps the model learn more effectively from blurred, tilted, and difficult license plate targets.
4.6. Evaluation with Representative Detector Architectures
To evaluate RKF-YOLO against representative detectors with different architectural paradigms, several non-YOLO object detectors were included in the illegal parking detection experiment. Faster R-CNN [
30] was selected as a classical two-stage detector; SSD-MobileNetV2 [
31,
32] and EfficientDet-D0 [
33] were selected as lightweight one-stage detectors; and RT-DETR-R18 [
34] was selected as a representative real-time DETR-based detector. In addition, YOLOv3-tiny [
35], YOLOv5n [
36], YOLOv8n [
37], YOLOv10n [
38], and YOLOv11n [
39] were included as YOLO-series baselines. All representative detectors were evaluated under the same dataset split, input resolution, training epochs, and preprocessing settings to ensure a fair comparison among different architectural paradigms in the illegal parking detection task.
As shown in
Table 7, RT-DETR-R18 achieves slightly higher
mAP than RKF-YOLO, but its parameter size and computational cost are much higher. Faster R-CNN also obtains competitive accuracy but is not suitable for lightweight edge deployment because of its large computational burden. SSD-MobileNetV2 and EfficientDet-D0 have relatively low computational complexity, but their detection accuracy is lower in complex illegal parking scenes. Compared with YOLOv11n, RKF-YOLO improves
mAP@0.5 and
mAP@0.5:0.95 while reducing parameters and FLOPs by 38.2% and 38.1%, respectively. These results indicate that RKF-YOLO provides a better accuracy–efficiency trade-off for edge-side illegal parking detection.
4.7. Comparison with YOLO-Series Models in the License Plate Recognition Pipeline
For the license plate recognition task, the comparison is kept as a controlled YOLO-family comparison because the reported overall accuracy and character accuracy depend on the complete license plate recognition pipeline rather than on object detection alone. The purpose of this comparison is to evaluate how different YOLO-series detector backbones affect license plate localization quality and subsequent recognition performance under the same LPRNet-based recognition setting.
As shown in
Table 8, RKF-YOLO achieves an overall recognition accuracy of 95.7% and a character accuracy of 98.9%, which are slightly higher than those of YOLOv11n. This indicates that the improved localization quality of RKF-YOLO contributes to more accurate license plate recognition while maintaining a lightweight model structure. Since overall and character accuracies are affected by both plate localization and downstream recognition, the cross-architecture detector comparison is provided separately for the illegal parking detection task, whereas this subsection focuses on a controlled end-to-end license plate recognition pipeline comparison.
4.8. Robustness Discussion Under Degraded Visual Conditions
Motivated by recent studies on adverse-weather object detection, robustness under degraded visual conditions is further discussed. Outdoor traffic enforcement scenes are often affected by illumination variation, rain, fog, motion blur, and partial occlusion. These factors may reduce the localization quality of both vehicle targets and license plate regions, thereby influencing illegal parking detection and subsequent license plate recognition. Unlike specialized adverse-weather detectors or weather-adaptation networks, RKF-YOLO is not specifically designed for weather restoration. Therefore, this study treats adverse-weather robustness as an important limitation rather than claiming a complete adverse-weather benchmark, and it identifies systematic adverse-weather evaluation and multi-modal perception as directions for future work.
4.9. Edge Deployment and Quantization Analysis
To further evaluate deployment feasibility, RKF-YOLO and YOLOv11n were first exported to ONNX, converted into RKNN format, and deployed on an ELF2 development board equipped with the RK3588 for inference testing. Inference was based on the RKNN Toolkit framework, and the built-in NPU was used for forward computation. Post-training INT8 quantization and operator fusion were applied to improve inference efficiency. The input size was fixed at 640 × 640, and the batch size was set to 1. The calibration images were selected from the training set and do not overlap with the test set. Inference time and FPS were calculated as the average performance on the RK3588 platform after warm-up. The quantization accuracy drop was calculated by comparing the RKNN INT8 results with the PyTorch FP32 results. The illegal parking detection deployment results are shown in
Table 9, and the license plate recognition deployment results are shown in
Table 10.
The PyTorch FP32 rows are used as pre-conversion accuracy references for calculating quantization degradation. Since these rows were not executed through the RKNN INT8 deployment pipeline on the RK3588 NPU, inference time and FPS are not reported and are denoted by “—”.
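The measurement protocol described above (warm-up followed by averaged timing at batch size 1) can be sketched generically. The `infer` callable stands in for a single forward pass on the device; the warm-up and run counts are assumptions, and the dummy workload in the demonstration is not an RKNN call.

```python
import time


def benchmark(infer, n_warmup=10, n_runs=100):
    """Return (mean latency in ms, FPS) for a zero-argument callable.

    Warm-up iterations are executed first and excluded from timing.
    n_warmup and n_runs are illustrative defaults (assumptions).
    """
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - start
    mean_ms = 1000.0 * elapsed / n_runs
    fps = 1000.0 / mean_ms if mean_ms > 0 else float("inf")
    return mean_ms, fps


if __name__ == "__main__":
    # Dummy CPU workload used only to exercise the harness.
    mean_ms, fps = benchmark(lambda: sum(range(10000)))
    print(f"{mean_ms:.3f} ms per run, {fps:.1f} FPS")
```

Excluding warm-up matters on accelerators because the first few runs typically include one-time initialization and memory allocation costs.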
The RK3588 deployment setup is shown in
Figure 6.
After RKNN conversion and INT8 quantization, RKF-YOLO maintains a small accuracy drop while significantly improving inference speed. Compared with YOLOv11n, the proposed model reduces inference time from 24.6 ms to 16.8 ms and increases FPS from 40.7 to 59.5 on the RK3588 platform. In terms of accuracy, the quantized RKF-YOLO still achieves an
mAP@0.5 of 84.6% and an
mAP@0.5:0.95 of 66.8%, which are 0.8 and 1.4 percentage points higher than those of the baseline model, respectively. As shown in
Table 10 and
Figure 7, in the license plate recognition task, RKF-YOLO achieves an overall recognition accuracy of 95.1% and a character accuracy of 98.4% on the RK3588 platform, which are 0.5 and 0.3 percentage points higher than those of YOLOv11n, respectively. These results demonstrate that the proposed lightweight design is effective not only in theoretical computational complexity but also in practical edge deployment.
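The reported latency and throughput figures are related by FPS = 1000 / latency (ms) at batch size 1. The short sketch below rechecks the reported numbers and derives the relative speed-up; the helper name is illustrative.

```python
def fps_from_latency(latency_ms):
    """Frames per second for a given per-frame latency at batch size 1."""
    return 1000.0 / latency_ms


BASELINE_MS = 24.6   # YOLOv11n on RK3588 (from the text)
PROPOSED_MS = 16.8   # RKF-YOLO on RK3588 (from the text)

if __name__ == "__main__":
    print(round(fps_from_latency(BASELINE_MS), 1))   # 40.7
    print(round(fps_from_latency(PROPOSED_MS), 1))   # 59.5
    # Relative latency reduction and speed-up factor.
    print(round(100.0 * (BASELINE_MS - PROPOSED_MS) / BASELINE_MS, 1))  # 31.7
    print(round(BASELINE_MS / PROPOSED_MS, 2))                          # 1.46
```

In other words, the measured latency drops by about 31.7%, which corresponds to a speed-up of roughly 1.46 times on the same hardware.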
4.10. Discussion
The experimental results indicate that the proposed RKF-YOLO does not merely improve detection accuracy by increasing model complexity. Instead, it improves the accuracy–efficiency trade-off under edge-computing constraints. Compared with YOLOv11n, RKF-YOLO achieves higher mAP while significantly reducing parameters and FLOPs. This result is mainly attributed to the combination of asymmetric channel reduction and feature compensation. Channel reduction reduces redundant computation in the feature fusion stage, while adaptive group convolution and C2PSA compensate for the loss of local and global feature representation.
The proposed RKF-YOLO should be understood as an application-driven methodological integration rather than a simple accumulation of existing YOLO modules. Its design is derived from the specific conflict between edge-side efficiency and dual-scale feature requirements in traffic enforcement. Illegal parking detection requires robust vehicle-level semantic representation, whereas license plate localization depends on fine-grained small-target features. Direct lightweight compression can reduce computational cost but may damage the features required for accurate plate localization. Therefore, asymmetric channel reduction is applied mainly in the feature fusion stage, while adaptive group convolution and C2PSA are used to compensate for local and global feature degradation. In addition, Focal-CIoU is applied to the license plate localization branch because low-IoU small plate samples are more sensitive to bounding-box errors than large vehicle targets. This task-specific design logic explains why the proposed combination improves the accuracy–efficiency trade-off instead of merely producing a marginal improvement through arbitrary module replacement.
The comparison with representative non-YOLO detectors in the illegal parking detection task further demonstrates the deployment-oriented advantage of the proposed method. Although Faster R-CNN and RT-DETR-R18 achieve competitive detection accuracy, their computational costs are much higher than those of RKF-YOLO. Lightweight detectors such as SSD-MobileNetV2 and EfficientDet-D0 have lower complexity but insufficient detection accuracy in complex traffic scenes. Therefore, RKF-YOLO provides a more balanced solution for edge-side illegal parking detection. For the license plate recognition pipeline, a controlled YOLO-family comparison is retained to avoid conflating detector architecture comparison with independent character recognition model comparison.
For license plate detection, Focal-CIoU improves mAP@0.5:0.95 more clearly than mAP@0.5, indicating that the proposed loss mainly enhances localization quality. This is important because inaccurate license plate localization directly affects subsequent recognition accuracy. The deployment results on RK3588 further confirm that the proposed lightweight design is effective in practical embedded environments.
5. Conclusions
This study proposes RKF-YOLO, a lightweight YOLO-based dual-task framework for illegal parking detection and license plate recognition on edge devices. By integrating Rep-CSP structural optimization, asymmetric channel reduction, feature compensation, a knowledge-transfer-inspired training strategy, and Focal-CIoU loss, the proposed model improves the accuracy–efficiency trade-off under edge-computing constraints.
Experimental results show that RKF-YOLO reduces the number of parameters and FLOPs by 38.2% and 38.1%, respectively, compared with YOLOv11n, while improving mAP@0.5 and mAP@0.5:0.95 by 0.6 and 1.1 percentage points, respectively. In the license plate recognition pipeline, RKF-YOLO achieves 95.7% overall recognition accuracy and 98.9% character accuracy. Deployment on the RK3588 platform achieves 16.8 ms inference time and 59.5 FPS for illegal parking detection while maintaining 95.1% overall recognition accuracy and 98.4% character accuracy after RKNN INT8 quantization.
Overall, RKF-YOLO provides a practical solution for edge-side traffic enforcement scenarios that require both illegal parking detection and vehicle identity confirmation.
6. Limitations and Future Work
Although RKF-YOLO achieves a favorable balance between accuracy and efficiency, several limitations remain. First, the current framework mainly relies on visual information. Under extreme weather, severe occlusion, strong illumination changes, or heavily blurred license plates, the detection and recognition performance may still degrade. Second, the self-constructed illegal parking dataset covers several typical traffic scenarios, but its scale and geographic diversity remain limited. More data from different cities, road structures, and camera viewpoints are needed to further validate the generalization ability of the model.
RKF-YOLO is not specifically designed as an adverse-weather detection model. Future work will explore weather-aware feature enhancement, temporal tracking, and multi-modal sensing to improve robustness under extreme rain, fog, low illumination, and severe occlusion. In addition, when visual license plate recognition becomes unreliable because of blur, occlusion, or poor illumination, conjoint vehicle identification systems that combine visual recognition with auxiliary technologies such as RFID or Bluetooth may provide a more fault-tolerant solution for real-world traffic enforcement [
40].
Future work will focus on three aspects. First, multi-modal perception, such as vehicle-road cooperative sensing, RFID, or temporal tracking, may be integrated to improve robustness under challenging conditions. Second, larger and more diverse traffic-scene datasets will be constructed to evaluate cross-scene generalization. Third, power-aware model optimization and hardware-aware neural architecture design will be explored to further improve deployment efficiency on embedded platforms.