Article

Urban Intelligent Transportation-Oriented License Plate Recognition Model for Severe Environments Based on Hybrid Architecture of YOLOv12, GAN and Mamba-SSM

1 Engineering Research Center of Catastrophic Prophylaxis and Treatment of Road & Traffic Safety of Ministry of Education, Changsha University of Science & Technology, Changsha 410076, China
2 School of Transportation, Changsha University of Science & Technology, Changsha 410076, China
3 Guangxi Communication Investment Technology Co., Ltd., Nanning 530001, China
* Author to whom correspondence should be addressed.
Urban Sci. 2026, 10(6), 325; https://doi.org/10.3390/urbansci10060325
Submission received: 12 April 2026 / Revised: 31 May 2026 / Accepted: 8 June 2026 / Published: 11 June 2026
(This article belongs to the Section Intelligent Cities and Technology)

Abstract

Adverse weather and low-illumination conditions in urban road scenarios substantially degrade license plate image quality, posing a major challenge to robust automatic license plate recognition for urban intelligent transportation systems and smart city construction. To address the limitations of conventional pipelines that optimize detection, enhancement, and recognition in isolation, this study proposes CLEI, a unified framework integrating YOLOv12-based detection, GAN-based image enhancement, and a novel CNN–Mamba network (CMN) for character recognition. Using a curated dataset of 3000 license plate images captured under rain, snow, fog, and nighttime urban roadside conditions, we first benchmarked several mainstream detectors and identified YOLOv12s as the most effective model in terms of accuracy, inference speed, and computational efficiency. To mitigate blur and low-quality degradation in cropped plate regions, DeblurGAN-v2 was employed for adaptive enhancement, achieving a PSNR of 16.61 dB, an SSIM of 0.8776, and an LPIPS of 0.1151. For recognition, the proposed CMN replaces the recurrent module in CRNN with a Mamba-based state-space model, improving sequence modeling efficiency and robustness. CMN achieved 93.3% plate accuracy, outperforming CRNN (91.0%) and LPRNet (88.5%), while the full CLEI framework reached 93.67% accuracy after enhancement. These results demonstrate that collaborative optimization across detection, restoration, and recognition enables accurate and efficient license plate recognition in severely degraded urban traffic environments, providing reliable technical support for urban traffic monitoring, public security governance, and smart city infrastructure construction.

1. Introduction

Automatic license plate recognition (ALPR) is a core perception technology for urban intelligent transportation systems and smart city construction, supporting key scenarios such as urban toll collection, traffic law enforcement, smart parking management, community access control, and urban public security surveillance [1,2]. With the deepening of urban digital transformation and infrastructure interconnection, reliable license plate recognition is not only a technical requirement for vehicle identification but also an important data foundation for urban traffic data-driven monitoring, refined operation management, and public safety governance [3]. In this context, the robustness of ALPR systems in real urban roadside deployment environments has become a critical issue for both academic research and urban transportation engineering practice. In this work, we focus on standard Chinese single-line civilian license plates, which typically consist of one Chinese provincial abbreviation, uppercase English letters, and Arabic digits. All characters are arranged in a single row without double-line or new energy formats, which ensures structural consistency for recognition.
Despite significant advances in deep learning, robust ALPR in real urban roadside environments remains difficult. In practical urban roadside deployments, imaging devices are frequently exposed to adverse weather and low-illumination conditions, including rain, snow, fog, nighttime scenes, glare, motion blur, and low contrast. These factors degrade plate visibility at multiple levels by weakening target boundaries, damaging local character structures, and/or disturbing sequence continuity, thereby affecting detection, restoration, and recognition simultaneously [4,5]. As shown in Figure 1, environmental degradation in urban traffic scenes often does not act in isolation; instead, multiple sources of interference are superimposed in the same scene, making license plate recognition a coupled perception problem rather than a single-stage detection or optical character recognition task [6,7].
Existing studies have addressed this problem from three main directions. The first direction focuses on plate detection, where one-stage detectors, especially the You Only Look Once (YOLO) family, have been widely adopted because of their favorable balance between localization accuracy and inference speed [8,9,10,11,12]. The second direction emphasizes image restoration, introducing deblurring, denoising, dehazing, low-light enhancement, or super-resolution to improve the visual quality of degraded plate regions before recognition [13,14,15]. The third direction targets sequence recognition, where convolutional and recurrent architectures such as a Convolutional Recurrent Neural Network (CRNN) and its variants have been commonly used to decode plate characters after feature extraction [16,17]. These approaches have substantially improved ALPR performance and established strong technical baselines for relatively controlled environments.
However, the effectiveness of current solutions in severely degraded scenes is still constrained by several unresolved issues. First, most existing methods optimize detection, enhancement, and recognition as separate modules, with limited coordination across stages. This fragmented design means that detection outputs are often passed directly to downstream recognizers even when the cropped plate regions remain severely blurred, low-contrast, or structurally incomplete. As a result, image degradation is not corrected in a task-aware manner, but instead accumulates along the processing chain. Second, restoration-oriented methods mainly aim to improve perceptual image quality, yet recognition-oriented structural recovery is a different objective. A visually sharper plate image does not necessarily preserve the fine-grained character strokes and sequence-level discriminative cues required for accurate decoding [15]. Consequently, enhancement modules optimized only for image fidelity may produce limited gains, or even introduce artifacts that impair recognition. Third, mainstream recognition models still rely heavily on recurrent sequence modeling. Although RNN-based architectures have demonstrated stable performance in regular text recognition, their sequential computation limits parallel efficiency and constrains long-range dependency modeling, especially when the input characters are partially corrupted, distorted, or weakly separated. Under such conditions, the shortcomings of conventional sequence modeling become more pronounced, because the recognizer must infer contextual dependencies from incomplete or unstable visual evidence [18].
These limitations suggest that the central challenge in adverse-environment ALPR is not merely insufficient detector accuracy or inadequate recognizer capacity in isolation, but the lack of a unified framework that can explicitly address the interaction among localization, image quality degradation, and sequence decoding. In real roadside scenarios, plate recognition performance depends on whether the system can first localize targets reliably, then recover recognition-relevant structures from degraded crops, and finally decode character sequences efficiently and robustly [3,16]. If any one of these stages is mismatched with the others, the overall system becomes vulnerable to environmental interference. Therefore, a practical ALPR solution for adverse weather and low-illumination conditions should move beyond loosely connected pipelines and instead adopt a collaborative design that links detection, restoration, and recognition within a coherent optimization logic [19].
Motivated by this need, this study proposes CLEI (Clear License Enhancement and Identification), a hybrid ALPR framework that integrates YOLOv12-based detection, Generative Adversarial Network (GAN)-based image enhancement, and a Convolutional Neural Network (CNN)–Mamba Network for sequence recognition. The framework is designed to address the above gaps from a system-level perspective. YOLOv12 is used to provide efficient and reliable plate localization in complex roadside scenes. A GAN-based enhancement module is introduced to improve degraded plate crops before character decoding, with emphasis on recovering structures that are beneficial to downstream recognition rather than merely improving visual appearance. In addition, the recurrent module in the conventional CRNN is replaced with a Mamba-based state-space model to improve sequence modeling efficiency and contextual dependency capture under degraded visual conditions. In this way, the proposed framework seeks to reduce error accumulation across stages and to better align upstream image enhancement with downstream recognition requirements.
The significance of this study lies not only in addressing the methodological gap between isolated module optimization and collaborative pipeline design but also in providing an engineering-oriented solution for real roadside systems, where latency, computational cost, environmental variation, and deployment stability must be considered simultaneously. Therefore, the proposed framework has practical relevance for intelligent transportation applications that require reliable ALPR performance in unconstrained real-world environments.
The main contributions of this work are summarized as follows:
  • We propose CLEI, a unified adverse-weather license plate recognition framework that integrates YOLOv12-based detection, GAN-based enhancement, and Mamba-based recognition into a collaborative pipeline to alleviate error accumulation under severe degradation.
  • We design a CNN–Mamba Network (CMN) that replaces traditional recurrent sequence modeling with a selective state-space model for character recognition, improving both inference efficiency and robustness to blurred characters.
  • We conduct extensive experiments under rain/snow, fog, and low-light conditions on publicly available datasets, and verify that the proposed framework outperforms classical methods in recognition accuracy while maintaining efficiency.
  • We evaluate the contribution of each module via ablation studies, confirming that both the enhancement and recognition components improve performance, and their combination achieves the best results.

2. Related Work

2.1. Research on License Plate Detection

As the initial stage of a license plate recognition system, license plate detection directly determines subsequent recognition performance through its localization accuracy and robustness [20,21]. Under favorable imaging conditions, deep learning-based object detection techniques can effectively locate license plate regions. However, in complex real-world scenarios, localization accuracy has been reported to drop to 55.2% [22]. Existing detection methods are primarily categorized into two-stage and one-stage detectors. Early two-stage methods, such as Faster R-CNN, attain high localization precision through region proposal generation and refined regression, but their inference speed often fails to meet the demands of real-time traffic applications. In recent years, one-stage architectures, represented by the YOLO [23] series, have become the mainstream solution for license plate detection due to their balanced speed and accuracy [24]. Improved versions like YOLOv8 and YOLOv11 [25] have significantly enhanced the recall rate for small, skewed, and occluded plates through strategies such as structural re-parameterization, dynamic label assignment, and multi-scale feature fusion.
However, a common limitation of existing detection models is that they typically output only bounding box coordinates and confidence scores, lacking an explicit mechanism to assess the imaging quality of the detected region [26]. In degraded scenarios—such as rain, fog, glare, or motion blur—low-quality license plate regions are still passed directly to the subsequent recognition module. This pipeline allows errors to propagate downstream, ultimately compromising the overall system’s robustness.

2.2. Research on Degraded Image Enhancement

To address the degradation of image quality caused by adverse environmental conditions, image enhancement and restoration techniques are widely employed in the pre-processing stage of license plate recognition [27,28]. Traditional methods are predominantly based on physical degradation models, such as dark channel prior for dehazing, Retinex theory for illumination correction, and Wiener filtering for deblurring. Their advantage lies in their clear physical interpretability. However, these approaches rely on strict scene assumptions and can introduce artifacts—including halos, color shifts, and over-enhancement—in complex real-world road conditions, which may ultimately distort character structures.
Data-driven deep learning enhancement methods, leveraging their powerful nonlinear fitting capabilities, demonstrate superior adaptability in complex degradation scenarios [29]. For instance, Generative Adversarial Network (GAN)-based models, such as the DeblurGAN series, learn to restore blurred images and have shown significant effectiveness in motion blur and low-light conditions [30]. Attention-based dehazing networks like FFA-Net utilize channel and spatial attention to focus on critical character regions, thereby improving detail recovery in foggy images [31].
Although existing enhancement methods can effectively improve visual quality, most are optimized solely for pixel-level fidelity metrics—such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS)—without joint constraints aligned with the downstream character recognition task. This decoupled “enhancement-then-recognition” paradigm often results in a misalignment between visual improvement and actual recognition accuracy. In some cases, enhancement may even distort character textures and degrade recognition performance. This limitation highlights the need for a more tightly integrated approach to meet the demands of robust recognition in harsh environments.
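To make the gap between pixel-level fidelity and recognizability concrete, the following minimal sketch computes PSNR from its standard definition and applies it to two toy degradations. The function name `psnr` and the toy arrays are illustrative assumptions, not part of the authors' evaluation code. Both degradations have the same mean squared error and therefore the same PSNR, although one spreads a small error over every pixel while the other concentrates it on a single pixel, as damage to a character stroke might.

```python
import numpy as np


def psnr(reference: np.ndarray, restored: np.ndarray, max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    reference = reference.astype(np.float64)
    restored = restored.astype(np.float64)
    mse = float(np.mean((reference - restored) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy 4x4 "plate crop" with constant intensity (illustrative data only).
clean = np.full((4, 4), 100, dtype=np.uint8)

# Degradation A: every pixel is off by 4 (squared error 16 everywhere).
spread = clean + 4

# Degradation B: a single pixel is off by 16 (squared error 256 on 1 of 16 pixels).
local = clean.copy()
local[0, 0] = 116

print(psnr(clean, spread), psnr(clean, local))  # the two values coincide
```

Because PSNR depends only on the mean squared error, it cannot tell where the error lands, which is one reason a fidelity-optimized enhancer is not guaranteed to help a recognizer.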

2.3. Research on Character Recognition and Sequence Modeling

License plate character sequence recognition has long relied on the CRNN architecture as a standard paradigm [32]. This framework employs a Convolutional Neural Network (CNN) to extract spatial character features [33], a Bidirectional Long Short-Term Memory (BiLSTM) network to model temporal dependencies between characters, and Connectionist Temporal Classification (CTC) for segmentation-free sequence decoding. It delivers stable performance under clear imaging conditions. However, under image degradation and noise, the inherent limitations of BiLSTM become pronounced. First, its recurrent structure prevents parallel inference across time steps, so latency increases with sequence length, a significant drawback for the real-time demands of high-speed traffic applications [34,35]. Specifically, the sequential computation of BiLSTM leads to 2–5 times higher inference latency than parallel structures. Second, its local receptive field and short-term memory mechanism constrain its ability to model long-range context. This makes the model susceptible to error propagation when characters are blurred or missing, with recognition accuracy typically decreasing by 3–8%.
Recently, State Space Models (SSMs) have emerged as a promising new pathway for sequence modeling [36]. Mamba, an efficient SSM implementation, introduces a selective state-space mechanism that dynamically filters key information and suppresses noise. It also employs a hardware-aware parallel scan algorithm, reducing the time complexity from the quadratic $O(L^2)$ of self-attention to a linear $O(L)$. This provides significant advantages for long-sequence modeling and inference efficiency. Although Mamba has been validated in domains such as natural language processing and medical image sequence analysis, its application to the specific OCR [37] task of license plate recognition remains largely unexplored [38]. Existing attempts have focused primarily on general document OCR and have not been tailored to the distinct challenges of license plate recognition, which involves short sequences, strong structural patterns, low resolution, and high noise. Consequently, its robustness and contextual compensation capabilities in degraded license plate recognition scenarios require systematic investigation.
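The reason a linear state recurrence admits a parallel scan at all is that each step $h_t = a_t h_{t-1} + b_t$ is an affine map, and composing affine maps is associative. The sketch below illustrates only this property with scalar states; the names `combine`, `sequential_scan`, and `doubling_scan` are hypothetical, and the doubling scheme shown is a simple textbook scan rather than Mamba's hardware-aware, work-efficient kernel.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (a, b) represents the affine map h -> a * h + b


def combine(first: Pair, second: Pair) -> Pair:
    """Compose two affine maps: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(pairs: List[Pair]) -> List[float]:
    """Reference recurrence, one step after another, starting from h_0 = 0."""
    h, states = 0.0, []
    for a, b in pairs:
        h = a * h + b
        states.append(h)
    return states


def doubling_scan(pairs: List[Pair]) -> List[float]:
    """Inclusive scan in about log2(L) rounds; updates within a round are independent."""
    acc = list(pairs)
    step = 1
    while step < len(acc):
        acc = [
            combine(acc[i - step], acc[i]) if i >= step else acc[i]
            for i in range(len(acc))
        ]
        step *= 2
    # With h_0 = 0, the state equals the offset term of the composed map.
    return [b for _, b in acc]


pairs = [(0.5, 1.0), (0.5, 2.0), (0.25, 4.0), (2.0, 1.0), (1.0, 3.0)]
print(sequential_scan(pairs))
print(doubling_scan(pairs))  # same states, computed round by round
```

Each round of `doubling_scan` could run on parallel hardware, since no element depends on another element updated in the same round. This simple scheme performs more total work than the sequential loop; work-efficient scans avoid that overhead.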
Recently, Transformer-based optical character recognition (OCR) and end-to-end automatic license plate recognition (ALPR) models have achieved remarkable performance in general text recognition scenarios. Representative methods include PARSeq [39], ABINet [40], ViTSTR [41], and TrOCR [42], which leverage self-attention mechanisms to model long-range dependencies and have become competitive baselines in standard text recognition benchmarks. However, these models are primarily designed for clean, high-resolution, and well-structured general text images. Their applicability to severely degraded license plates characterized by short sequences, strong structural constraints, low resolution, and high noise under adverse weather and low-illumination conditions has not been systematically verified. Direct deployment without customized adaptation for license plate characteristics often leads to performance degradation due to sensitivity to local texture damage and high dependency on clear input signals.
Research on detection, enhancement, and recognition in recent years is summarized in Table 1.

2.4. Research on Collaborative Recognition for Harsh Environments Is Insufficient

A review of existing research reveals that current robust license plate recognition systems still face critical bottlenecks, including isolated module design, misaligned objectives, and outdated modeling paradigms.
First, detection modules focus primarily on localization accuracy, lacking an integrated mechanism to assess image quality. This allows low-quality regions to propagate errors to downstream stages. Second, image enhancement methods are typically optimized for visual fidelity metrics rather than recognition performance, which can yield visually improved images that are less recognizable. Finally, prevailing recognition modules continue to rely on traditional recurrent sequence models, which struggle with the efficiency and robustness required for complex, degraded environments.
While individual modules have seen significant improvements, the prevailing approach of optimizing them in isolation fails to address the systemic performance degradation caused by environmental interference. Current methods lack an integrated, collaborative framework that synergizes quality-aware detection, task-driven enhancement, and efficient, robust sequence recognition [43,44]. Therefore, developing a unified framework for license plate recognition in complex degraded environments—one that breaks down module barriers and enables joint optimization of detection, enhancement, and recognition—holds substantial theoretical and practical significance.

3. Methodology

3.1. Overall Architecture of CLEI

To address the challenges of low clarity and complex background interference in license plate images under adverse conditions, this paper proposes CLEI (Clear License Enhancement and Identification), a hybrid license plate recognition method based on an architecture integrating YOLOv12, GAN, and Mamba-SSM. The method operates through a three-stage cascade of “plate localization, image enhancement, and character recognition.” Its overall architecture is illustrated in Figure 2, which gives a high-level overview of the framework; detailed sub-modules are illustrated in subsequent figures for better readability.
The core innovation of CLEI lies in its cascaded module design tailored for adverse conditions. The principal difficulties in license plate recognition under such environments are progressive: complex backgrounds increase the difficulty of localizing the plate region, low-quality images lead to loss of character details, and blurred characters directly reduce recognition accuracy [26]. To address this, CLEI employs a serial “localization–enhancement–recognition” pipeline. First, YOLOv12 accurately extracts the license plate region from complex backgrounds. Next, a GAN performs deblurring and enhancement on the cropped, low-quality plate image. Finally, a CMN recognizer transcribes the characters from the enhanced, clear plate image. This pipeline achieves a “layered problem-solving” approach, ensuring each module is specifically matched to a particular challenge, thereby preventing any single module from being overwhelmed by multi-dimensional interference.
Furthermore, CLEI incorporates an efficiency-optimized feature-sharing mechanism. Traditional multi-module architectures often introduce redundant parameters and computational overhead due to repeated feature extraction operations. In our method, the GAN enhancement module reuses the backbone feature extractor from YOLOv12. The environment–plate feature distribution learned by YOLOv12 during the localization stage is directly transferred to the initial feature extraction process of the enhancement module. This not only reduces the total parameter count but also ensures feature representation consistency between the localization and enhancement stages, avoiding performance degradation caused by cross-module feature distribution shifts. This design enhances inference efficiency while maintaining high performance [24].
Finally, CLEI features an innovative character recognition module. The traditional RNN module in CRNN architectures suffers from low modeling efficiency and weak long-range dependency capture when processing sequences, making it ill-suited for the temporal relationships in license plate characters (which include province abbreviations, letters, and numbers). To overcome this, CLEI replaces the traditional RNN with a Mamba-based State Space Model (SSM) to construct the CMN recognizer. The Mamba-SSM employs a selective state-space mechanism that captures long-range dependencies in character sequences with linear complexity while preserving local character details. This substitution improves the efficiency of sequence modeling and enhances the robustness to blurred or distorted characters, making it well-suited for license plate recognition in challenging environments [45].
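The serial “localization–enhancement–recognition” control flow described in this section can be sketched as a simple composition of three stages. This is a minimal illustration under stated assumptions: `run_cascade`, `crop`, `PlateResult`, and the toy stand-in stages are hypothetical names, the stages are passed in as plain callables, and the feature sharing between the detector and the enhancement module is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image as nested lists (rows of pixels)
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) with exclusive lower-right corner


@dataclass
class PlateResult:
    box: Box
    text: str


def crop(image: Image, box: Box) -> Image:
    """Cut the plate region out of the full frame."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def run_cascade(
    image: Image,
    detect: Callable[[Image], List[Box]],
    enhance: Callable[[Image], Image],
    recognize: Callable[[Image], str],
) -> List[PlateResult]:
    """Stage 1 localizes plates, stage 2 enhances each crop, stage 3 reads it."""
    results = []
    for box in detect(image):
        plate = enhance(crop(image, box))
        results.append(PlateResult(box=box, text=recognize(plate)))
    return results


# Toy stand-ins for the three stages (placeholders, not real models).
frame = [[r * 10 + c for c in range(4)] for r in range(3)]
found = run_cascade(
    frame,
    detect=lambda img: [(1, 0, 3, 2)],
    enhance=lambda plate: plate,
    recognize=lambda plate: f"{len(plate)}x{len(plate[0])}",
)
print(found)
```

Keeping the stages behind plain callables makes it straightforward to swap a detector, enhancer, or recognizer when comparing module combinations.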

3.2. License Plate Detection Module

YOLOv12 represents a significant evolution in the YOLO series, integrating the attention mechanism as a foundational component for the first time, thereby moving beyond a purely convolutional design paradigm [31]. Its core innovation is the Area Attention module, which divides the feature space into vertical or horizontal regions for computation. This design expands the receptive fields while effectively managing computational costs. As illustrated in Figure 3, the YOLOv12 model consists of three key functional components: the backbone, the neck, and the detection head. The backbone is used to extract multi-scale visual features from input images. The neck fuses low-level detail information and high-level semantic information. The head outputs the final license plate bounding boxes and confidence scores.
Concurrently, the YOLOv12s variant employs a Residual ELAN (R-ELAN) structure, which incorporates block-level residual connections. This architecture enhances training stability for large models, reduces parameter count, and ensures efficient feature fusion. The design successfully balances performance and practicality.
Additional refinements in YOLOv12 include the integration of Flash-Attention to optimize GPU memory access efficiency, the removal of positional encodings, and the adjustment of the MLP expansion ratio from 4 to a range of 1.2–2.0 to better balance computational resources. The model also incorporates large convolutional kernels to serve as position-aware operators [22]. Collectively, these enhancements ensure efficient operation in real-time scenarios.
Given its strong performance in object detection tasks, we select YOLOv12s as the foundational model for license plate region detection in this work [28,36]. This choice aims to provide the most accurate and reliable input for the subsequent GAN-based deblurring and character recognition modules, thereby securing the performance of the entire recognition pipeline from its initial stage.

3.3. Recognition-Oriented Enhancement Module

Generative Adversarial Networks (GANs), introduced by Goodfellow et al. in 2014, represent a highly innovative framework within deep learning for generative modeling [29]. The core idea is an adversarial process between two neural networks—the Generator (G) and the Discriminator (D)—that are jointly optimized through a minimax game.
The generator aims to learn the distribution of real data to produce synthetic samples that are indistinguishable from genuine ones. Conversely, the discriminator acts as a binary classifier, distinguishing whether an input sample comes from the real data distribution or is synthesized by the generator.
This adversarial training is formalized by the following minimax objective function V(D, G):
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$
This objective function jointly guides the optimization of both the generator $G$ and the discriminator $D$: the generator aims to minimize the objective value, whereas the discriminator aims to maximize it. The first term corresponds to the discriminator’s objective on real data $x$ (drawn from the true data distribution $p_{data}(x)$), which is to maximize the probability of correctly identifying a real sample. The second term involves the generator: it maps a random latent vector $z$ (sampled from a prior distribution $p_z(z)$) to a synthetic sample $G(z)$ and seeks to minimize the probability that the discriminator recognizes the generated sample as fake, thereby enabling the generator to approximate the true data distribution. During training, the generator adjusts its parameters to produce samples that increasingly resemble real data. Meanwhile, the discriminator improves its ability to distinguish between real and generated samples through alternating exposure to both, while providing gradient feedback to the generator. This dynamic adversarial process ideally converges toward a Nash equilibrium [29], enabling the generator to produce highly realistic samples.
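As a small numerical illustration of the minimax objective, the sketch below estimates $V(D, G)$ from batches of discriminator outputs. The function name `gan_value` and the sample probabilities are assumptions made for illustration; the sketch covers only the original GAN objective stated above and says nothing about the specific loss used to train DeblurGAN-v2.

```python
import numpy as np


def gan_value(d_real, d_fake, eps: float = 1e-12) -> float:
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    d_real = np.clip(np.asarray(d_real, dtype=np.float64), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=np.float64), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))


# A discriminator that separates real from fake well yields a high value ...
confident = gan_value([0.9, 0.95], [0.1, 0.05])

# ... while a fully confused discriminator (0.5 everywhere) yields -log(4),
# the value reached when generated and real samples are indistinguishable.
confused = gan_value([0.5, 0.5], [0.5, 0.5])

print(confident, confused, -np.log(4.0))
```

The discriminator pushes this value up by separating the two batches, and the generator pushes it down by making `d_fake` drift toward the outputs seen on real data.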
Compared to traditional methods that rely on manually defined blur models, DeblurGAN-v2 adopts a data-driven approach to directly learn the mapping from blurred to clear images, making it better suited for real-world complex scenarios. Built upon a pre-trained network backbone, the model first extracts multi-scale features through successive downsampling via pooling layers and convolutional blocks. These features are then processed by 1 × 1 convolutions to adjust channel dimensions before being passed to upsampling modules. Through progressive upscaling (8× and 4×) and the fusion of high- and low-level features via addition and concatenation, the network reconstructs a clear output image.
The generated image is evaluated by two discriminators: a global discriminator assesses the overall naturalness of the full image, while a local discriminator examines the authenticity of details in randomly cropped patches. By working together, these two discriminators guide the generator toward continuous improvement. The complete pipeline effectively preserves fine textures and mitigates common artifacts—such as halos and blurred edges—that often occur with conventional methods.

3.4. CMN Recognition Module

While the traditional CRNN has achieved notable success in license plate recognition, it relies on RNNs (such as LSTMs or GRUs) for sequence modeling [22,40]. This introduces inherent limitations, including restricted long-range dependency modeling, low training parallelism, and high inference latency. These weaknesses are particularly problematic in harsh environments, where local blurring and structural breaks in character sequences can cause error accumulation in RNNs, ultimately reducing recognition robustness [21].
To address these issues, this paper proposes the CMN (CNN–Mamba Network) recognizer, which replaces the recurrent units in the CRNN entirely with a Mamba-based State Space Model (SSM). The architecture of CMN is illustrated in Figure 4 and Figure 5.
Figure 4 shows the detailed data flow within the Mamba-SSM module. The input feature sequence first undergoes linear projection to obtain a hidden feature representation. The sequence is then processed by the SSM core to capture long-range dependencies and suppress noise interference. After sequence transformation and nonlinear operation, the output feature is sent to the subsequent layer for further sequence modeling.
Figure 5 illustrates the complete data flow of the CMN recognition module. The input license plate image is first processed by convolutional layers to extract spatial feature maps. The feature maps are then converted into a sequential feature representation and fed into the Mamba-SSM temporal modeling layer to capture character dependencies. Finally, the transcription layer generates the final predicted character sequence.
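The first and last steps of this data flow can be sketched as follows: turning a feature map into a column-wise sequence and turning per-step class scores into a character string. Both functions and their names are illustrative assumptions. In particular, greedy CTC decoding with the blank at class index 0 is assumed here because CRNN-style recognizers commonly use CTC; the source does not spell out the decoding rule of the CMN transcription layer.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank class


def feature_map_to_sequence(feature_map: np.ndarray) -> np.ndarray:
    """Reshape a (C, H, W) feature map into a (T, d) sequence.

    Each image column becomes one time step, so T = W and d = C * H.
    """
    channels, height, width = feature_map.shape
    return feature_map.transpose(2, 0, 1).reshape(width, channels * height)


def ctc_greedy_decode(scores: np.ndarray, alphabet: str) -> str:
    """Decode (T, K) class scores: argmax per step, merge repeats, drop blanks.

    Class k > 0 maps to alphabet[k - 1]; class 0 is the blank.
    """
    best = scores.argmax(axis=1)
    chars = []
    previous = BLANK
    for index in best:
        if index != BLANK and index != previous:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)


# Toy example: 7 time steps over the classes {blank, "A", "B"}.
path = [1, 1, 0, 1, 2, 2, 0]
scores = np.eye(3)[path]                 # one-hot scores following the path
print(ctc_greedy_decode(scores, "AB"))   # repeated "A" survives thanks to the blank

features = np.arange(24, dtype=np.float64).reshape(2, 3, 4)  # (C=2, H=3, W=4)
print(feature_map_to_sequence(features).shape)
```

The blank between the two runs of class 1 is what allows a doubled character to be kept rather than merged, which matters for plates with repeated digits.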
Specifically, the CMN recognizer retains the CNN-based feature extractor from the CRNN front-end [18,27]. After extracting a high-dimensional feature map from the input license plate image, it is transformed into a sequential representation via column-wise slicing, yielding a sequence $X = [x_1, x_2, \ldots, x_T] \in \mathbb{R}^{T \times d}$, where $T$ denotes the sequence length and $d$ the channel dimension. This sequence is then fed into a Mamba-SSM layer for sequential modeling. Mamba-SSM is built upon a selective state space model, whose core mechanism can be summarized as follows:
$$h_t = A(\Delta_t)\, h_{t-1} + B(\Delta_t)\, x_t$$
$$y_t = C_t\, h_t$$
In this formulation, the discretization step size $\Delta_t$, input projection matrix $B_t$, and output projection matrix $C_t$ are dynamically generated from the current input $x_t$. This input-dependent selectivity enables the model to adaptively suppress noise (such as stains or glare) while strengthening its ability to model long-range dependencies among key character structures. Compared to the fixed gating mechanism of an LSTM, Mamba-SSM achieves global receptive field modeling with linear complexity, significantly improving both sequence inference speed and contextual coherence [36].
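The recurrence above can be written out as a short reference loop. The sketch below is a minimal NumPy illustration under several assumptions that are not stated in the source: the state is diagonal per channel, the step size is obtained with a softplus, the discretization uses $\exp(\Delta_t A)$ for the state matrix and the simplified product $\Delta_t B_t$ for the input matrix (a common choice in Mamba-style implementations), and the projection matrices `W_delta`, `W_B`, and `W_C` are hypothetical parameters. It runs step by step for clarity and does not use a parallel scan.

```python
import numpy as np


def softplus(v: np.ndarray) -> np.ndarray:
    """Smooth positive function used to keep the step size above zero."""
    return np.log1p(np.exp(v))


def selective_ssm(x, A, W_delta, W_B, W_C):
    """Sequential reference loop for a selective state-space layer.

    x:       (T, d) input sequence
    A:       (d, n) state matrix with negative entries (diagonal per channel)
    W_delta: (d, d) projection producing the step size from x_t
    W_B:     (d, n) projection producing B_t from x_t
    W_C:     (d, n) projection producing C_t from x_t
    returns: (T, d) output sequence
    """
    T, d = x.shape
    h = np.zeros_like(A, dtype=np.float64)      # hidden state, shape (d, n)
    y = np.empty((T, d), dtype=np.float64)
    for t in range(T):
        delta = softplus(x[t] @ W_delta)        # (d,) input-dependent step size
        B_t = x[t] @ W_B                        # (n,) input-dependent input matrix
        C_t = x[t] @ W_C                        # (n,) input-dependent output matrix
        A_bar = np.exp(delta[:, None] * A)      # (d, n) decay factors in (0, 1)
        B_bar = delta[:, None] * B_t[None, :]   # (d, n) simplified discretization
        h = A_bar * h + B_bar * x[t][:, None]   # h_t = A(Δ_t) h_{t-1} + B(Δ_t) x_t
        y[t] = h @ C_t                          # y_t = C_t h_t
    return y


rng = np.random.default_rng(0)
T, d, n = 6, 3, 4
x = rng.normal(size=(T, d))
A = -np.abs(rng.normal(size=(d, n))) - 0.1      # strictly negative for stability
W_delta = rng.normal(size=(d, d))
W_B = rng.normal(size=(d, n))
W_C = rng.normal(size=(d, n))
print(selective_ssm(x, A, W_delta, W_B, W_C).shape)
```

A small step size keeps the decay factor close to 1 and lets little of the current input into the state, while a large step size does the opposite; this is the mechanism by which input-dependent parameters can down-weight uninformative columns.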
The proposed CMN recognizer for license plate recognition in harsh environments offers two primary advantages over traditional BiLSTM-based approaches.
(1) Superior Inference Efficiency. BiLSTMs process a sequence one step at a time in both directions, so although their cost grows linearly with sequence length, the computation cannot be parallelized across time steps. While this is manageable for typical 7–8 character plates, it creates a latency bottleneck when the recognizer is integrated with real-time front-end modules such as YOLO and GAN. In contrast, Mamba, based on a State Space Model (SSM), retains linear time complexity while capturing long-range character dependencies in parallel, offering inference that can be 1–2 orders of magnitude faster than BiLSTM and thereby meeting the real-time demands of the application.
(2) Enhanced Robustness to Degradations. BiLSTMs exhibit limited robustness to noise and blur. When characters are degraded, fragmented, or deformed in harsh conditions, they tend to lose long-range contextual information. Mamba’s “selective state update” mechanism dynamically focuses on relevant character features. Combined with convolutional layers that supplement local texture information, this results in significantly more stable sequence modeling for blurred or incomplete license plates.
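The two ends of the recognition data flow described in this section, column-wise slicing of the feature map and the final transcription step, can be sketched as follows. This is a minimal illustration under stated assumptions: the feature map is a nested list indexed as `[channel][row][column]`, and the transcription layer is assumed to be CTC-style greedy decoding as in CRNN, which the text does not state explicitly. The names `map_to_sequence` and `ctc_greedy_decode` are hypothetical.

```python
def map_to_sequence(feature_map):
    """Slice a C x H x W feature map column by column.

    Returns T = W vectors; each vector concatenates all channels (and rows)
    of one column, so its length is d = C * H (d = C when H == 1).
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][r][t] for c in range(channels) for r in range(height)]
        for t in range(width)
    ]


def ctc_greedy_decode(label_ids, blank=0):
    """Collapse repeated labels, then drop blanks (assumed CTC-style transcription)."""
    decoded = []
    previous = None
    for label in label_ids:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy feature map with C = 2, H = 1, W = 3, and a toy per-step label sequence.
sequence = map_to_sequence([[[1, 2, 3]], [[4, 5, 6]]])
labels = ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 2])
```

In the full model, the sequence model sits between these two functions and produces the per-step label scores from which the label indices are taken.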

4. Experiment and Results

4.1. Dataset Descriptions

To address the core issues of insufficient generalization and weak robustness in license plate detection models under harsh urban roadside natural conditions, this study focuses on complex interference scenarios encountered in real-road applications [18,46]. Adhering to the requirements of diversity, representativeness, and completeness for training deep learning models, we systematically construct a dedicated license plate detection dataset for harsh environments. This dataset comprehensively covers various extreme conditions and interfering factors, ensuring the effectiveness of model training and the reliability of experimental conclusions from the data source.
All images in the dataset are sourced from the widely recognized, standard public datasets CCPD2019 and CCPD2020 [30]. As authoritative open-source datasets in the field of Chinese license plate detection, they are known for their extensive application, standardized annotations, and diverse scenarios, providing high-quality baseline samples for model training. For this study, we precisely selected and collected a total of 3000 valid images, meticulously removing samples that were blurred, incorrectly annotated, or redundant to guarantee the quality of each individual image.
Regarding geographical distribution and vehicle types, all dataset samples were collected from real-road scenes in Anhui Province, China [26]. The dataset primarily consists of civilian vehicle license plates registered in Anhui, with a small proportion of samples from other regions, reflecting the realistic composition of traffic. The vehicles are predominantly private cars, supplemented by a limited number of other small, non-commercial civilian vehicles, ensuring the dataset covers vehicle types that align with actual road conditions.
The license plate format is strictly standardized to single-row plates, which constitute 100% of the dataset. It excludes other formats such as double-row yellow plates for large vehicles or dual-row plates for new energy vehicles. This uniformity in plate layout prevents format variance from interfering with the model’s extraction of core visual features and aligns with the mainstream application scenario for small civilian vehicle plates in China.
To accurately simulate the challenges that harsh environmental conditions pose to license plate detection in real-road settings, the dataset constructed for this study focuses exclusively on four core adverse scenarios: rain, snow, fog, and low-light nighttime conditions. It contains no clear, unobstructed samples; every image exhibits some degree of environmental interference or feature degradation. This design maximizes the model’s adaptability to complex real-world conditions.
Grouped into three categories, namely rain/snow, fog, and low-light nighttime conditions, these scenarios were comprehensively tested using the four recognition models to validate the effectiveness and robustness of the proposed framework.
The distribution of samples across these categories is carefully balanced to reflect their real-world occurrence probabilities, preventing distribution bias from adversely affecting model learning. The specific composition is as follows:
Rain/Snow (56.7%, 1702 images): This category includes challenges such as cold-color cast from snowy conditions, obstruction by falling snowflakes, partial snow cover on plates, road surface glare during rain, raindrop occlusion, and image noise from rainfall.
Fog (16.1%, 483 images): These samples are characterized primarily by low visibility, reduced image contrast, and blurred license plate edges.
Low-Light Nighttime (27.2%, 815 images): This set encompasses conditions such as loss of detail in unlit areas, strong backlight or glare from vehicle headlights, and halo effects caused by direct street lighting.
These three scenario categories are mutually complementary, comprehensively covering the core adverse environments encountered in practical license plate detection applications. The dataset is split into training and validation sets with a ratio of 4:1. For data augmentation, we adopt basic geometric and photometric transformations, as well as lightweight adjustment via GAN, to improve the diversity and quality of training samples. No obvious overfitting is observed during training. Extending the model to other license plate formats is left to future work. Representative samples from these complex scenarios are illustrated in Figure 6.
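The subset shares and the 4:1 split stated above can be checked with a short sketch. The subset counts come from the text; the function names, the use of a seeded random shuffle, and the absence of per-category stratification are assumptions, since the splitting procedure is not described in detail.

```python
import random

# Image counts per scenario category, as reported in the text.
SUBSET_COUNTS = {"rain_snow": 1702, "fog": 483, "low_light": 815}


def subset_shares(counts):
    """Percentage share of each category, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


def split_train_val(items, train_parts=4, val_parts=1, seed=0):
    """Shuffle with a fixed seed and split train:val = train_parts:val_parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_parts // (train_parts + val_parts)
    return shuffled[:n_train], shuffled[n_train:]


shares = subset_shares(SUBSET_COUNTS)
# Integer indices stand in for image files here.
train_ids, val_ids = split_train_val(range(sum(SUBSET_COUNTS.values())))
```

A stratified split that preserves the category shares in both sets would be a natural alternative when per-category results are reported.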

4.2. Implementation Details and Results

To improve the reliability and repeatability of the experimental evaluation, all indicators in this section are obtained from five independent training and testing runs, and are presented as mean ± 95% confidence interval (CI). Paired two-tailed t-tests are conducted between the proposed method and each baseline model to verify statistical significance (p < 0.05). Furthermore, to evaluate the anti-interference ability of the proposed CLEI framework more comprehensively, we divide the adverse environment dataset into three degradation levels: slight, moderate, and severe, according to the intensity of blur, noise, fog density, and illumination attenuation, and carry out a detailed robustness analysis.
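The reporting protocol described above (mean ± 95% CI over five runs, plus a paired t-test) can be sketched with the standard library alone. The accuracy values below are made-up placeholders and not results from this study; the function names are hypothetical. The constant 2.776 is the two-tailed Student t critical value for 4 degrees of freedom at the 0.05 level, which applies because there are five runs.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-tailed t critical value, alpha = 0.05, df = 5 - 1


def mean_ci95(values, t_crit=T_CRIT_95_DF4):
    """Return (mean, half-width of the 95% CI) for a small sample of runs."""
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


def paired_t_statistic(a, b):
    """t statistic of the paired differences a_i - b_i (runs matched by index)."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Placeholder accuracies (%) for five matched runs; illustrative only.
proposed_runs = [93.1, 93.4, 93.2, 93.5, 93.3]
baseline_runs = [90.8, 91.2, 90.9, 91.1, 91.0]

mean_acc, ci_half = mean_ci95(proposed_runs)
t_stat = paired_t_statistic(proposed_runs, baseline_runs)
significant = abs(t_stat) > T_CRIT_95_DF4  # equivalent to p < 0.05 for df = 4
```

Comparing the statistic against the critical value gives the same decision as checking p < 0.05, without needing a t-distribution CDF.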

4.2.1. License Plate Detection Experiment

(1) Test environment
All experiments were conducted under identical environmental conditions. The platform ran Windows 11 on an Intel Core i5-14600KF CPU and an NVIDIA GeForce RTX 5060 GPU with 8 GB of VRAM. The model was trained with the following hyperparameters: an initial learning rate of 0.01, 100 epochs, a weight decay of 0.0005, a batch size of 16, and a uniform input image size of 720 × 1160 pixels. In real-world roadside imaging systems, most real-time detection devices are equipped with embedded GPUs or AI accelerators that support similar parallel computing capabilities. The proposed model maintains high efficiency and can be readily deployed on such hardware platforms.
(2) Test Evaluation Indicators
To evaluate the license plate detection performance of the proposed model under harsh conditions, experiments were conducted not only on the full dataset of 3000 images but also independently on the three adverse-weather subsets: rain/snow (1702 images), fog (483 images), and low-light conditions (815 images) [18,31]. The models were comprehensively assessed using the following metrics: Precision (P), Recall (R), mean Average Precision (mAP), number of parameters (Params), and floating-point operations (GFLOPs) [47]. For a more detailed analysis, mAP50 and mAP50-95 were also employed. Here, mAP50 denotes the average precision at an Intersection over Union (IoU) threshold of 0.5, while mAP50-95 represents the average mAP computed over IoU thresholds ranging from 0.5 to 0.95 in 0.05 increments [22].
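The IoU thresholds behind mAP50 and mAP50-95 can be made concrete with a small sketch. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)`; the function names and the dictionary of per-threshold AP values are hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP50-95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def mean_over_thresholds(ap_at_threshold):
    """Average AP values keyed by IoU threshold (the mAP50-95 style average)."""
    return sum(ap_at_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


# A prediction counts as a true positive at threshold t when its IoU with a
# ground-truth plate is at least t.
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
is_tp_at_50 = iou >= 0.5
```

With a single class, mAP50 is simply the AP at the first threshold, while mAP50-95 averages the AP over all ten thresholds.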
All reported metrics were calculated on the test sets. The detection results for the rain/snow, fog, and low-light subsets were obtained by inferring with a single, unified model trained on the entire 3000-image dataset.
$$AP = \int_0^1 P(R)\, dR$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
Precision (P) is the proportion of correctly predicted positive samples among all samples predicted as positive, i.e., the ratio of correctly identified license plates to the total number of predicted license plates.
Recall (R) is the proportion of correctly predicted positive samples among all actual positive samples, i.e., the ratio of correctly identified license plates to the total number of actual license plates.
F1-score is the harmonic mean of Precision and Recall. It provides a single metric that balances both measures, offering a more comprehensive assessment of model performance.
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
Here, TP (True Positives) denotes the number of correctly predicted license plates, FP (False Positives) represents the number of non-plate regions incorrectly predicted as plates, and FN (False Negatives) indicates the number of actual license plates that the model failed to detect. The Precision–Recall (P-R) curve plots Precision (P) on the y-axis against Recall (R) on the x-axis. Since the detection task in this work involves only a single class (license plate), the number of classes N and the index i are both 1.
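The definitions of Precision, Recall, F1-score, and AP translate directly into code. The sketch below is illustrative only: the counts and curve points are made up, the function names are hypothetical, and AP is approximated with a plain trapezoidal rule over the P-R curve, whereas detection toolkits typically use an interpolated variant.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when nothing was predicted."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no ground-truth plates."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def average_precision(recalls, precisions):
    """Trapezoidal approximation of AP, the area under P(R) for R in [0, 1].

    Points must be sorted by increasing recall.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


# Made-up counts and curve points, for illustration only.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
f1 = f1_score(p, r)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
```

Because the task has a single class, mAP equals the AP of that class.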
The number of parameters (Params) refers to the total count of learnable parameters in the model, indicating its scale and memory footprint. A lower parameter count signifies a more lightweight model, which demands less hardware storage and is more suitable for deployment on devices with limited computational resources.
GFLOPs (Giga Floating Point Operations) quantifies the total floating-point operations required for a single forward pass, reflecting the model’s computational complexity and inference time. Given comparable detection accuracy, a lower GFLOP value indicates higher computational efficiency, faster inference speed, and lower power consumption.
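To make the two complexity measures concrete, the following sketch counts the parameters and floating-point operations of a single standard convolution layer. The layer sizes are arbitrary examples, the function names are hypothetical, and one multiply-accumulate is counted as two FLOPs with the bias ignored; counting conventions differ between tools, so reported GFLOPs are only comparable when measured the same way.

```python
def conv2d_params(c_in, c_out, kernel, bias=True):
    """Learnable parameters of a standard k x k convolution layer."""
    return c_out * (c_in * kernel * kernel + (1 if bias else 0))


def conv2d_flops(c_in, c_out, kernel, out_h, out_w):
    """FLOPs of one forward pass, counting each multiply-accumulate as 2 FLOPs."""
    return 2 * c_in * kernel * kernel * c_out * out_h * out_w


# Arbitrary example layer: 3 -> 16 channels, 3 x 3 kernel, 100 x 100 output.
params = conv2d_params(3, 16, 3)
gflops = conv2d_flops(3, 16, 3, 100, 100) / 1e9
```

Summing these quantities over all layers gives the model-level Params and GFLOPs figures of the kind reported in the comparison tables.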
(3) Analysis of Test Results
To comprehensively evaluate the performance of the models for license plate detection, a comparative experiment was conducted on the dataset of 3000 images using several mainstream object detection algorithms: YOLOv5s, YOLOv8s, YOLOv10s, YOLOv12s, and RT-DETR. The experimental environment and dataset remained consistent for all trials. The results are presented in Table 2. (All data presented in bold in the tables of this manuscript represent the highest values among the comparison results.)
As shown in Table 2, all lightweight YOLO variants demonstrated strong detection performance on the full license plate dataset. While RT-DETR maintained a competitive mAP50-95 score, its parameter count (42.7M) and computational cost (178.3 GFLOPs) were substantially higher, placing it at a significant disadvantage in terms of model complexity and storage requirements. Therefore, RT-DETR was excluded from subsequent comparative experiments focused on the three specific adverse environments: rain/snow, fog, and low-light conditions [36,48]. The following analysis concentrates solely on the YOLO models.
Among the compared YOLO models (v5s, v8s, v10s, and v12s), YOLOv12s achieved the best overall performance. It attained perfect recall (100%) and near-perfect precision and F1-score (both 99.99%), surpassing the other variants. Its mAP50 score matched the others at 99.50%, and its mAP50-95 score of 96.78% remained highly competitive. Furthermore, YOLOv12s maintained a practical balance in model size and computational demand, with 9.11M parameters and 19.3 GFLOPs. This represents a clear advantage over the larger YOLOv8s (11.13M Params, 28.4 GFLOPs) and is only slightly higher than the most lightweight model, YOLOv5s (7.01M Params, 15.8 GFLOPs). In summary, YOLOv12s achieves an optimal trade-off between detection accuracy, model compactness, and computational efficiency for this task.
The results in Table 3, Table 4 and Table 5 demonstrate that all four lightweight YOLO variants maintained strong detection performance across the three adverse conditions: rain/snow, fog, and low-light environments. Each model consistently achieved an mAP50 of 99.50%, with precision (P) and recall (R) consistently near-perfect across most scenarios. This indicates a robust adaptation to complex illumination and weather-related interference. While the mAP50-95 metric showed some variation, it remained consistently high, ranging from 97.3% to 98.5%, reflecting robust fine-grained localization capabilities under challenging conditions.
In the trade-off between detection accuracy and model lightweightness, each model exhibited distinct characteristics. YOLOv8s achieved the highest mAP50-95 scores across all three conditions, though with significantly higher inference latency and computational cost. YOLOv10s offered a balance between accuracy and speed, yet its precision (P) and recall (R) were slightly lower than those of YOLOv5s and YOLOv12s. YOLOv12s maintained mAP50-95 scores (97.3–97.8%) very close to the best-performing model, while matching the highest P and R values, indicating that its efficient design did not compromise detection accuracy.
In terms of real-time performance and inference efficiency, YOLOv12s demonstrated a clear advantage. In rain/snow conditions, its latency was 4.8 ms (208 FPS); in fog, 5.3 ms (189 FPS); and in low-light scenes, 4.9 ms (204 FPS). These results significantly outperformed the other three models.
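The reported frame rates follow from the per-image latencies through FPS = 1000 / latency (in ms). The short check below uses the YOLOv12s latencies quoted above; the dictionary keys and function name are hypothetical.

```python
# YOLOv12s per-image latency in milliseconds, as quoted in the text.
LATENCY_MS = {"rain_snow": 4.8, "fog": 5.3, "low_light": 4.9}


def fps_from_latency_ms(latency_ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms


fps = {name: round(fps_from_latency_ms(ms)) for name, ms in LATENCY_MS.items()}
```

Rounded to the nearest integer, these values reproduce the 208, 189, and 204 FPS figures given in the text.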
Overall, considering the combined metrics of accuracy, model efficiency, and real-time performance, YOLOv12s achieves the optimal balance for high-precision detection and inference speed across all three adverse conditions. This makes it particularly well-suited for license plate detection applications with stringent real-time requirements.

4.2.2. DeblurGAN-v2 Processing Experiment

(1) Test environment
The experiments for this stage ran on Windows 11 with an Intel Xeon Platinum 8474C CPU and an NVIDIA GeForce RTX 4090D GPU with 24 GB of VRAM. The model was trained with an initial learning rate of 0.01, a weight decay of 0.0005, and a batch size of 16.
(2) Test Evaluation Indicators
The evaluation criteria used to assess license plate deblurring performance are shown in Table 6.
The evaluation metrics for the experimental models include:
Peak Signal-to-Noise Ratio (PSNR), which quantifies pixel-level reconstruction error, with higher values indicating smaller deviations from the original pixels.
Structural Similarity Index Measure (SSIM), which assesses the preservation of image structure, with values ranging from 0 to 1 (closer to 1 indicates better performance).
Learned Perceptual Image Patch Similarity (LPIPS), which measures similarity in terms of human visual perception, where lower scores correspond to smaller perceptual differences.
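Of the three metrics, PSNR has the simplest closed form, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, and can be sketched in a few lines. The sketch assumes 8-bit grayscale images stored as nested lists and uses made-up pixel values; the function name is hypothetical. SSIM and LPIPS are omitted because they require windowed statistics and a pretrained network, respectively.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images (nested lists)."""
    squared_error = 0.0
    count = 0
    for ref_row, res_row in zip(reference, restored):
        for ref_px, res_px in zip(ref_row, res_row):
            squared_error += (ref_px - res_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up 1 x 2 images: each pixel differs by 10 gray levels, so MSE = 100.
example_psnr = psnr([[100, 100]], [[110, 90]])
```

Higher values indicate a smaller mean squared deviation from the reference image.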
(3) Analysis of Test Results
Figure 7 demonstrates the effect of applying DeblurGAN-v2 for local license plate image processing. In this experiment, the DeblurGAN-v2 model was used to deblur the entire dataset of 3000 samples. The performance was quantified using three standard metrics—Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS)—with the original clear images serving as Ground Truth (GT). The average results, presented in Table 7, are as follows: PSNR = 16.61 dB, SSIM = 0.8776, and LPIPS = 0.1151.

4.2.3. License Plate Recognition Experiment

(1) Test environment
All experiments were conducted under identical hardware and software conditions. The system ran on Windows 11, with an Intel Core i5-14600KF CPU and an NVIDIA GeForce RTX 5060 GPU (8 GB VRAM). The model was trained for 120 epochs with an initial learning rate of 0.001, a weight decay of 0.001, and a batch size of 64.
(2) Test Evaluation Indicators
To comprehensively evaluate the performance and practicality of the proposed license plate recognition model, two key metrics are used for quantitative analysis:
Plate Accuracy (PA), defined as the proportion of license plates in the test set that are recognized with 100% character accuracy, as formulated in Equation (8).
Number of Parameters, reported in millions (M), which quantifies the total trainable parameters. This metric indicates model complexity and deployment cost, serving as a key measure of its lightweight nature.
Additionally, the impact of the GAN-based image enhancement module is evaluated to assess its contribution to the overall recognition performance.
$$PA = \frac{N_{correct}}{N_{total}} \times 100\% \tag{8}$$
where $N_{correct}$ is the number of samples where the predicted character sequence perfectly matches the ground truth, and $N_{total}$ is the total number of test samples.
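Plate Accuracy is an exact-match metric: a plate counts as correct only if every character matches. A minimal sketch follows; the plate strings are made-up placeholders and the function name is hypothetical.

```python
def plate_accuracy(predictions, ground_truths):
    """Percentage of plates whose predicted string equals the ground truth exactly."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not ground_truths:
        raise ValueError("at least one test sample is required")
    n_correct = sum(pred == truth for pred, truth in zip(predictions, ground_truths))
    return 100.0 * n_correct / len(ground_truths)


# Placeholder strings: the third prediction has one wrong character.
predicted = ["A12345", "B67890", "C13579", "D24680"]
actual = ["A12345", "B67890", "C13578", "D24680"]
pa = plate_accuracy(predicted, actual)
```

A single wrong character makes the whole plate count as an error, which is why PA is stricter than character-level accuracy.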
(3) Analysis of Test Results
To evaluate the effectiveness of Mamba-SSM for license plate recognition and the contribution of DeblurGAN-v2 image enhancement, we conducted a comparative experiment under identical settings using three recognition architectures: CRNN (CNN + BiLSTM), LPRNet, and the proposed CMN [35]. The corresponding Plate Accuracy (PA) and parameter counts for each model are summarized in Table 8.
As shown in Table 8, without image enhancement, the proposed CMN achieves a plate accuracy (PA) of 93.3%. This represents a significant improvement over CRNN (91.0%) and LPRNet (88.5%), validating the superiority of the Mamba state-space model for modeling degraded sequences. Its non-recurrent architecture circumvents the gradient vanishing problem inherent in RNNs, enabling more robust capture of long-range character dependencies [17,22], as visualized in Figure 8.
Combined with the GAN enhancement module, the final model reaches the highest accuracy of 93.67%. The results confirm that replacing traditional RNN with Mamba-SSM and introducing GAN enhancement both contribute to performance gain. We further assess model performance under rain–snow, foggy and low-illumination conditions, analyzing how different harsh weather environments affect license plate recognition results.
When GAN-based deblurring preprocessing is applied, the PA of all models increases by approximately 0.3 percentage points. The combined CMN + GAN pipeline ultimately reaches a recognition accuracy of 93.67%, which is 2.67 percentage points higher than the baseline CRNN (91.0%).
To verify the robustness of the proposed method under increasingly severe environmental interference, the test set is divided into three levels: slight, moderate, and severe degradation. The division is based on actual imaging quality, including the degree of motion blur, fog density, noise intensity, and illumination loss in rain, snow, fog, and nighttime scenarios. The experimental results show that under slight degradation, all models achieve relatively high accuracy. As the degradation becomes moderate and severe, the accuracy of CRNN and LPRNet decreases significantly, especially for blurred and missing character strokes. In contrast, the proposed CMN and full CLEI framework maintain stable recognition performance. Under severe degradation, the plate accuracy of CLEI is more than 3% higher than that of CRNN, and the improvement is statistically significant (p < 0.05). This indicates that the DeblurGAN-v2 enhancement module can effectively restore character structures, and the Mamba-based CMN has stronger ability to resist blur and noise than traditional recurrent models. The proposed collaborative framework exhibits superior robustness for license plate recognition in harsh environments.
These results demonstrate that a pipeline combining front-end image enhancement via DeblurGAN-v2 and back-end robust sequence modeling with Mamba-SSM is an effective technical approach for improving the reliability of license plate recognition in harsh urban traffic environments, and provides strong technical support for all-weather vehicle perception in smart cities.

5. Discussion

This study addressed a persistent but insufficiently resolved problem in automatic license plate recognition (ALPR), namely the substantial performance degradation caused by adverse weather and low-illumination conditions [21,36,48]. The results show that CLEI, by integrating detection, enhancement, and sequence recognition into a collaborative framework, achieves more stable end-to-end recognition in severely degraded scenes. Within this framework, YOLOv12s provided reliable plate localization, DeblurGAN-v2 improved the quality of cropped plate regions [49], and the proposed CMN recognizer further strengthened sequence decoding under distorted visual conditions. As reported in the manuscript, the full CLEI framework achieved 93.67% plate accuracy, while CMN outperformed CRNN and LPRNet, indicating that collaborative optimization across stages is more effective than isolated module improvement for harsh-environment ALPR.
A central finding of this work is that the main bottleneck in adverse-environment ALPR is not merely insufficient detector or recognizer capacity in isolation, but the accumulation of mismatches across the pipeline. In many existing systems, the detected plate crop is passed directly to the recognizer even when it remains blurred, low-contrast, or structurally incomplete. Under such conditions, downstream errors become inevitable because the recognizer must infer character identities from damaged local evidence [12,47]. The present results suggest that the gain of CLEI stems from reducing this cross-stage mismatch: detection first stabilizes the target region, enhancement restores recognition-relevant structures, and CMN then exploits contextual dependencies to decode the sequence more robustly. This interpretation is consistent with our manuscript’s core argument that harsh-environment ALPR should be treated as a coupled perception problem rather than a loosely connected sequence of independent subtasks.
The detection results are in line with current ALPR trends. Recent studies continue to rely heavily on one-stage detectors because they offer a favorable balance between localization accuracy and deployment efficiency [12]. The official YOLOv12 paper positions the model as an attention-centric real-time detector that preserves competitive latency while improving accuracy relative to prior YOLO variants, which supports its use as a strong front-end detector in practical recognition pipelines. In this sense, selecting YOLOv12s in the present work is well justified not only by the benchmark results reported in the manuscript, but also by the architecture’s intended balance between speed and representation power [26]. However, this study goes beyond the usual “strong detector + standard recognizer” pattern by showing that better localization alone is not sufficient for degraded scenes; rather, the detector is most valuable when it serves as the entry point to a restoration-aware and recognition-oriented pipeline [30].
The enhancement stage also deserves careful interpretation. DeblurGAN-v2 was originally proposed as an efficient and flexible single-image motion deblurring framework built on a relativistic conditional GAN with a double-scale discriminator and an FPN-based generator [50]. Its design makes it particularly suitable for complex blur patterns, which is relevant to roadside scenarios involving motion blur, low light, and weather-induced visibility degradation. In the present study, DeblurGAN-v2 achieved favorable PSNR, SSIM, and LPIPS values and contributed to a further improvement in recognition accuracy after detection and cropping. This result supports a broader point: in ALPR under severe degradation, enhancement remains necessary, but its utility should be judged by downstream recognition gains rather than perceptual metrics alone. A visually cleaner plate image does not automatically guarantee better sequence decoding, so the positive gain observed here is important because it indicates that the restored details were not merely perceptual, but recognition-relevant.
Another important contribution of this work lies in the recognition stage. CRNN remains a classical baseline for scene text and related sequence recognition tasks, while LPRNet represents a lightweight end-to-end alternative specifically influential in license plate recognition [32]. Yet both architectures can become vulnerable when character boundaries are blurred, strokes are partially missing, or local textures are corrupted. In contrast, the proposed CMN replaces recurrent sequence modeling with a Mamba-based state-space module. The original Mamba paper shows that selective state-space models can perform input-dependent propagation and forgetting while maintaining linear-time scaling, making them attractive for efficient sequence modeling. Although Mamba was not developed specifically for license plate OCR [49], its mechanism is highly relevant to harsh-environment recognition because the model can preserve useful context while suppressing noise. The fact that CMN outperformed both CRNN and LPRNet in our experiments suggests that such selective sequence modeling is beneficial even for short but highly structured sequences like license plates, especially when visual evidence is incomplete.
It should be emphasized that the current experimental evaluation mainly compares with the mainstream recognition models (CRNN, LPRNet) widely used in the ALPR field, which ensures the focus on the core contribution of Mamba-based sequence modeling for degraded license plates. Recently proposed Transformer-based OCR models such as PARSeq [39], ABINet [40], ViTSTR [41], and TrOCR [42] have shown excellent performance in general text recognition. Nevertheless, these models rely on global self-attention mechanisms and are more sensitive to local texture destruction, blur, and noise in severe degradation scenarios. Meanwhile, they have high requirements for input resolution and image clarity, which are not fully compatible with the characteristics of low-quality license plate images collected under complex roadside environments. Therefore, the direct migration and application of these Transformer-based models in harsh environment ALPR require targeted decoder transformation, sequence alignment adjustment, and domain adaptation, which is beyond the research scope of this paper and will be further explored in future work.
From a methodological perspective, the main novelty of CLEI is not the simple combination of three strong modules, but the explicit collaborative logic linking them. Existing ALPR studies in complex scenarios often improve one or two stages, such as using a stronger detector, a lightweight recognizer, or a handcrafted enhancement module. Recent work similarly reports improved performance through combinations such as YOLOv5s/LPRNet or YOLOv8/EasyOCR under complex conditions, but these systems still mostly optimize their modules separately [31,32,33,34]. By contrast, CLEI is based on the premise that plate localization, quality restoration, and sequence decoding should be interpreted as interdependent processes in degraded scenes. This system-level view better explains why moderate gains at multiple stages can translate into a larger improvement in overall plate accuracy [44].
The broader significance of this study is therefore twofold. First, it supports a task-oriented view of restoration in ALPR: enhancement should be evaluated by how much it helps recognition, not only by image fidelity. Second, it provides empirical evidence that Mamba-style state-space modeling can be extended into specialized OCR scenarios beyond the long-sequence settings where it first became prominent [31]. This may be relevant not only for urban license plate recognition, but also for other structured short-text tasks such as meter reading, industrial code recognition, and urban roadside intelligent perception text analysis, which are important components of smart city construction.
The addition of 95% confidence intervals, statistical significance tests, and stratified robustness analysis further confirms the stability and superiority of the proposed CLEI framework. The performance gains over baseline models are not caused by random experimental fluctuations, but reflect the inherent advantages of the collaborative design of YOLOv12 detection, GAN enhancement, and Mamba-based recognition. Especially under severe degradation, the more significant performance improvement demonstrates that the proposed method is more suitable for practical application scenarios with complex interference such as rain, snow, fog, and low illumination.
This study also has practical implications for intelligent transportation deployment. Prior evidence shows that simulated weather distortions and camera read noise can drastically reduce ALPR performance, in some cases driving accuracy close to failure. Under these conditions, simply increasing detector capacity may not adequately address the end-to-end recognition problem [26]. The CLEI framework offers a more deployment-oriented solution by strengthening each stage according to its functional role in degraded scenes: YOLOv12s for reliable localization, DeblurGAN-v2 for structural recovery, and CMN for robust decoding [31]. This makes the framework particularly relevant for roadside monitoring, parking access control, urban surveillance, and other applications where environmental instability is unavoidable.

6. Conclusions

We proposed CLEI, a collaborative license plate recognition framework for complex degradation scenarios, integrating YOLOv12s detection, DeblurGAN-v2 enhancement, and CMN recognition into an end-to-end pipeline. All experimental results are supplemented with 95% confidence intervals, statistical significance tests, and detailed robustness analysis under different degradation levels. CLEI significantly improved overall recognition accuracy under rain, snow, fog, and nighttime conditions, demonstrating the effectiveness of coordinated design across modules and highlighting the advantages of Mamba-based sequence modeling for degraded character recognition.
From an application perspective, CLEI provides a reliable solution for all-weather license plate recognition in urban intelligent transportation systems and smart city construction. The enhancement module not only improved visual quality but also restored structural details critical for recognition, offering stable data support for urban traffic management, tolling, parking supervision, public security monitoring and urban refined governance, and providing a practical reference for optimizing urban smart transportation infrastructure. Overall, the proposed method yields promising results under the tested scenarios; however, further extensive experiments are needed to demonstrate its full robustness and generalization ability.
Despite these achievements, limitations remain. First, the current implementation remains a semi-automatic (human-in-the-loop) system rather than a fully automatic pipeline, and it still requires human involvement in key steps. Second, the dataset scale and scene coverage are limited: although the proposed method has been validated on publicly available datasets covering typical adverse weather conditions, validation on additional diverse public datasets and cross-domain scenarios is still needed to demonstrate the generalization ability of the framework. Third, the enhancement module has not yet been jointly optimized with recognition, and fine-grained analyses of character-level errors and edge-device performance are insufficient. Fourth, the current version does not include a systematic comparison with recent Transformer-based scene text recognition models such as PARSeq, ABINet, ViTSTR, and TrOCR. Future work will focus on cross-domain and multi-device validation, joint training strategies, and recognition-oriented enhancement to further improve system robustness and deployment readiness. We will also conduct fair and comprehensive comparisons on unified harsh-environment datasets, analyze how Mamba-SSM compares with Transformer self-attention mechanisms on low-quality license plate sequences, and design lightweight domain-adapted versions of these Transformer-based models to further assess the generalization of the proposed CMN and CLEI framework.

Author Contributions

F.T. contributed to conceptualization, methodology, original draft preparation, and overall project administration. L.C. contributed to experimental design, data curation, software implementation, and formal analysis. L.Z. contributed to methodology refinement, manuscript review and editing, supervision, and project coordination. Y.N. contributed to data collection, experimental validation, and result organization. J.Y. contributed to engineering application analysis, resource coordination, and partial result verification. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. 52302429; 52411540233), the Natural Science Foundation of Hunan Province (Grant No. 2024JJ6038), the Open Fund of Engineering Research Center of Catastrophic Prophylaxis and Treatment of Road & Traffic Safety of Ministry of Education (Changsha University of Science & Technology) (Grant No. kfj220403), and the Science and Technology Project of Open Bidding for Selecting the Best Candidates of the Hunan Provincial Department of Transportation (No. 202604).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank Changsha University of Science & Technology for providing research and experimental support. The authors also appreciate the valuable comments and suggestions from the editors and reviewers.

Conflicts of Interest

Author Jian Yang was employed by the company Guangxi Communication Investment Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Zhou, Y.; Peng, S.; Lyu, H.; Tong, F.; Huang, C.; Niu, J. KLOTSKI: Towards Consensus Enabled Collaborative Vehicles in Intelligent Transportation. IEEE Trans. Veh. Technol. 2025, 74, 18620–18634.
  2. Sonnara, F.; Chihaoui, H.; Filali, F. Efficient real-time license plate recognition using deep learning on edge devices. J. Real-Time Image Process. 2025, 22, 159.
  3. Mustafa, T.; Karabatak, M. Real time car model and plate detection system by using deep learning architectures. IEEE Access 2024, 12, 107616–107630.
  4. Solla, M.; Pérez-Gracia, V.; Fontul, S. A review of GPR application on transport infrastructures: Troubleshooting and best practices. Remote Sens. 2021, 13, 672.
  5. Bensouilah, M.; Zennir, M.N.; Taffar, M. An ALPR System-based Deep Networks for the Detection and Recognition. In Proceedings of the ICPRAM, Virtual, 4–6 February 2021; pp. 204–211.
  6. Wang, J.; Wu, Z.; Liang, Y.; Tang, J.; Chen, H. Perception methods for adverse weather based on vehicle infrastructure cooperation system: A review. Sensors 2024, 24, 374.
  7. Shashirangana, J.; Padmasiri, H.; Meedeniya, D.; Perera, C. Automated license plate recognition: A survey on methods and techniques. IEEE Access 2020, 9, 11203–11225.
  8. Ali, M.; Zhang, Z. The YOLO framework: A comprehensive review of evolution, applications, and benchmarks in object detection. Computers 2024, 13, 336.
  9. Ismail, A.; Mehri, M.; Sahbani, A.; Amara, N.E.B. Automatic license plate recognition in in-the-wild scenarios: A comprehensive review, open issues, and future directions. IEEE Access 2025, 13, 145387–145415.
  10. Qin, H.; Wang, D.; Cai, Z.; Zeng, J. Real-Time Traffic Arrival Prediction for Intelligent Signal Control Using a Hidden Markov Model-Filtered Dynamic Platoon Dispersion Model and Automatic License Plate Recognition Data. Appl. Sci. 2025, 15, 11537.
  11. Meesad, P.; Thumthong, W. Advanced deep learning techniques for automated license plate recognition. Sci. Rep. 2025, 15, 41194.
  12. Sapkota, R.; Flores-Calero, M.; Qureshi, R.; Badgujar, C.; Nepal, U.; Poulose, A.; Zeno, P.; Vaddevolu, U.B.P.; Khan, S.; Shoman, M. YOLO advances to its genesis: A decadal and comprehensive review of the You Only Look Once (YOLO) series. Artif. Intell. Rev. 2025, 58, 274.
  13. Tang, Y.; Zhang, S.; Liu, W.; Wang, Z.; Wang, J. Ultra-lightweight automatic license plate recognition system for microcontrollers: A cost-effective and energy-efficient solution. IEEE Trans. Intell. Transp. Syst. 2024, 25, 20419–20434.
  14. Li, W.; Pun, C.-M. A single-target license plate detection with attention. In Proceedings of the 2022 International Workshop on Advanced Imaging Technology (IWAIT); SPIE: Bellingham, WA, USA, 2022; pp. 396–401.
  15. Yogheedha, K.; Nasir, A.; Jaafar, H.; Mamduh, S. Automatic vehicle license plate recognition system based on image processing and template matching approach. In Proceedings of the 2018 International Conference on Computational Approach in Smart Systems Design and Applications (ICASSDA); IEEE: New York, NY, USA, 2018; pp. 1–8.
  16. Samantaray, M.; Biswal, A.K.; Singh, D.; Samanta, D.; Karuppiah, M.; Joseph, N.P. Optical character recognition (OCR) based vehicle’s license plate recognition system using Python and OpenCV. In Proceedings of the 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA); IEEE: New York, NY, USA, 2021; pp. 849–853.
  17. Chowdhury, D.; Mandal, S.; Das, D.; Banerjee, S.; Shome, S.; Choudhary, D. An adaptive technique for computer vision based vehicles license plate detection system. In Proceedings of the 2019 International Conference on Opto-Electronics and Applied Optics (Optronix); IEEE: New York, NY, USA, 2019; pp. 1–6.
  18. Menon, A.; Omman, B. Detection and recognition of multiple license plate from still images. In Proceedings of the 2018 International Conference on Circuits and Systems in Digital Enterprise Technology (ICCSDET); IEEE: New York, NY, USA, 2018; pp. 1–5.
  19. Mathur, M.; Devi, G.U.; Chauhan, S. Robust License Plate Detection and Recognition in Adverse Weather Using Drone-Captured Data and a Hybrid YOLOv8-Swin-BiFAN Framework. In Proceedings of the 2025 13th International Conference on Intelligent Embedded, MicroElectronics, Communication and Optical Networks (IEMECON); IEEE: New York, NY, USA, 2025; pp. 1–6.
  20. Rahmani, M.; Sabaghian, M.; Moghadami, S.M.; Talaie, M.M.; Naghibi, M.; Keyvanrad, M.A. IR-LPR: A large scale Iranian license plate recognition dataset. In Proceedings of the 2022 12th International Conference on Computer and Knowledge Engineering (ICCKE); IEEE: New York, NY, USA, 2022; pp. 053–058.
  21. Al-Shemarry, M.S.; Li, Y.; Abdulla, S. An efficient texture descriptor for the detection of license plates from vehicle images in difficult conditions. IEEE Trans. Intell. Transp. Syst. 2019, 21, 553–564.
  22. Khanam, R.; Hussain, M. A review of YOLOv12: Attention-based enhancements vs. previous versions. arXiv 2025, arXiv:2504.11995.
  23. Zhang, C.; Zhang, C.; Wang, S.; Dong, Y.; Guan, X.; Fan, H.; Zhao, R.; Xu, G. EVF-YOLO: A lightweight network for license plate detection under severe weather conditions. In Proceedings of the 2024 International Conference on Intelligent Computing; Springer: Singapore, 2024; pp. 131–142.
  24. Sugiharto, A.; Kusumaningrum, R. Enhanced automatic license plate detection and recognition using CLAHE and YOLOv11 for seat belt compliance detection. Eng. Technol. Appl. Sci. Res. 2025, 15, 20271–20278.
  25. Na, K.; Park, G.; Kim, I. CharDiff-LP: A Diffusion Model with Character-Level Guidance for License Plate Image Restoration. arXiv 2025, arXiv:2510.17330.
  26. Hu, L.; Zeng, W.; Cai, Z.; Guo, J.; Wang, Y.; Wei, N. Research and implementation of a license plate detection algorithm for complex environments. In Proceedings of the 2023 8th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA); IEEE: New York, NY, USA, 2023; pp. 551–555.
  27. Nguyen, C.; Tran, D.T.; Nguyen, H.; Phan, X.-V.; Nguyen, N.-P. VRAE: Vertical Residual Autoencoder for License Plate Denoising and Deblurring. arXiv 2025, arXiv:2509.08392.
  28. Zhu, C.; Zhang, H.; He, M.; Li, Y.; Qiao, X. Nighttime Hazy Image Enhancement via Progressively and Mutually Reinforcing Night-Haze Priors. arXiv 2026, arXiv:2601.01998.
  29. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27.
  30. Song, Y.; He, Z.; Qian, H.; Du, X. Vision transformers for single image dehazing. IEEE Trans. Image Process. 2023, 32, 1927–1941.
  31. Xu, G.; Ke, Z.; Zuo, P.; Lei, B. TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition. arXiv 2025, arXiv:2507.17335.
  32. Qin, C.; Schlemper, J.; Caballero, J.; Price, A.N.; Hajnal, J.V.; Rueckert, D. Convolutional recurrent neural networks for dynamic MR image reconstruction. IEEE Trans. Med. Imaging 2018, 38, 280–290.
  33. Tomonaga, S.; Doya, K.; Murata, N. Lag Operator SSMs: A Geometric Framework for Structured State Space Modeling. arXiv 2025, arXiv:2512.18965.
  34. Shim, S.-O.; Imtiaz, R.; Siddiq, A.; Khan, I.R. License plates detection and recognition with multi-exposure images. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 300–305.
  35. Li, Z. A method for license plate recognition in low-resolution conditions. In Proceedings of the 2025 ITM Web of Conferences; EDP Sciences: Les Ulis, France, 2025; p. 02028.
  36. Joshi, S.; Jejure, P.; Jadhav, C.; Jankar, V.; Mote, A. Automatic Number Plate Recognition Using YOLOv8 Model. Int. J. Sci. Res. Sci. Technol. 2025, 12, 1088–1097.
  37. Tisanarada, A.; Giap, Y.C. Optical Character Recognition (OCR) Of License Plates Using the KNN Method. Indones. Appl. Res. Comput. Inform. 2025, 1, 10–19.
  38. Shabaninia, E.; Asadi-zeydabadi, F.; Nezamabadi-pour, H. Layout-Independent License Plate Recognition via Integrated Vision and Language Models. arXiv 2025, arXiv:2510.10533.
  39. Du, Y.; Chen, Z.; Jia, C.; Yin, X.; Li, C.; Du, Y.; Jiang, Y.-G. Context perception parallel decoder for scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4668–4683.
  40. Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; Zhang, Y. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE Computer Society: Los Alamitos, CA, USA, 2021; pp. 7098–7107.
  41. Atienza, R. Vision transformer for fast and efficient scene text recognition. In Proceedings of the 2021 International Conference on Document Analysis and Recognition; Springer: Berlin/Heidelberg, Germany, 2021; pp. 319–334.
  42. Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Zhang, C.; Li, Z.; Wei, F. TrOCR: Transformer-based optical character recognition with pre-trained models. In Proceedings of the 2023 AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2023; pp. 13094–13102.
  43. Amin, A.; Mumtaz, R.; Bashir, M.J.; Zaidi, S.M.H. Next-generation license plate detection and recognition system using YOLOv8. In Proceedings of the 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET); IEEE: New York, NY, USA, 2023; pp. 179–184.
  44. Vargoorani, Z.E.; Ghoreyshi, A.M.; Suen, C.Y. Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8. In Proceedings of the 2025 IEEE 35th International Workshop on Machine Learning for Signal Processing (MLSP); IEEE: New York, NY, USA, 2025; pp. 1–6.
  45. Shafiezadeh, B.; Mashmool, A.; Eshghi, F.; Kelarestaghi, M. A new Hybrid Model of Generative Adversarial Network and You Only Look Once Algorithm for Automatic License-Plate Recognition. arXiv 2025, arXiv:2509.06868.
  46. Viswanathan, K.; Goel, V.; Gholap, S.; Ghosh, D.; Gupta, M.; Ganatra, D.; Potdar, S.; Sethi, A. FANVID: A Benchmark for Face and License Plate Recognition in Low-Resolution Videos. arXiv 2025, arXiv:2506.07304.
  47. Vargoorani, Z.E.; Suen, C.Y. License plate detection and character recognition using deep learning and font evaluation. In Proceedings of the 2024 IAPR Workshop on Artificial Neural Networks in Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2024; pp. 231–242.
  48. Rahman, M.J.; Beauchemin, S.S.; Bauer, M.A. License plate detection and recognition: An empirical study. In Proceedings of the 2019 Science and Information Conference; Springer: Cham, Switzerland, 2019; pp. 339–349.
  49. Moussaoui, H.; Akkad, N.E.; Benslimane, M.; El-Shafai, W.; Baihan, A.; Hewage, C.; Rathore, R.S. Enhancing automated vehicle identification by integrating YOLO v8 and OCR techniques for high-precision license plate detection and recognition. Sci. Rep. 2024, 14, 14389.
  50. Ismagilov, T.; Ferrarini, B.; Milford, M.; Tuyen, N.T.V.; Ramchurn, S.D.; Ehsan, S. On motion blur and deblurring in visual place recognition. IEEE Robot. Autom. Lett. 2025, 10, 4746–4753.
Figure 1. Scene Images of Roadside Facility License Plate Detection Under Severe Weather.
Figure 2. Overall architecture of the proposed CLEI framework.
Figure 3. Structure Diagram of YOLOv12 Model.
Figure 4. Structural Diagram of Mamba-SSM.
Figure 5. Schematic Diagram of CMN Module Architecture.
Figure 6. Partial license plate images in rainy, snowy, foggy weather and low-light environments at night.
Figure 7. Image deblurred using DeblurGAN-v2.
Figure 8. Recognition results of CRNN (left) and CMN (right).
Table 1. Summary of Research on Detection Enhancement and Recognition.
References | Main Objective | Methods
Rahmani et al., 2022 [20] | Summary of Research on Detection Enhancement and Recognition | YOLOv11 Positioning + CLAHE Enhancement + YOLOv11 Recognition
Zhu et al., 2026 [28] | Solve the problem of license plate detection in monitoring scenarios | Optimize the C2f, SPPF, detection head and loss function of YOLOv8n
Wang et al., 2024 [6] | Improve the license plate recognition accuracy in complex scenarios | Optimize YOLOv5s and LPRNet, and add attention and correction modules
Li, 2025 [35] | Improve the overall security of multi-image steganography | Multi-scale texture evaluation + image enhancement + adversarial embedding
Xu et al., 2025 [31] | Solve the problems of blurry laser stripe images and reduced measurement accuracy caused by vibration | Improved DeblurGAN: Incorporate HDC, RRDB, Skip Connections and L1 Loss
Joshi et al., 2025 [36] | Achieve high-precision and robust automatic license plate recognition | YOLOv8 Detection + Image Preprocessing + Character Segmentation + OCR
Amin et al., 2023 [43] | Real-time high-precision recognition of complex Korean license plates at the edge terminal | Lightweight Model (SSD-lite/YOLOv7-tiny) + Model Compression
Qin et al., 2018 [32] | Build a High-precision and Secure License Plate Recognition System with AI and PyTorch | Implement the entire process of image detection and LPR based on PyTorch, including collection, preprocessing, positioning, segmentation and recognition
Table 2. Comparison of Detection Results of Different Models on the Overall Dataset.
Model | P/% | R/% | F1/% | mAP50/% | mAP50-95/% | Params (M) | GFLOPs
YOLOv5s | 99.64 | 1 | 99.82 | 99.50 | 96.22 | 7.01 | 15.8
YOLOv8s | 99.82 | 99.83 | 99.82 | 99.50 | 97.27 | 11.13 | 28.4
YOLOv10s | 99.67 | 99.55 | 99.61 | 99.50 | 96.62 | 7.22 | 21.4
YOLOv12s | 99.99 | 1 | 99.99 | 99.50 | 96.78 | 9.11 | 19.3
RT-DETR | 98.11 | 96.73 | 97.42 | 98.88 | 96.79 | 42.71 | 78.32
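The F1 column in Table 2 can be cross-checked against the precision and recall columns using the standard harmonic-mean definition, F1 = 2PR / (P + R). The sketch below is a reader-side consistency check, not the authors' evaluation code; the helper name `f1_score` is hypothetical, and it assumes that a recall printed as "1" denotes 100%.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, copied from Table 2.
# Assumption: a recall printed as "1" in the table means 100%.
TABLE_2 = {
    "YOLOv5s": (99.64, 100.0, 99.82),
    "YOLOv8s": (99.82, 99.83, 99.82),
    "YOLOv10s": (99.67, 99.55, 99.61),
    "YOLOv12s": (99.99, 100.0, 99.99),
    "RT-DETR": (98.11, 96.73, 97.42),
}

for name, (p, r, reported) in TABLE_2.items():
    computed = f1_score(p, r)
    print(f"{name}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")
```

Under that reading of the recall values, every computed F1 agrees with the reported figure to within rounding.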
Table 3. Comparison of Detection Results of Different Models in Rainy and Snowy Weather.
Model | P/% | R/% | mAP50/% | mAP50-95/% | Latency (ms) | FPS
YOLOv5s | 99.9 | 1 | 99.50 | 98 | 6.5 | 154
YOLOv8s | 99.9 | 1 | 99.50 | 98.2 | 9.1 | 110
YOLOv10s | 99.6 | 99.9 | 99.50 | 98 | 8.5 | 118
YOLOv12s | 99.9 | 1 | 99.50 | 97.8 | 4.8 | 208
Table 4. Comparison of Detection Results of Different Models in Foggy Weather.
Model | P/% | R/% | mAP50/% | mAP50-95/% | Latency (ms) | FPS
YOLOv5s | 1 | 1 | 99.50 | 97.7 | 6.6 | 152
YOLOv8s | 1 | 1 | 99.50 | 98.1 | 9.6 | 104
YOLOv10s | 99.6 | 99.7 | 99.50 | 97.3 | 8.9 | 112
YOLOv12s | 1 | 1 | 99.50 | 97.3 | 5.3 | 189
Table 5. Comparison of Detection by Different Models in Low-Light Environments.
Model | P/% | R/% | mAP50/% | mAP50-95/% | Latency (ms) | FPS
YOLOv5s | 1 | 1 | 99.50 | 98.2 | 6.8 | 147
YOLOv8s | 1 | 1 | 99.50 | 98.5 | 9.2 | 109
YOLOv10s | 99.8 | 99.9 | 99.50 | 98 | 8.7 | 115
YOLOv12s | 1 | 1 | 99.50 | 97.8 | 4.9 | 204
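The latency and FPS columns in Tables 3 to 5 appear to follow the usual single-stream relation FPS = 1000 / latency (ms). The sketch below illustrates that relation with the YOLOv12s rows; it is an assumption about how the two columns relate, since the text does not state how FPS was measured, and the helper name `fps_from_latency` is hypothetical.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Frames per second implied by a per-image latency given in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# (latency in ms, reported FPS) for YOLOv12s, copied from Tables 3, 4 and 5.
YOLOV12S = {
    "rain/snow": (4.8, 208),
    "fog": (5.3, 189),
    "low light": (4.9, 204),
}

for scene, (latency_ms, reported_fps) in YOLOV12S.items():
    implied = fps_from_latency(latency_ms)
    print(f"{scene}: implied FPS = {implied:.1f}, reported FPS = {reported_fps}")
```

The implied values round to the reported FPS in all three scenarios, which suggests the FPS column was derived from per-image latency rather than measured as batched throughput.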
Table 6. General Evaluation Criteria.
Indicator | Poor | Medium | Good | Excellent
PSNR (dB) | <15 | 15~20 | 20~25 | >25
SSIM | <0.7 | 0.7~0.8 | 0.8~0.9 | >0.9
LPIPS | >0.3 | 0.2~0.3 | 0.1~0.2 | <0.1
Table 7. Deblurring Experimental Results.
Indicator | Test Results | Corresponding Level
PSNR (dB) | 16.61 dB | Medium
SSIM | 0.8776 | Good
LPIPS | 0.1151 | Good
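The mapping from a measured metric to the levels of Table 6 can be written as a small lookup. The sketch below encodes the Table 6 thresholds and applies them to the Table 7 results; the function name `quality_level` is hypothetical, and the treatment of values that fall exactly on a threshold is an assumption, because the table does not say which band a boundary value belongs to.

```python
from bisect import bisect_right

LEVELS = ("Poor", "Medium", "Good", "Excellent")

# Band boundaries from Table 6, in ascending order.
THRESHOLDS = {
    "PSNR": (15.0, 20.0, 25.0),   # dB, higher is better
    "SSIM": (0.7, 0.8, 0.9),      # higher is better
    "LPIPS": (0.1, 0.2, 0.3),     # lower is better
}
LOWER_IS_BETTER = {"LPIPS"}


def quality_level(metric: str, value: float) -> str:
    """Return the Table 6 level for a metric value.

    Assumption: a value exactly on a threshold is placed in the band above it.
    """
    band = bisect_right(THRESHOLDS[metric], value)  # 0..3, counted from the low end
    if metric in LOWER_IS_BETTER:
        band = len(LEVELS) - 1 - band  # reverse the order for LPIPS
    return LEVELS[band]


# Results reported in Table 7.
for metric, value in (("PSNR", 16.61), ("SSIM", 0.8776), ("LPIPS", 0.1151)):
    print(f"{metric} = {value}: {quality_level(metric, value)}")
```

Applied to the reported values, the lookup gives Medium for PSNR and Good for both SSIM and LPIPS, matching the levels listed in Table 7.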
Table 8. Performance Comparison of Different Recognition Models on the Total Dataset.
Model Configuration | GAN | Mamba-SSM | Acc (%) | Params (k)
CRNN | × | × | 91.00 ± 0.51 | 638
CRNN | ✓ | × | 91.33 ± 0.48 | --
LPRNet | × | × | 88.50 ± 0.55 | 486
LPRNet | ✓ | × | 88.83 ± 0.53 | --
Ours (CMN) | × | ✓ | 93.30 ± 0.42 | 718
Ours (Full Model) | ✓ | ✓ | 93.67 ± 0.41 | --
All accuracy values are reported as mean ± 95% confidence interval over five independent runs. The improvements of the proposed method are statistically significant (p < 0.05) compared with CRNN and LPRNet.
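One common way to obtain a "mean ± 95% confidence interval" from five runs is a Student-t interval on the sample mean. The sketch below shows that calculation; it is an assumption about the procedure, since the note above does not specify how the interval was computed, and the run accuracies used here are hypothetical placeholders, not values from this study.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (five runs).
T_CRIT_95_DF4 = 2.776


def mean_and_ci95(run_accuracies: list[float]) -> tuple[float, float]:
    """Return (mean, half-width of the 95% t-interval) for exactly five runs."""
    if len(run_accuracies) != 5:
        raise ValueError("this sketch hard-codes the critical value for five runs")
    mean = statistics.mean(run_accuracies)
    # Standard error of the mean, using the sample standard deviation (n - 1).
    sem = statistics.stdev(run_accuracies) / math.sqrt(len(run_accuracies))
    return mean, T_CRIT_95_DF4 * sem


# Hypothetical accuracies (%) from five independent runs, for illustration only.
example_runs = [90.0, 91.0, 92.0, 93.0, 94.0]
mean, half_width = mean_and_ci95(example_runs)
print(f"{mean:.2f} ± {half_width:.2f}")
```

With only five runs the t critical value (2.776) is noticeably larger than the normal-approximation value of 1.96, so the choice of method affects the reported interval width.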
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tang, F.; Chen, L.; Zeng, L.; Nie, Y.; Yang, J. Urban Intelligent Transportation-Oriented License Plate Recognition Model for Severe Environments Based on Hybrid Architecture of YOLOv12, GAN and Mamba-SSM. Urban Sci. 2026, 10, 325. https://doi.org/10.3390/urbansci10060325
