Article

A Deep Learning Framework of Super Resolution for License Plate Recognition in Surveillance System

Institute of Computer Science and Engineering, Department of Electrical and Computer Engineering, National Yang Ming Chiao Tung University, Hsinchu Campus, Hsinchu 30010, Taiwan
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(10), 1673; https://doi.org/10.3390/math13101673
Submission received: 14 April 2025 / Revised: 14 May 2025 / Accepted: 16 May 2025 / Published: 20 May 2025
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

Recognizing low-resolution license plates from real-world scenes remains a challenging task. While deep learning-based super-resolution methods have been widely applied, most existing datasets rely on artificially degraded images, and common quality metrics correlate poorly with OCR accuracy. We construct a new paired low- and high-resolution license plate dataset from dashcam videos and propose a specialized super-resolution framework for license plate recognition. Only low-resolution images in which at least five characters are correctly recognized by OCR are used, to ensure sufficient feature information for effective perceptual learning. We analyze existing loss functions and introduce two novel perceptual losses: one CNN-based and one Transformer-based. Our approach improves recognition performance, achieving an average OCR accuracy of 85.14%.

1. Introduction

In real-world scenarios, police rely on license plate recognition (LPR) [1,2] for crime investigations, such as identifying hit-and-run vehicles, tracking suspect cars, and other law enforcement applications. However, LPR accuracy is often compromised by motion blur and low-resolution images resulting from fast-moving vehicles [3]. To overcome these challenges, super-resolution techniques serve as a vital image preprocessing step, enhancing the clarity of license plate images and improving recognition performance. For instance, in hit-and-run accidents [4], LPR can be used to identify fleeing vehicles by extracting license plate details from traffic or security camera footage, even when the image is degraded by motion blur [5], poor lighting conditions [6] (low light, shadows, or overexposure), low camera resolution [7], oblique viewing angles [8], or adverse weather conditions (rain, fog, or snow) [9]. Similarly, in suspect vehicle tracking [10], police surveillance systems equipped with LPR can cross-check captured license plates against criminal databases, enabling real-time alerts and swift action when a suspect car is detected in monitored areas (Figure 1).
Despite these advancements, current public datasets for license plate recognition primarily consist of synthetic low-resolution images with unrealistic noise, limiting their effectiveness in real-world applications. Additionally, the evaluation of super-resolution models for LPR remains constrained by traditional image quality assessment metrics, such as the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) [11,12], which may not fully capture the impact of image enhancement on recognition accuracy.
To address these limitations, we propose a framework that leverages real-world dashcam (black box) video footage to construct a more realistic dataset. Specifically, we capture high-resolution images (IHR) from near-distance recordings and low-resolution images (ILR) from far-distance recordings, forming paired training data with license plate characters as ground-truth labels (Figure 2). To ensure sufficient feature information for effective perceptual learning, only low-resolution images in which the OCR recognizer correctly identifies at least five characters are used.
Using this dataset, the super-resolution model SwinFIR [13] is trained to enhance low-resolution images by leveraging a combination of pixel-wise Mean Squared Error (MSE) loss and perceptual losses [14]. This approach preserves both structural details and visual fidelity during backpropagation. The restored high-resolution images are then processed by a pretrained CRNN-based OCR recognizer [15], improving license plate character extraction accuracy.
To train the super-resolution model for reconstructing license plate characters, five types of perceptual losses are utilized: multi-task MSE loss for OCR [16], Deep Image Structure and Texture Similarity (DISTS) [17] loss with VGG16, VGG loss with VGG19 [18], loss with Swin Transformer [19], and loss with a Convolutional Recurrent Neural Network (CRNN) [20], along with ensemble models combining these losses. Training with a single perceptual loss improved character recognition accuracy from 75.75% to 82.57% (a 7% increase). Furthermore, the ensemble of Swin Transformer and DISTS perceptual losses further boosted accuracy to 85.14%, achieving a 9.75% improvement.
This study makes four key contributions:
  1. A surveillance-system framework with an extendable dataset, license plate super resolution, and license plate detection (Figure 3).
  2. A novel LR-HR paired license plate dataset sourced from real-world driving scenes, which can be shared for further research purposes.
  3. A detailed analysis of the performance of various loss functions in the super-resolution task, tailored to improve license plate text recognition outcomes.
  4. A demonstration of the significant enhancement in visualization and text recognition results on license plates achieved through the effective combination of different loss functions.

2. Related Work

2.1. License Plates Dataset

There are various public license plate datasets, including LSV-LP [21], CCPD [22], SSIG-SegPlates [23], UFPR-ALPR [24], and RodoSol [25], each designed for different applications in license plate recognition (LPR) and automatic license plate recognition (ALPR). These datasets vary in terms of image quantity, geographic region, distance variability, tilt angle, motion blur, illumination conditions, and resolution.
In Table 1, we compare these datasets based on key attributes, such as the number of images, the country of origin, variability in distance and tilt angles, the presence of motion blur, illumination conditions, and image resolution. The LSV-LP dataset stands out due to its large dataset size, high variability in camera angles and distances, and challenging real-world conditions, making it a strong benchmark for super-resolution and LPR models.
From the comparison, we can see that LSV-LP provides the largest number of images and the most diverse range of real-world conditions, making it well-suited for evaluating deep learning-based super-resolution and LPR models. Unlike some datasets that rely on synthetically generated distortions, LSV-LP captures real-world variations in terms of distance, angle, and motion blur, which are crucial for practical deployment.
  • Distance variability: The dataset includes images captured from close range (e.g., toll booths) to long-range surveillance (e.g., highway cameras at 50+ meters).
  • Tilt angle variability: Plates are captured from various viewpoints, including frontal, oblique, extreme tilt angles, and even partially occluded perspectives due to real-world road conditions.
  • Motion blur: Unlike artificially blurred datasets, LSV-LP contains natural motion blur caused by high-speed vehicles, making it a realistic benchmark for LPR models operating under dynamic conditions.
These characteristics make LSV-LP particularly suitable for training and evaluating super-resolution techniques, ensuring that models are robust to real-world distortions encountered in traffic monitoring, law enforcement, and intelligent transportation systems.

2.2. License Plate Recognition with OCR Recognizer

License Plate Recognition (LPR) using Optical Character Recognition (OCR) heavily relies on high-resolution images, as they are crucial for accurate character recognition. Multi-task OCR models [26,27], which address tasks such as plate detection, character recognition, and plate classification, benefit significantly from high-resolution images. These images allow the models to capture finer details, leading to the better localization and recognition of characters, which is essential for LPR. Such models typically use shared CNN backbones for feature extraction, followed by task-specific branches for each task. The increased image resolution helps in extracting more precise features, enhancing the performance of both detection and recognition tasks. However, training these models requires substantial computational resources due to their complexity. PaddleOCR [28] is particularly effective with high-resolution images, as it can better handle multilingual and multi-oriented text, improving its overall performance in license plate recognition. Furthermore, CRNN [29] models are specifically designed for recognizing characters in sequences, making them ideal for OCR tasks. The combination of convolutional layers for feature extraction and recurrent layers for sequence modeling allows CRNNs to recognize text more accurately, especially when trained on high-resolution images that provide more detailed character information. In conclusion, OCR recognizers, including Multi-task OCR, PaddleOCR, and CRNN-based models, require high-resolution images for optimal character recognition, as they capture the necessary details for accurate LPR performance.
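To make the CRNN structure concrete, the following is a minimal sketch of a CRNN-style recognizer in PyTorch: a convolutional backbone collapses the plate image into a horizontal feature sequence, which a bidirectional LSTM decodes into per-timestep character logits suitable for CTC training. The layer sizes and character-set size are illustrative assumptions, not the configuration of the pretrained recognizer used later in this work.

```python
# Minimal CRNN-style recognizer sketch (illustrative dimensions, not the
# pretrained model used in this work).
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # CNN backbone: collapses the 40-pixel height to a single feature row.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 40x110 -> 20x55
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 20x55 -> 10x27
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),                               # 1 x W'
        )
        # Bidirectional LSTM models the left-to-right character sequence.
        self.rnn = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)  # num_classes includes the CTC blank

    def forward(self, x):
        f = self.cnn(x)                     # (B, 256, 1, W')
        f = f.squeeze(2).permute(0, 2, 1)   # (B, W', 256): one step per horizontal slice
        seq, _ = self.rnn(f)
        return self.fc(seq)                 # (B, W', num_classes) logits for CTC decoding

model = TinyCRNN(num_classes=69)            # e.g. digits + letters + provinces + blank (assumed)
logits = model(torch.randn(2, 3, 40, 110))  # two dummy 110 x 40 plate crops
print(logits.shape)
```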

2.3. License Plate Restoration with Super Resolution (SR)

Super-resolution techniques [30,31,32] strive to enhance the quality of images by generating higher-resolution versions from lower-resolution inputs. They employ both traditional methods, such as bilinear or bicubic interpolation, and more recent deep-learning approaches. Traditional techniques employ mathematical formulas to estimate pixel values between neighboring pixels with similar colors, whereas deep learning treats super resolution as a regression problem. In this paradigm, a deep neural network is trained to forecast high-resolution images from low-resolution inputs by learning from pairs of low-resolution (LR) and high-resolution (HR) images. The network endeavors to minimize the disparity between its predictions and the actual high-resolution images.
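As a point of reference, the traditional interpolation baseline requires no training at all; a short sketch using OpenCV's bicubic resize (an illustrative library choice) is shown below. A learned SR network replaces this fixed operator with a regression model trained on LR/HR pairs.

```python
# Traditional-interpolation baseline: upscale an LR plate crop with bicubic
# interpolation (no learning involved), for comparison against a learned SR model.
import numpy as np
import cv2

lr = (np.random.rand(20, 55, 3) * 255).astype(np.uint8)   # stand-in LR crop
hr_size = (110, 40)                                        # (width, height) target
sr_bicubic = cv2.resize(lr, hr_size, interpolation=cv2.INTER_CUBIC)
print(sr_bicubic.shape)                                    # (40, 110, 3)
```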

2.4. Perceptual Loss for Super Resolution

Perceptual loss is a technique used in image super resolution that focuses on high-level features rather than pixel-wise differences like MSE or MAE. Traditional pixel-wise loss functions often fail to capture perceptual quality, leading to blurry or unrealistic results. Perceptual loss compares feature maps extracted from pre-trained neural networks, such as VGG, which are capable of capturing complex structures and semantic content [28,29]. This helps maintain important textures and structures, improving the visual quality of the generated image.
Perceptual loss is often combined with other loss functions like pixel-wise loss to balance feature alignment and pixel accuracy. By using models like VGG, it ensures the generated image is perceptually closer to the ground truth [30].
In addition to traditional methods, advanced techniques can also be used as perceptual loss functions: a multi-task MSE loss computed with a CNN OCR model [16,26], the Deep Image Structure and Texture Similarity (DISTS) loss [17] with VGG16 [33], a VGG loss with VGG19 [33], a loss based on the Swin Transformer [19], and a loss based on a Convolutional Recurrent Neural Network (CRNN) [34]. These methods help preserve both high-level content and fine details across different image features; a minimal sketch of the common feature-based pattern follows the list below.
  • The multi-task MSE loss function supports any OCR model for LPR, allowing the seamless integration of new models. The approach follows the multi-task model proposed by Gonçalves et al. [26] for its efficiency. By combining MSE and L1 loss, it balances structural preservation and error minimization—MSE maintains image integrity, while L1 enhances robustness and edge sharpness.
  • DISTS combines structure and texture similarity, preserving both high-level content and fine details.
  • Swin Transformer captures long-range dependencies, offering a strong perceptual loss by maintaining high-level semantics and global image structure.
  • CRNN loss is used for text recognition tasks, helping the model align with the structural features of characters, ensuring accurate reconstruction and recognition of textual information.
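The sketch below illustrates the common pattern behind these feature-based losses: both images pass through a frozen pretrained network, and the loss is a weighted L1 distance between intermediate feature maps. The torchvision VGG16 backbone, tapped layer indices, and weights are illustrative assumptions rather than the exact configuration used in this paper.

```python
# Generic feature-based perceptual loss sketch, assuming a torchvision VGG16
# backbone; tapped layers and weights are illustrative.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in features.parameters():
    p.requires_grad_(False)

TAP_LAYERS = {3: 0.1, 8: 0.1, 15: 1.0, 22: 1.0}  # ReLU outputs of conv blocks (illustrative)

def perceptual_loss(sr, hr):
    """Weighted L1 distance between deep feature maps of the SR and HR images."""
    loss, x, y = 0.0, sr, hr
    for idx, layer in enumerate(features):
        x, y = layer(x), layer(y)
        if idx in TAP_LAYERS:
            loss = loss + TAP_LAYERS[idx] * F.l1_loss(x, y)
    return loss

sr = torch.rand(1, 3, 40, 110, requires_grad=True)  # stand-in SR output
hr = torch.rand(1, 3, 40, 110)                       # stand-in HR reference
print(perceptual_loss(sr, hr).item())
```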

3. Methodology

3.1. Overview

Figure 4 illustrates the workflow of training the SwinFIR super-resolution (SR) model with a perceptual loss function. The process begins with a low-resolution input image (ILR), which is fed into the SwinFIR model to generate a super-resolution output (ISR). The actual high-resolution image (IHR) serves as the ground-truth reference. The low-resolution image (ILR) is cropped from a video frame to the license plate region, and only frames in which the OCR recognizer correctly identifies at least five characters are used, ensuring sufficient feature information for effective perceptual feature extraction.
The perceptual loss function plays a crucial role in optimizing the model’s performance. Both ISR and IHR are processed through a pretrained network to extract high-level features, which are then compared using a loss computation module. This computed perceptual loss guides the training process, encouraging the model to generate high-quality images that closely resemble the ground truth. Unlike traditional pixel-wise losses such as mean squared error (MSE) or L1 loss, perceptual loss enhances image reconstruction quality by leveraging multi-level feature comparisons, leading to more visually realistic results.
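A minimal sketch of one such training step is given below, assuming a placeholder convolutional network in place of SwinFIR and any feature-based `perceptual_loss` callable (for example, the VGG sketch above); the pixel/perceptual weighting is illustrative.

```python
# One training step combining a pixel-wise loss with a perceptual term.
# `sr_model` stands in for SwinFIR; LR crops are assumed pre-rescaled to the
# HR size (110 x 40), and the 0.1 weighting is illustrative.
import torch
import torch.nn as nn

sr_model = nn.Sequential(            # placeholder network, not SwinFIR
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1)
)
optimizer = torch.optim.Adam(sr_model.parameters(), lr=2e-4)
mse = nn.MSELoss()

def training_step(i_lr, i_hr, perceptual_loss, lam=0.1):
    i_sr = sr_model(i_lr)                                   # I_SR = f(I_LR)
    loss = mse(i_sr, i_hr) + lam * perceptual_loss(i_sr, i_hr)
    optimizer.zero_grad()
    loss.backward()                                         # back-propagate to the SR model
    optimizer.step()
    return loss.item()

i_lr = torch.rand(4, 3, 40, 110)
i_hr = torch.rand(4, 3, 40, 110)
print(training_step(i_lr, i_hr, lambda a, b: torch.tensor(0.0)))  # dummy perceptual term
```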

3.2. Dataset Generation

  • Chinese license plate dataset based on LSV-LP [21], with seven characters per plate.
  • Relative motion between camera and target vehicle: moving vs. moving, and static vs. moving.
  • Data pair of HR/LR images:
    HR image: frame captured at near distance.
    LR image: frame captured at far distance, with at least five characters correctly recognized by the OCR recognizer.
  • Total images: 783 pairs of LR/HR.
  • Rescaled resolution: 110 × 40.
  • Train/validation/test split: 7:2:1.
Examples of image pairs are shown in Figure 5; a minimal pairing-and-split sketch follows.
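The sketch below assumes a hypothetical directory layout in which LR and HR crops share file stems; the released dataset's actual organization may differ.

```python
# Hedged sketch of assembling LR/HR pairs and a 7:2:1 split; directory layout
# and file naming here are hypothetical, not the released dataset's.
import random
from pathlib import Path

def build_pairs(root="plates"):
    lr_dir, hr_dir = Path(root, "LR"), Path(root, "HR")
    # Pair by shared stem, e.g. LR/0001.png <-> HR/0001.png (hypothetical naming).
    return sorted((lr_dir / p.name, p) for p in hr_dir.glob("*.png")
                  if (lr_dir / p.name).exists())

def split_7_2_1(pairs, seed=0):
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return (pairs[:n_train],                      # 70% train
            pairs[n_train:n_train + n_val],       # 20% validation
            pairs[n_train + n_val:])              # 10% test

train, val, test = split_7_2_1(build_pairs())
print(len(train), len(val), len(test))
```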

3.3. Ablation Study on PixelShuffle Three-Fold Attention Module

In this study, we evaluate the performance of the PixelShuffle Three-Fold Attention module proposed by Nascimento et al. (2023) [16] for license plate super resolution. The model applies artificial Gaussian noise to simulate low-resolution images, aiming to improve robustness to real-world distortions. It uses a combination of MSE and perceptual loss to balance pixel-level accuracy and visual quality, both of which are important for OCR performance. We conducted an ablation study using our own dataset and assessed OCR accuracy at different match levels: exact (7/7), near (6/7), and partial (5/7).
However, the results reveal two main limitations: (1) the model struggles with real-world noise, as synthetic Gaussian noise fails to capture real degradation patterns, and (2) the loss function focuses on multi-task objectives (e.g., detection, segmentation) rather than OCR-specific perceptual fidelity. This leads to low exact-match OCR accuracy (3.85%) and only moderate partial recognition (67.95%).
To address these issues, we adopt SwinFIR, a more advanced architecture for image restoration, and design a new OCR-oriented perceptual loss tailored specifically to enhance character-level details in license plate images. This approach aims to better bridge the gap between visual super-resolution and accurate text recognition.
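The tiered metric itself can be computed as below: a plate counts toward the k/7 tier when at least k of its seven characters match the ground truth by position (the example strings are synthetic).

```python
# Tiered OCR accuracy: fraction of plates with at least k of 7 correct characters.
def correct_chars(pred: str, gt: str) -> int:
    return sum(p == g for p, g in zip(pred, gt))

def tiered_accuracy(preds, gts, tiers=(7, 6, 5)):
    counts = {k: 0 for k in tiers}
    for pred, gt in zip(preds, gts):
        hits = correct_chars(pred, gt)
        for k in tiers:
            counts[k] += hits >= k
    return {f"{k}/7": counts[k] / len(gts) for k in tiers}

preds = ["甘A12345", "甘A12395", "鲁A10045"]   # synthetic predictions
gts   = ["甘A12345", "甘A12345", "甘A12345"]   # synthetic ground truth
print(tiered_accuracy(preds, gts))             # {'7/7': 0.33..., '6/7': 0.66..., '5/7': 0.66...}
```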

3.4. Super Resolution Model of SwinFIR

The SwinFIR [13] super-resolution (SR) model shown in Figure 6 is specifically designed to enhance the quality of low-resolution license plate images, improving their clarity for Optical Character Recognition (OCR) tasks. SwinFIR adopts a hierarchical architecture based on the Swin Transformer, which captures both local textures and global structural information through shifted window self-attention. It integrates convolutional layers for fine-grained detail enhancement and leverages feature interaction across multiple scales to preserve edge sharpness and character structure—crucial for license plate readability.
As illustrated in the pipeline in Figure 7, the process begins with a low-resolution input image (ILR), which passes through the SwinFIR network to produce a super-resolved output (ISR). During training, the model uses a perceptual loss function to guide reconstruction, comparing the ISR to the high-resolution ground-truth image (IHR). This loss function incorporates deep feature similarities to better retain semantic content and texture consistency, ensuring the SR output is visually coherent and OCR-friendly.

3.5. Perceptual Loss Calculation

Figure 8 illustrates the super-resolution framework for License Plate Recognition (LPR) using a perceptual loss function. The framework takes a high-resolution ground truth image (IHR) and a super-resolved image (ISR) as inputs and feeds them into a loss network composed of multiple pretrained models: convolutional neural networks [35] (VGG-16 and VGG-19), a Transformer-based Swin Transformer, and a recurrent neural network (CRNN). These networks extract deep feature representations from both images, which are then compared to compute the perceptual loss. The final loss is a combination of several components, including the Multi-task MSE, DISTS, VGG, Swin Transformer, and CRNN losses. This perceptual loss guides the super-resolution model to produce visually accurate and semantically meaningful reconstructions, thereby enhancing the performance of license plate recognition tasks.
The original SwinFIR model used Charbonnier loss [34], a smooth variant of L1 loss, for color image reconstruction. However, per-pixel losses failed to capture perceptual differences, leading to unsatisfactory results. To improve performance, we adopted perceptual loss, which leverages high-level features from a pre-trained network (Table 2) to better align with human visual perception. This approach enhanced the super-resolution training of SwinFIR.
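For reference, the Charbonnier pixel loss can be written in a few lines; the epsilon value below is a typical choice rather than the exact constant used in SwinFIR.

```python
# Charbonnier loss, the smooth L1 variant originally used by SwinFIR;
# eps is a small stabilizing constant (value here is illustrative).
import torch

def charbonnier_loss(sr, hr, eps=1e-3):
    return torch.sqrt((sr - hr) ** 2 + eps ** 2).mean()

sr, hr = torch.rand(1, 3, 40, 110), torch.rand(1, 3, 40, 110)
print(charbonnier_loss(sr, hr).item())
```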

3.5.1. Multi-Task MSE Loss

Building on the framework of Figure 8, the Multi-Task MSE Loss computation described in Appendix A.1 is used to train the SwinFIR super-resolution model with a focus on license plate character recognition; its key components are shown in Figure 9.
By combining both terms, the Multi-Task MSE Loss balances pixel-level reconstruction accuracy with OCR consistency, making it particularly effective for license plate super resolution [41] and character recognition enhancement [42].
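A hedged sketch of this loss, following Equation (A1) of Appendix A.1, is given below; the tiny linear `ocr_model` merely stands in for the multi-task OCR network f_MT.

```python
# Sketch of the Multi-Task MSE loss of Appendix A.1: pixel-wise MSE plus the
# absolute difference between the OCR model's outputs on the HR and SR images.
# `ocr_model` is a placeholder for the multi-task OCR network f_MT.
import torch
import torch.nn as nn

ocr_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 40 * 110, 7 * 36))  # placeholder f_MT

def multitask_mse_loss(i_sr, i_hr):
    pixel_term = torch.mean((i_hr - i_sr) ** 2)                              # MSE on pixels
    ocr_term = torch.mean(torch.abs(ocr_model(i_hr) - ocr_model(i_sr)))      # |f_MT(HR) - f_MT(SR)|
    return pixel_term + ocr_term

i_sr = torch.rand(2, 3, 40, 110, requires_grad=True)
i_hr = torch.rand(2, 3, 40, 110)
print(multitask_mse_loss(i_sr, i_hr).item())
```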

3.5.2. Deep Image Structure and Texture Similarity (DISTS) Loss

The DISTS loss [17] is calculated with Equation (A2) of Appendix A.2 as the sum of the texture-similarity and structure-similarity differences between the SR and HR images. The network extracts the feature outputs of specific VGG16 [13] layers, {conv1_2, conv2_2, conv3_3, conv4_3, conv5_3}, with weights {0.1, 0.1, 1, 1, 1} and constants $c_1 = c_2 = 10^{-6}$, where $\tilde{x}_j^{(i)}$ denotes the j-th feature map of the i-th convolutional layer of the super-resolved image and $\tilde{y}_j^{(i)}$ the corresponding feature map of the ground-truth image; $\mu_{\tilde{x}_j^{(i)}}$, $\mu_{\tilde{y}_j^{(i)}}$, $\sigma_{\tilde{x}_j^{(i)}}$, and $\sigma_{\tilde{y}_j^{(i)}}$ denote the global means and standard deviations of $\tilde{x}_j^{(i)}$ and $\tilde{y}_j^{(i)}$.
The perceptual features of the SR and HR images are extracted from a VGG16 [13] network pretrained on KADID-10k [40], as shown in Figure 10, and the resulting perceptual loss is back-propagated to the SwinFIR SR model.
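The per-feature-map texture and structure terms behind DISTS (Equations (A3) and (A4)) reduce to simple statistics of the feature maps, as sketched below with random tensors standing in for VGG16 features; the equal weighting in the final line is illustrative.

```python
# Per-feature-map texture (l) and structure (s) terms of DISTS (Appendix A.2),
# computed from global means, variances, and covariance; c1/c2 follow the
# small constants quoted in the text.
import torch

def dists_terms(x, y, c1=1e-6, c2=1e-6):
    mu_x, mu_y = x.mean(dim=(-2, -1)), y.mean(dim=(-2, -1))
    var_x = x.var(dim=(-2, -1), unbiased=False)
    var_y = y.var(dim=(-2, -1), unbiased=False)
    cov_xy = ((x - mu_x[..., None, None]) * (y - mu_y[..., None, None])).mean(dim=(-2, -1))
    texture = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)     # Eq. (A3)
    structure = (2 * cov_xy + c2) / (var_x + var_y + c2)                # Eq. (A4)
    return texture, structure

x = torch.rand(1, 64, 20, 55)   # stand-in feature maps of a VGG16 layer (SR image)
y = torch.rand(1, 64, 20, 55)   # corresponding feature maps of the HR image
t, s = dists_terms(x, y)
print(1 - 0.5 * (t.mean() + s.mean()))   # toy DISTS-style dissimilarity, equal weights
```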

3.5.3. VGG Loss

The VGG loss [18] is the sum of a perceptual loss and a style loss computed from the VGG-19 network pretrained on ImageNet-1K. The perceptual loss is extracted from the feature maps of {conv1_2, conv2_2, conv3_4, conv4_4, conv5_4}, with weights {0.1, 0.1, 1, 1, 1}, as the L1 distance between the feature maps of $I^{SR}$ and $I^{HR}$ taken from the j-th convolution layer preceding the i-th max-pooling layer, as given in Equations (A5) and (A6) of Appendix A.3.
The perceptual features of the SR and HR images are extracted from the pretrained VGG19 network on ImageNet-1K, as shown in Figure 11, and the resulting perceptual loss is back-propagated to the SwinFIR SR model.
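A hedged sketch of this combined perceptual-plus-style objective is shown below, using torchvision's ImageNet-pretrained VGG19 and Gram matrices for the style term; the tapped layers and weights are illustrative and differ from the exact set quoted above.

```python
# VGG-style loss sketch (Appendix A.3): L1 perceptual term on selected VGG19
# feature maps plus a style term on their Gram matrices. Layer picks and
# weights are illustrative.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

feats = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in feats.parameters():
    p.requires_grad_(False)
TAPS = {3: 0.1, 8: 0.1, 17: 1.0}         # relu1_2, relu2_2, relu3_4 (illustrative)

def gram(f):                              # style statistics of a feature map
    b, c, h, w = f.shape
    f = f.flatten(2)
    return f @ f.transpose(1, 2) / (c * h * w)

def vgg_loss(sr, hr, a_perc=1.0, a_style=1.0):
    loss, x, y = 0.0, sr, hr
    for idx, layer in enumerate(feats):
        x, y = layer(x), layer(y)
        if idx in TAPS:
            loss = loss + TAPS[idx] * (a_perc * F.l1_loss(x, y)
                                       + a_style * F.l1_loss(gram(x), gram(y)))
    return loss

print(vgg_loss(torch.rand(1, 3, 40, 110), torch.rand(1, 3, 40, 110)).item())
```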

3.5.4. Swin Transformer Loss

We introduce a novel perceptual loss, the Swin Transformer (SwinT) [19] loss, which extracts features with a Swin Transformer that has 96 channels in the initial stage and four stages structured as {2, 2, 6, 2} layers. In the experiment, we chose {layers.0.blocks.1, layers.1.blocks.1, layers.2.blocks.5, layers.3.blocks.1} with weights {0.1, 0.1, 1, 1} as the feature-extraction layers.
The perceptual and style features are summed to obtain the key features of an image, as described in Equation (A7) of Appendix A.4, and the difference between the key features of the low-resolution (LR) and high-resolution (HR) images is defined as the SwinT loss in Equation (A8). These features are extracted using a pretrained Swin Transformer network on ImageNet-1K [26], as shown in Figure 12. The SwinT loss, computed as the difference in perceptual features between the LR-HR image pair, is then backpropagated to optimize the SwinFIR super-resolution (SR) model.
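The sketch below approximates this loss with torchvision's `swin_t` (embed dim 96, depths 2/2/6/2) as a stand-in for the pretrained Swin Transformer, tapping two stages via forward hooks; the tapped stages, weights, and the 224 × 224 resize are illustrative assumptions, and torchvision ≥ 0.13 is assumed for the weights API.

```python
# Swin-Transformer perceptual loss sketch using torchvision's swin_t; the
# tapped stages, weights, and the 224x224 resize are illustrative choices.
import torch
import torch.nn.functional as F
from torchvision.models import swin_t, Swin_T_Weights

model = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1).eval()
for p in model.parameters():
    p.requires_grad_(False)

taps = {"features.3": 0.1, "features.5": 1.0}   # tapped stages and weights (illustrative)
captured = {}
for name, module in model.named_modules():
    if name in taps:
        module.register_forward_hook(
            lambda mod, inp, out, n=name: captured.__setitem__(n, out))

def swin_features(img):
    captured.clear()
    # Resize to a size the ImageNet-pretrained backbone expects (assumption).
    x = F.interpolate(img, size=(224, 224), mode="bicubic", align_corners=False)
    model(x)
    return dict(captured)

def swint_loss(i_sr, i_hr):
    f_sr, f_hr = swin_features(i_sr), swin_features(i_hr)
    return sum(w * F.l1_loss(f_sr[k], f_hr[k]) for k, w in taps.items())

print(swint_loss(torch.rand(1, 3, 40, 110), torch.rand(1, 3, 40, 110)).item())
```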

3.5.5. Convolutional Recurrent Neural Network (CRNN) Loss

A Convolutional Recurrent Neural Network (CRNN) [40] is effective for Optical Character Recognition (OCR) as it combines CNNs for feature extraction and RNNs [43] for sequential text modeling. The CNN captures character shapes and local patterns, while the RNN (e.g., LSTM [44] or GRU [45]) learns dependencies between characters. Unlike traditional CNN-based OCR models, CRNNs handle variable-length text without character segmentation and use Connectionist Temporal Classification (CTC) loss for alignment-free text prediction. This makes CRNNs ideal for recognizing scene text [46], handwriting [47], license plates [48], and historical manuscripts [49].
When configuring the CRNN loss, we evaluated various layer combinations drawn from both convolutional and ReLU activation stages. Using CRNN loss alone, the optimal set of layers was {relu1_4, relu2_2, relu3_2, relu4_2}. However, when pairing CRNN loss with other losses, the combination {conv3_2, conv4_1, conv4_2} yielded superior results—except in the case of DISTS loss, where {relu3_2, relu4_1, relu4_2} proved most effective. In all experiments, we assigned equal weight {1} to each selected layer. The CRNN perceptual L1 loss itself is computed as the element-wise L1 difference between corresponding high-resolution and super-resolution feature maps, as defined in Equations (A9) and (A10) of Appendix A.5.
The perceptual features of the SR and HR images are extracted from a CRNN network pretrained on the CCPD [22] and CRPD [20] Chinese license plate datasets (code available on GitHub [15]), as shown in Figure 13; the difference between the features of the SR-HR image pair is used as the loss back-propagated to the SwinFIR SR model.
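A hedged sketch of the CRNN perceptual L1 loss (Appendix A.5) is given below; a small frozen convolutional stack stands in for the pretrained Chinese-plate CRNN, and the tapped layer names and unit weights are illustrative.

```python
# CRNN perceptual L1 loss sketch: L1 differences between feature maps tapped
# from a recognizer's convolutional stack. The stand-in backbone and tapped
# layer names are illustrative, not the pretrained CRNN used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(   # stand-in for the CRNN's convolutional feature extractor
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
).eval()
for p in backbone.parameters():
    p.requires_grad_(False)

taps = {"3": 1.0, "6": 1.0}          # tapped submodule names with unit weights
captured = {}
for name, module in backbone.named_modules():
    if name in taps:
        module.register_forward_hook(lambda m, i, o, n=name: captured.__setitem__(n, o))

def crnn_features(img):
    captured.clear()
    backbone(img)
    return dict(captured)

def crnn_loss(sr, hr):
    f_sr, f_hr = crnn_features(sr), crnn_features(hr)
    return sum(w * F.l1_loss(f_sr[k], f_hr[k]) for k, w in taps.items())

print(crnn_loss(torch.rand(1, 3, 40, 110), torch.rand(1, 3, 40, 110)).item())
```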

3.6. Validation and Test Flow

We partition the self-built dataset into training (70%), validation (20%), and testing (10%) sets. The training set is used to optimize the SwinFIR model by minimizing the perceptual loss. The validation set helps prevent overfitting by selecting the checkpoint that maximizes the Structural Similarity Index (SSIM) [50], as shown in Figure 14 for the SwinT loss as an example. The maximum SSIM occurs at 5000 epochs and yields the best OCR recognition result.
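A sketch of this SSIM-based checkpoint selection is shown below, using scikit-image's `structural_similarity` (version ≥ 0.19 assumed for the `channel_axis` argument); the validation pairs are synthetic placeholders.

```python
# SSIM-based validation: keep the checkpoint whose SR outputs maximize mean
# SSIM against the HR references. Validation pairs here are synthetic.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mean_ssim(sr_images, hr_images):
    scores = [ssim(sr, hr, channel_axis=-1, data_range=255)
              for sr, hr in zip(sr_images, hr_images)]
    return float(np.mean(scores))

val_hr = [(np.random.rand(40, 110, 3) * 255).astype(np.uint8) for _ in range(4)]
val_sr_by_epoch = {1000: val_hr, 5000: val_hr}   # placeholder SR outputs per checkpoint
best_epoch = max(val_sr_by_epoch, key=lambda e: mean_ssim(val_sr_by_epoch[e], val_hr))
print("selected checkpoint:", best_epoch)
```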
Finally, the test set evaluates the license plate reconstruction after super-resolution and measures the OCR character recognition accuracy. The test dataset distribution, illustrated in Figure 15, includes numbers (0–9), uppercase English letters (A–Z, excluding ‘O’), and Chinese province abbreviations (grouped as “others”).
The test flow assesses the SwinFIR super-resolution model in enhancing license plate recognition accuracy. A low-resolution image (ILR) is processed by SwinFIR to generate a high-resolution output (ISR), which is then analyzed by a pretrained CRNN OCR recognizer [10]. The recognized characters are compared to the ground truth to evaluate OCR accuracy, demonstrating SwinFIR’s effectiveness in improving license plate clarity (Figure 16).

4. Result and Discussion

4.1. OCR License Plate Recognition Accuracy by Tiers

To evaluate image super-resolution (SR) models for license plate restoration, we assessed OCR performance across four accuracy tiers, as shown in Table 3. These metrics reveal how well each model supports text extraction under challenging conditions, such as blurring or occlusion. This tiered analysis highlights both peak accuracy (7/7) and robustness in error-tolerant scenarios (6/7, 5/7), offering a practical view of model performance for real-world applications in transportation and surveillance.
We compare the OCR accuracy of the different models in Figure 17 and Table 4 and make the following observations:
  • Trend discovery reveals a clear performance scaling across tiers, with all models showing improved accuracy as tolerance increases from 7/7 to 6/7 to 5/7. This indicates that even models with lower full-plate accuracy retain partial character information that supports approximate recognition. For example, CRNN (grey line) improves from 25.64% at 7/7 accuracy to 61.54% at 6/7 and 82.05% at 5/7, demonstrating strong robustness despite slight degradations in image quality.
  • Model ranking and pattern clustering:
    • Lower-tier models: Multi-task MSE and raw LR input fall behind, demonstrating limited ability to enhance OCR-relevant features.
    • Mid-tier models: DISTS, SwinT, and VGG, all CNN- or Transformer-based, offer reliable OCR improvement. Models leveraging perceptual or deep features, as described in Table 4 (e.g., CRNN, SwinT), group in the higher performance band.
    • Top-tier models: CRNN + VGG and SwinT + DISTS consistently outperform the others, indicating the value of multi-loss and hybrid-architecture training strategies.
  • Correlation with feature integration: The inclusion of semantic losses (e.g., CRNN, VGG, SwinT) or ensemble losses (e.g., CRNN + VGG, SwinT + DISTS) shows strong correlation with better OCR performance. This suggests that deeper representations and hybrid loss strategies better preserve text-relevant features.
This visual analysis enables data-driven decisions in model selection for OCR-based license plate restoration. The multi-loss ensemble models stand out as the most reliable, especially SwinT + DISTS, which offers the best balance of full and partial recognition performance. Clustering and trend analysis reinforce that feature-rich loss functions significantly boost OCR readiness, making them ideal for deployment in intelligent traffic systems, surveillance, and smart cities.

4.2. OCR Character Recognition Accuracy on Average

We analyzed the average OCR character recognition accuracy (Ri) to evaluate the overall effectiveness of each SR model in enhancing degraded license plates for OCR systems, as shown in Table 5:
  • LR as baseline:
    • The LR baseline yields an average OCR accuracy of 75.57%, correctly recognizing about 5.29 characters per plate.
    • Any model exceeding this benchmark illustrates the positive impact of super-resolution (SR) techniques on OCR capability.
  • Simple loss:
    • Charbonnier loss leads to a notable performance drop (−6.89%), proving inadequate in capturing perceptual fidelity.
    • Multi-task MSE, while slightly better, still falls below baseline, revealing the limitations of traditional pixel-based loss functions in restoring visually coherent character structures.
  • Individual perceptual loss:
    • Models with CRNN, SwinT, DISTS, and VGG as perceptual losses all surpass the baseline, indicating the importance of architectural alignment between the SR model and the perceptual loss.
    • VGG stands out with 82.57% accuracy and 5.78 recognized characters, emphasizing the power of deep CNNs in capturing semantic features for OCR. This is consistent with Y. Liu et al. [51], who showed that a generic perceptual loss is applicable to structured learning tasks, including super-resolution, style transfer, and image segmentation.
  • Ensemble of Perceptual Losses:
    • Combining perceptual loss functions further boosts OCR accuracy:
      CRNN + VGG: 83.43%, 5.84 characters.
      SwinT + DISTS: 85.14%, 5.96 characters, which is the best-performing combination.
    • These combinations capitalize on complementary strengths:
      Spatial fidelity from CNNs and DISTS.
      Sequential structure from CRNN.
      Global context understanding from the Swin Transformer.
These combinations closely mimic human perception, which integrates both detail and context for character recognition.
Incorporating perceptual losses significantly improves character recognition in license plate SR and OCR tasks. Ensembles like SwinT + DISTS and CRNN + VGG deliver the best results by combining spatial, sequential, and contextual strengths. Perceptual losses help restore structural details from degraded images affected by motion blur, low brightness, skew, and resolution, bringing performance closer to that of high-resolution ground truth.

4.3. Super-Resolution Images by Perceptual Losses

We applied trained super-resolution models with various perceptual losses to restore low-resolution images and assessed their impact on license plate recognition using an OCR recognizer, as illustrated in Figure 18. Notably, the VGG-based model and the CRNN + VGG ensemble achieved the highest OCR accuracy, demonstrating the effectiveness of perceptual supervision in improving both visual quality and character recognition. A detailed analysis of the other models can be found in Appendix B.
1. LR image:
In low-resolution (LR) license plate images, character substitution is common with frequent misrecognitions, such as “1C” being interpreted as “T”, “D” as “0”, or Chinese characters, which are logographic symbols used in the Chinese writing system, like “甘” misread as “鲁”. These errors are especially prevalent in the initial region of the plate, where province codes are typically located—likely due to edge blur and resolution fall-off at the image boundaries. Interestingly, numbers are generally better preserved than letters, possibly because of their simpler and more distinct visual structures. Overall, semantic coherence is lacking, as no domain-specific correction or contextual filtering has been applied to refine or validate the OCR output based on plate syntax or geographic consistency.
2. VGG loss:
Super resolution using VGG perceptual loss produces high-quality, visually pleasing images with enhanced texture and structural fidelity. It effectively reduces noise and artifacts, preserves character spacing, and maintains the overall layout and background consistency of license plates. However, in character-sensitive applications like license plate recognition, VGG loss may prioritize aesthetics over semantic accuracy. This can lead to misrecognitions, such as “8” being interpreted as “B” or “0” as “D” due to the lack of a character-level understanding. While the results appear sharp and realistic, the loss function lacks semantic awareness, meaning it does not account for the meaning or identity of the characters. To address this limitation, it is advisable to combine VGG loss with recognition-aware or character-level constraints to improve OCR robustness without sacrificing visual quality.
3. Ensemble loss of SwinT + DISTS:
The SwinT model excels at reconstructing clean, well-defined character structures and preserving license plate layouts, including accurate spacing and background details. It enhances the sharpness of number sequences, improving readability, though it can misinterpret similar characters (e.g., “1” as “T”) due to a lack of semantic correction. DISTS maintains visual and structural integrity with a natural appearance and balanced detail but also struggles with character confusions like “8” as “B”, as it prioritizes perceptual quality over textual accuracy.
The SwinT + DISTS ensemble effectively combines SwinT’s structural clarity with DISTS’s perceptual refinement, producing sharp, uniform outputs with consistent layout and improved readability. While the ensemble mitigates some individual weaknesses, integrating a recognition-based loss such as CRNN could further enhance character-level accuracy for OCR tasks.

4.4. Character Recognition Accuracy for Single Loss on Each Position

We also analyzed the character recognition accuracy at each license plate position, reading from left to right as positions 0 to 6 (Figure 17), for the different perceptual losses, as listed in Figure 19.
The character-wise OCR accuracy from Figure 20a–g reveals how different super-resolution loss models affect recognition across license plate positions. At position 0, SwinT leads with 76.92%, highlighting its strength in recovering early characters where the LR baseline struggles (28.77%). CRNN performs best at position 1 (91.03%), while VGG dominates mid-sequence positions (2–4), likely due to its texture-preserving perceptual loss. DISTS also peaks in central positions, reinforcing the importance of structure-based losses. Notably, Multi-Task MSE underperforms across all positions, possibly due to diluted focus on fine details.
At positions 5 and 6, performance trends shift—CRNN and LR baselines outperform super-resolved results, suggesting some models introduce artifacts that hurt recognition in later characters. Surprisingly, LR achieves the highest accuracy at position 6 (81.19%), while complex models like Multi-Task MSE drop sharply.
In summary, no single model excels across all positions: SwinT is strongest early, VGG in mid-sequence, and CRNN in later characters. These findings highlight the potential of position-aware or hybrid loss strategies for better overall OCR accuracy.

4.5. Evaluation Result on Other Datasets

We evaluated our model using a publicly available Chinese license plate generator [52], which creates synthetic images with a blue background, white alphanumeric characters, and a leading Chinese symbol. To simulate real-world conditions, the high-resolution (HR) images undergo augmentations such as perspective distortion, HSV adjustments, Gaussian noise, and stain artifacts, as shown in Figure 21 (“HR img”).
To generate low-resolution (LR) images, we added Gaussian noise to the HR images, downscaled them by a factor of two, and then upscaled them using bicubic interpolation. This simulates degradation from low-quality sensors or compression, resulting in blurred edges and distorted characters (Figure 21, “LR img”).
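The degradation pipeline can be sketched as below; the Gaussian noise level and OpenCV resizing choices are illustrative of the procedure described above rather than the exact augmentation parameters.

```python
# Synthetic degradation sketch: add Gaussian noise to the HR plate, downscale
# by 2x, then upscale back with bicubic interpolation; the noise level is illustrative.
import numpy as np
import cv2

def degrade(hr_img, sigma=8.0):
    noisy = hr_img.astype(np.float32) + np.random.normal(0, sigma, hr_img.shape)
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)
    h, w = noisy.shape[:2]
    small = cv2.resize(noisy, (w // 2, h // 2), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_CUBIC)

hr = (np.random.rand(40, 110, 3) * 255).astype(np.uint8)   # stand-in HR plate
lr = degrade(hr)
print(lr.shape)   # (40, 110, 3), blurred and noise-corrupted
```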
We evaluated two SwinFIR-based super-resolution (SR) models: one trained with VGG perceptual loss and the other with a combined Swin Transformer (SwinT) + DISTS loss. The VGG loss model improves the structure but retains artifacts, while the SwinT + DISTS model produces sharper, more accurate characters with fewer distortions (Figure 21, “SR img”). These results highlight the effectiveness of perceptual losses that balance structure and texture, improving OCR accuracy and visual clarity.
The SR (super-resolution) model demonstrates notable improvements over the LR (low-resolution) baseline in terms of OCR accuracy, as shown in Table 6. The average OCR accuracy for SR is 88.14%, surpassing the 87.89% achieved by LR. Additionally, SR shows a slight but consistent advantage in character accuracy per license plate, with a score of 6.17 compared to LR’s 6.15. This indicates that the SR model is more effective at accurately recognizing characters, making it a stronger candidate for OCR tasks that require precise character recognition.
Beyond raw accuracy, the SR model stands out in its ability to perform advanced perceptual feature extraction, which is critical for handling variations and noise in the input data. The artificial noise introduced during data augmentation closely mimics real-world scenarios, where images often suffer from imperfections such as distortion, low contrast, and environmental interference. This alignment with real-world conditions enables the SR model to capture subtle details and better generalize to noisy environments. In OCR applications, where the input quality can fluctuate significantly due to lighting, camera quality, or weather conditions, this enhanced feature extraction helps the model understand the structure and context of the text more effectively, improving its real-world performance.
In terms of training efficiency and error reduction, the SR model also outperforms LR by achieving lower VGG Loss and SwinT + DISTS Loss scores. These reduced loss values indicate that the SR model is more proficient in learning essential visual features and minimizing image quality-related errors. The combination of perceptual loss-guided training and noise handling allows the SR model to recover critical details from degraded images, making it more reliable and accurate. Overall, the SR model’s ability to handle artificial noise that closely mirrors real-world degradation, coupled with improved feature extraction and training performance, makes it a far more robust solution for OCR tasks, especially in noisy or suboptimal environments, compared to LR.

5. Conclusions and Future Work

Our study presents a comprehensive approach to improving license plate recognition (LPR) under real-world conditions such as motion blur, poor lighting, and low resolution. We introduced a novel LR-HR paired dataset sourced from real surveillance footage, enabling more realistic super-resolution (SR) training. Our SwinFIR-based SR model, trained with diverse perceptual loss functions, significantly enhances degraded plate images and boosts OCR performance.
Extensive evaluations revealed key insights: the CRNN + VGG ensemble and SwinT + DISTS models achieved the highest OCR accuracy (89.74%) on full seven-character plates, demonstrating the effectiveness of combining sequential modeling with perceptual or attention-based loss strategies. On moderately degraded six-character plates, VGG, DISTS, and CRNN achieved strong OCR results—88.46%, 84.62%, and 82.05%, respectively—highlighting the critical role of perceptual loss selection in character restoration.
To ensure deployability, we applied model distillation [53,54] to transfer knowledge from high-capacity teacher models to lightweight student networks, reducing model size and computational demand while preserving accuracy. This makes our system suitable for real-time, on-device applications [55], such as traffic cameras and mobile patrol units.
In summary, we contribute a realistic dataset, an effective SR framework, and a deployable LPR pipeline. Future work will (1) investigate OCR performance for samples whose initial LR OCR accuracy is below five correctly recognized characters, and (2) evaluate how such challenging samples respond to SR + OCR processing. We also aim to explore domain adaptation for varied weather and lighting conditions, and to further compress the model to enable faster inference at the edge.

Author Contributions

Conceptualization, J.-Y.S. and P.-F.T.; methodology, J.-Y.S.; software, J.-Y.S.; validation, J.-Y.S. and S.-M.Y.; formal analysis, P.-F.T.; writing—original draft preparation, P.-F.T.; writing—review and editing, P.-F.T.; visualization, P.-F.T.; supervision, S.-M.Y.; project administration, S.-M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The self-built dataset can be downloaded from the NYCU Dataverse at https://doi.org/10.57770/DLAQUN. The code is available on GitHub: https://github.com/nctu-dcs-lab/A-super-resolution-framework-for-license-plate-recognition (accessed on 1 May 2025).

Acknowledgments

The authors would like to express their sincere appreciation to Tse-Kang Yang for his assistance in generating the new experimental data. His contributions were instrumental in enhancing the quality and depth of the revised manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
OCR: Optical character recognition
HR: High resolution
LR: Low resolution
SR: Super resolution
LPR: License plate recognition
MSE: Mean square error
CNN: Convolutional neural network
CRNN: Convolutional recurrent neural network
DISTS: Deep Image Structure and Texture Similarity
SwinT: Shifted windows Transformer
SwinFIR: SwinIR with fast Fourier transform

Appendix A. Formula for Different Perceptual Loss

Appendix A.1. Multi-Task MSE Loss Formula

$$\mathrm{loss}_{\text{multi-task MSE}} = \frac{1}{N}\left(\sum_{n=1}^{N}\left(I_n^{HR} - I_n^{SR}\right)^2 + \sum_{n=1}^{N}\left|f_{MT}\!\left(I_n^{HR}\right) - f_{MT}\!\left(I_n^{SR}\right)\right|\right) \tag{A1}$$
where:
  • $I_n^{HR}$ and $I_n^{SR}$ represent the high-resolution ground truth and the super-resolved image for the n-th sample, respectively.
  • The first term corresponds to the Mean Squared Error (MSE) loss, ensuring pixel-wise similarity between the SR and HR images.
  • The second term captures the absolute difference in predictions from the multi-task OCR model, ensuring consistency in textual information between the SR and HR images.

Appendix A.2. Deep Image Structure and Texture Similarity (DISTS) Loss Formula

$$l_{DISTS}(x, y; \alpha, \beta) = 1 - \sum_{i=0}^{m}\sum_{j=1}^{n_i}\left(\alpha_{ij}\, l\!\left(\tilde{x}_j^{(i)}, \tilde{y}_j^{(i)}\right) + \beta_{ij}\, s\!\left(\tilde{x}_j^{(i)}, \tilde{y}_j^{(i)}\right)\right) \tag{A2}$$
$$l\!\left(\tilde{x}_j^{(i)}, \tilde{y}_j^{(i)}\right) = \frac{2\,\mu_{\tilde{x}_j^{(i)}}\,\mu_{\tilde{y}_j^{(i)}} + c_1}{\left(\mu_{\tilde{x}_j^{(i)}}\right)^2 + \left(\mu_{\tilde{y}_j^{(i)}}\right)^2 + c_1} \tag{A3}$$
$$s\!\left(\tilde{x}_j^{(i)}, \tilde{y}_j^{(i)}\right) = \frac{2\,\sigma_{\tilde{x}_j^{(i)}\tilde{y}_j^{(i)}} + c_2}{\left(\sigma_{\tilde{x}_j^{(i)}}\right)^2 + \left(\sigma_{\tilde{y}_j^{(i)}}\right)^2 + c_2} \tag{A4}$$
where $\tilde{x}_j^{(i)}$ denotes the j-th feature map of the i-th convolutional layer of the super-resolved image, and $\tilde{y}_j^{(i)}$ denotes the j-th feature map of the i-th convolutional layer of the ground-truth image. $\mu_{\tilde{x}_j^{(i)}}$, $\mu_{\tilde{y}_j^{(i)}}$, $(\sigma_{\tilde{x}_j^{(i)}})^2$, $(\sigma_{\tilde{y}_j^{(i)}})^2$, and $\sigma_{\tilde{x}_j^{(i)}\tilde{y}_j^{(i)}}$ represent the global means and variances of $\tilde{x}_j^{(i)}$ and $\tilde{y}_j^{(i)}$, and the global covariance between $\tilde{x}_j^{(i)}$ and $\tilde{y}_j^{(i)}$, respectively. The texture similarity is given in Equation (A3) and the structural similarity in Equation (A4).

Appendix A.3. VGG Loss Formula

$$l_{VGG} = \alpha_{perceptual}\left[\sum_{k=1}^{K} \lambda_{ij}\, l_{VGG_{i,j}}^{\varphi,\,perceptual}\right] + \alpha_{style}\left[\sum_{k=1}^{K} \lambda_{ij}\, l_{VGG_{i,j}}^{\varphi,\,style}\right] \tag{A5}$$
  • $\alpha_{perceptual}$ / $\alpha_{style}$: Weighting coefficients for the perceptual and style losses, respectively.
  • $\lambda_{ij}$: Importance weights assigned to the j-th feature map (before activation) produced by the convolutional layer immediately preceding the i-th max-pooling layer in the VGG19 network.
  • $l_{VGG_{i,j}}^{\varphi,\,perceptual}$ / $l_{VGG_{i,j}}^{\varphi,\,style}$: Loss terms computed from feature maps of VGG layers, measuring content (perceptual) and style similarity.
$$l_{VGG_{i,j}}^{\varphi} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left|\varphi_{VGG_{i,j}}\!\left(I^{HR}\right)_{x,y} - \varphi_{VGG_{i,j}}\!\left(I^{SR}\right)_{x,y}\right| \tag{A6}$$
  • $\varphi_{VGG_{i,j}}$: Feature map extracted from the (i, j)-th layer of the pretrained VGG network.
  • $W_{i,j}$ / $H_{i,j}$: Width and height of the feature maps at layer (i, j).

Appendix A.4. Swin Transformer Loss Formula

$$l_{SwinT} = \alpha_{perceptual}\left[\sum_{k=1}^{K} \lambda_{ij}\, l_{SwinT_{i,j}}^{\varphi,\,perceptual}\right] + \alpha_{style}\left[\sum_{k=1}^{K} \lambda_{ij}\, l_{SwinT_{i,j}}^{\varphi,\,style}\right] \tag{A7}$$
  • $\alpha_{perceptual}$ / $\alpha_{style}$: Weighting coefficients for the perceptual and style losses, respectively.
  • $\lambda_{ij}$: Importance weights for each SwinT layer (i, j), where i indexes the stage and j the block within the stage.
  • $l_{SwinT_{i,j}}^{\varphi,\,perceptual}$ / $l_{SwinT_{i,j}}^{\varphi,\,style}$: Loss terms computed from feature maps of SwinT layers, measuring content (perceptual) and style similarity.
$$l_{SwinT_{i,j}}^{\varphi} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left|\varphi_{SwinT_{i,j}}\!\left(I^{HR}\right)_{x,y} - \varphi_{SwinT_{i,j}}\!\left(I^{SR}\right)_{x,y}\right| \tag{A8}$$
  • $\varphi_{SwinT_{i,j}}$: Feature map extracted from the (i, j)-th layer of the pretrained SwinT network.
  • $W_{i,j}$ / $H_{i,j}$: Width and height of the feature maps at layer (i, j).

Appendix A.5. Convolutional Recurrent Neural Network (CRNN) Loss

$$l_{CRNN} = \sum_{k=1}^{K} \lambda_{ij}\, l_{CRNN_{i,j}}^{\varphi} \tag{A9}$$
  • $l_{CRNN_{i,j}}^{\varphi}$: Loss terms computed from feature maps of CRNN layers, measuring content (perceptual) similarity.
  • $\lambda_{ij}$: Importance weights assigned to the j-th feature map (before activation) produced by the convolutional layer immediately preceding the i-th max-pooling layer in the CRNN network.
$$l_{CRNN_{i,j}}^{\varphi} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left|\varphi_{CRNN_{i,j}}\!\left(I^{HR}\right)_{x,y} - \varphi_{CRNN_{i,j}}\!\left(I^{SR}\right)_{x,y}\right| \tag{A10}$$
  • $\varphi_{CRNN_{i,j}}$: Feature map extracted from the (i, j)-th layer of the pretrained CRNN network.
  • $W_{i,j}$ / $H_{i,j}$: Width and height of the feature maps at layer (i, j).

Appendix B. Evaluation Result on Different Perceptual Loss

Appendix B.1. Multi-Task MSE Model Evaluation Result

The multi-task MSE model achieves a balance between visual reconstruction and text preservation but struggles with character-level fidelity in ambiguous or degraded areas. To improve, integrating more recognition-aware supervision or attention-based enhancements would help.
  • Strengths:
    The overall plate layout and structure are preserved well.
    Produces relatively smooth and clean character forms.
    Background color and plate texture are consistent with the original.
  • Issues:
    Multiple character misrecognitions: 6 → 8, likely due to visual similarity and low-resolution confusion. 3 → S, a common OCR issue due to overlapping shapes.
    These errors suggest that while MSE maintains pixel-level detail, it may lack robustness to fine-grained character disambiguation.
    The auxiliary task influence is insufficient to fully correct these errors—pointing to the need for stronger character-level loss or attention mechanisms.

Appendix B.2. CRNN Loss Evaluation Result

The CRNN loss-based model shows solid potential for text-sensitive super-resolution tasks, particularly in OCR-related scenarios. However, its output may suffer in perceptual quality, and character ambiguity (e.g., 7 → Y) may still persist without auxiliary losses (e.g., perceptual or transformer-based enhancements) to bolster visual quality.
  • Strengths:
    Restores reasonable text structure and preserves character spacing.
    Characters are mostly sharp and well-formed, enabling better performance for downstream recognition.
    The CRNN loss enforces semantic alignment between the prediction and the true label, reducing nonsensical outputs.
  • Issues: The final digit 7 is misclassified as Y.
    Could be due to visual noise, font thickness, or low contrast.
    CRNN loss may overemphasize alignment with learned text patterns rather than true visual fidelity.
    The overall image may appear less natural or smooth and with a mosaic pattern, as CRNN loss does not optimize for perceptual beauty.

Appendix B.3. SwinT Loss Evaluation Result

The SwinT model produces a clean and well-defined structure of the characters, demonstrating its capability to reconstruct fine details effectively. The layout of the license plate, including the character spacing and background, is well preserved, maintaining a natural and realistic appearance. Additionally, SwinT successfully enhances the sharpness of the number sequence, making the license plate more readable and visually coherent compared to the low-resolution input.
  • Strengths:
    Produces clean and well-defined character structures.
    Preserves the license plate layout, including accurate character spacing and background details.
    Effectively recovers the number sequence with enhanced sharpness and clarity.
  • Issues: The character 1 is incorrectly predicted as T, possibly due to:
    Overfitting to visual patterns that resemble the letter T.
    Absence of character-level semantic correction in the post-processing stage.

Appendix B.4. DISTS Loss Evaluation Result

The DISTS model performs well in maintaining the visual and structural integrity of license plates, producing images that appear clean and natural. However, it can suffer from semantic misrecognitions (like 8 → B, Z → 2, 5 → 8), highlighting a limitation when used in isolation without explicit text recognition guidance. Despite this, it offers a good trade-off for tasks requiring perceptual realism.
  • Strengths:
    Maintains the overall plate structure, including spacing and background.
    Produces clean and well-defined character structures.
    Balances fine detail and smoothness, avoiding excessive artifacts.
  • Issues: The digit 8 is mistakenly recognized as the letter B.
    This likely stems from DISTS optimizing for perceptual quality rather than strict text fidelity.
    Visual similarity between characters (e.g., 8 → B) can confuse the model in the absence of semantic constraints.

Appendix B.5. Ensemble Loss of CRNN + VGG Evaluation Result

The CRNN + VGG loss strategy shows promise in improving both visual quality and semantic fidelity. However, its effectiveness may be limited by character ambiguity, especially when visual cues dominate over recognition signals. Fine-tuning the balance between VGG and CRNN loss components could further enhance accuracy for text-critical tasks.
  • Strengths:
    Produces visually appealing and structurally consistent outputs.
    Maintains smooth background, appropriate spacing, and clarity across most characters.
    The VGG loss contributes to natural textures, while CRNN guidance improves recognizability of characters.
  • Issues: the digit 8 is still misinterpreted as B.
    May indicate that the model is still visually biased due to the dominant influence of perceptual loss.
    Suggests the need for stronger weighting on the recognition loss or more robust character supervision.

References

  1. Chang, S.-L.; Chen, L.-S.; Chung, Y.-C.; Chen, S.-W. Automatic license plate recognition. IEEE Trans. Intell. Transp. Syst. 2004, 5, 42–53. [Google Scholar] [CrossRef]
  2. Ozer, M. Automatic licence plate reader (ALPR) technology: Is ALPR a smart choice in policing? Police J. 2016, 89, 117–132. [Google Scholar] [CrossRef]
  3. Shringarpure, D.V. Vehicle Number Plate Detection and Blurring Using Deep Learning. 2023. Available online: https://norma.ncirl.ie/6662/1/darshanvijayshringarpure.pdf (accessed on 1 May 2025).
  4. Castriota, S.; Tonin, M. Stay or flee? Hit-and-run accidents, darkness and probability of punishment. Eur. J. Law Econ. 2023, 55, 117–144. [Google Scholar] [CrossRef]
  5. Gong, H.; Feng, Y.; Zhang, Z.; Hou, X.; Liu, J.; Huang, S.; Liu, H. A Dataset and Model for Realistic License Plate Deblurring. arXiv 2024, arXiv:2404.13677. Available online: https://www.ijcai.org/proceedings/2024/86 (accessed on 1 May 2025).
  6. Shi, C.; Wu, C.; Gao, Y. Research on image adaptive enhancement algorithm under low light in license plate recognition system. Symmetry 2020, 12, 1552. [Google Scholar] [CrossRef]
  7. Torkian, A.; Moallem, P. Multi-frame Super Resolution for Improving Vehicle Licence Plate Recognition. Signal Data Process. 2019, 16, 61–76. [Google Scholar] [CrossRef]
  8. Chen, G.-W.; Yang, C.-M.; İk, T.-U. Real-time license plate recognition and vehicle tracking system based on deep learning. In Proceedings of the 2021 22nd Asia-Pacific Network Operations and Management Symposium (APNOMS), Tainan, Taiwan, 8–10 September 2021; pp. 378–381. [Google Scholar]
  9. Jin, X.; Tang, R.; Liu, L.; Wu, J. Vehicle license plate recognition for fog-haze environments. IET Image Process. 2021, 15, 1273–1284. [Google Scholar] [CrossRef]
  10. Wu, Y.-C.; Lee, J.-W.; Wang, H.-C. Robots for search site monitoring, suspect guarding, and evidence identification. IAES Int. J. Robot. Autom. (IJRA) 2020, 9, 84. Available online: https://pdfs.semanticscholar.org/7eeb/edec1fc2cb440425a56140474652aa6afd39.pdf (accessed on 1 May 2025). [CrossRef]
  11. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar] [CrossRef]
  12. Sara, U.; Akter, M.; Uddin, M.S. Image quality assessment through FSIM, SSIM, MSE and PSNR—A comparative study. J. Comput. Commun. 2019, 7, 8–18. [Google Scholar] [CrossRef]
  13. Zhang, D.; Huang, F.; Liu, S.; Wang, X.; Jin, Z. SwinFIR: Revisiting the SwinIR with fast Fourier convolution and improved training for image super-resolution. arXiv 2022, arXiv:2208.11247. [Google Scholar] [CrossRef]
  14. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 2016, 3, 47–57. [Google Scholar] [CrossRef]
  15. we0091234, CRNN Chinese Plate Recognition. Available online: https://github.com/we0091234/crnn_plate_recognition (accessed on 9 April 2025).
  16. Nascimento, V.; Laroca, R.; de A Lambert, J.; Schwartz, W.R.; Menotti, D. Super-resolution of license plate images using attention modules and sub-pixel convolution layers. Comput. Graph. 2023, 113, 69–76. [Google Scholar] [CrossRef]
  17. Ding, K.; Ma, K.; Wang, S.; Simoncelli, E.P. Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2567–2581. [Google Scholar] [CrossRef] [PubMed]
  18. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. Available online: https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf (accessed on 1 May 2025).
  20. Gong, Y.; Deng, L.; Tao, S.; Lu, X.; Wu, P.; Xie, Z.; Ma, Z.; Xie, M. Unified Chinese license plate detection and recognition with high efficiency. J. Vis. Commun. Image Represent. 2022, 86, 103541. [Google Scholar] [CrossRef]
  21. Wang, Q.; Lu, X.; Zhang, C.; Yuan, Y.; Li, X. LSV-LP: Large-Scale Video-Based License Plate Detection and Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 752–767. [Google Scholar] [CrossRef]
  22. Xu, Z.; Yang, W.; Meng, A.; Lu, N.; Huang, H.; Ying, C.; Huang, L. Towards end-to-end license plate detection and recognition: A large dataset and baseline. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 255–271. Available online: https://openaccess.thecvf.com/content_ECCV_2018/papers/Zhenbo_Xu_Towards_End-to-End_License_ECCV_2018_paper.pdf (accessed on 1 May 2025).
  23. Gonc, G.R.; da Silva, S.P.G.; Menotti, D.; Schwartz, W.R. Benchmark for license plate character segmentation. J. Electron. Imaging 2016, 25, 053034. [Google Scholar] [CrossRef]
  24. Laroca, R.; Severo, E.; Zanlorensi, L.A.; Oliveira, L.S.; Gonc, G.R.; Schwartz, W.R.; Menotti, D. A Robust Real-Time Automatic License Plate Recognition Based on the YOLO Detector. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–10. [Google Scholar] [CrossRef]
  25. Laroca, R.; Cardoso, E.V.; Lucio, D.R.; Estevam, V.; Menotti, D. On the Cross-dataset Generalization in License Plate Recognition. In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), Online Event, 6–8 February 2022; pp. 166–178. [Google Scholar] [CrossRef]
  26. Gonc, G.R.; Diniz, M.A.; Laroca, R.; Menotti, D.; Schwartz, W.R. Real-time automatic license plate recognition through deep multi-task networks. In Proceedings of the 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Parana, Brazil, 29 October–1 November 2018; pp. 110–117. [Google Scholar] [CrossRef]
  27. Singh, J.; Bhushan, B. Real time Indian license plate detection using deep neural networks and optical character recognition using LSTM tesseract. In Proceedings of the 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 18–19 October 2019; pp. 347–352. [Google Scholar] [CrossRef]
28. Du, Y.; Li, C.; Guo, R.; Yin, X.; Liu, W.; Zhou, J.; Bai, Y.; Yu, Z.; Yang, Y.; Dang, Q. PP-OCR: A practical ultra lightweight OCR system. arXiv 2020, arXiv:2009.09941. [Google Scholar] [CrossRef]
  29. Yu, W.; Ibrayim, M.; Hamdulla, A. Scene text recognition based on improved CRNN. Information 2023, 14, 369. [Google Scholar] [CrossRef]
  30. Park, S.C.; Park, M.K.; Kang, M.G. Super-resolution image reconstruction: A technical overview. IEEE Signal Process. Mag. 2003, 20, 21–36. [Google Scholar] [CrossRef]
  31. Glasner, D.; Bagon, S.; Irani, M. Super-resolution from a single image. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 349–356. [Google Scholar] [CrossRef]
  32. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part IV 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 184–199. [Google Scholar] [CrossRef]
33. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. Available online: https://openaccess.thecvf.com/content_cvpr_2017/papers/Ledig_Photo-Realistic_Single_Image_CVPR_2017_paper.pdf (accessed on 1 May 2025).
  34. O’shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458. [Google Scholar] [CrossRef]
  35. Lai, W.-S.; Huang, J.-B.; Ahuja, N.; Yang, M.-H. Fast and accurate image super-resolution with deep laplacian pyramid networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2599–2613. [Google Scholar] [CrossRef]
  36. Zhang, K.; Zuo, W.; Zhang, L. Learning a single convolutional super-resolution network for multiple degradations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3262–3271. Available online: https://openaccess.thecvf.com/content_cvpr_2018/papers/Zhang_Learning_a_Single_CVPR_2018_paper.pdf (accessed on 1 May 2025).
  37. Lin, H.; Hosu, V.; Saupe, D. KADID-10k: A large-scale artificially distorted IQA database. In Proceedings of the 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), Berlin, Germany, 5–7 June 2019; pp. 1–3. [Google Scholar] [CrossRef]
  38. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
39. Deng, J.; Wei, H.; Lai, Z.; Gu, G.; Chen, Z.; Chen, L.; Ding, L. Spatial Transform Depthwise Over-Parameterized Convolution Recurrent Neural Network for License Plate Recognition in Complex Environment. J. Comput. Inf. Sci. Eng. 2023, 23, 011010. [Google Scholar] [CrossRef]
40. Du, S.; Ibrahim, M.; Shehata, M.; Badawy, W. Automatic license plate recognition (ALPR): A state-of-the-art review. IEEE Trans. Circuits Syst. Video Technol. 2012, 23, 311–325. [Google Scholar] [CrossRef]
  41. Seibel, H.; Goldenstein, S.; Rocha, A. Eyes on the target: Super-resolution and license-plate recognition in low-quality surveillance videos. IEEE Access 2017, 5, 20020–20035. [Google Scholar] [CrossRef]
  42. Cheriet, M.; Kharma, N.; Liu, C.-L.; Suen, C. Character Recognition Systems: A Guide for Students and Practitioners; John Wiley & Sons: Hoboken, NJ, USA, 2007. [Google Scholar]
43. Medsker, L.R.; Jain, L.C. Recurrent Neural Networks: Design and Applications; CRC Press: Boca Raton, FL, USA, 2001. [Google Scholar]
  44. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef]
  45. Dey, R.; Salem, F.M. Gate-variants of gated recurrent unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017; pp. 1597–1600. [Google Scholar] [CrossRef]
46. Mehrotra, K.; Gupta, M.K.; Khajuria, K. Collaborative deep neural network for printed text recognition of Indian languages. In Proceedings of the 2019 Fifth International Conference on Image Information Processing (ICIIP), Shimla, India, 15–17 November 2019; pp. 252–256. [Google Scholar] [CrossRef]
47. Dutta, K.; Krishnan, P.; Mathew, M.; Jawahar, C.V. Improving CNN-RNN hybrid networks for handwriting recognition. In Proceedings of the 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, 5–8 August 2018; pp. 80–85. [Google Scholar] [CrossRef]
  48. Rao, Z.; Yang, D.; Chen, N.; Liu, J. License plate recognition system in unconstrained scenes via a new image correction scheme and improved CRNN. Expert Syst. Appl. 2024, 243, 122878. [Google Scholar] [CrossRef]
  49. Aguilar, S.T.; Jolivet, V. Handwritten text recognition for documentary medieval manuscripts. J. Data Min. Digit. Humanit. 2023. Available online: https://hal.science/hal-03892163v3/file/HTR_medieval_latin_french_V3.pdf (accessed on 1 May 2025).
50. Dosselmann, R.; Yang, X.D. A comprehensive assessment of the structural similarity index. Signal Image Video Process. 2011, 5, 81–91. [Google Scholar] [CrossRef]
  51. Liu, Y.; Chen, H.; Chen, Y.; Yin, W.; Shen, C. Generic perceptual loss for modeling structured output dependencies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 5424–5432. [Google Scholar]
  52. Github for Chinese License Plate Generation. Available online: https://github.com/zheng-yuwei/license-plate-generator (accessed on 1 May 2025).
  53. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. Available online: https://arxiv.org/pdf/2006.05525 (accessed on 1 May 2025). [CrossRef]
  54. Wu, K.; Zhang, J.; Peng, H.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. Tinyvit: Fast pretraining distillation for small vision transformers. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 68–85. Available online: https://arxiv.org/pdf/2207.10666 (accessed on 1 May 2025).
  55. Venkatesh, S.V.; Anand, A.P.; Sahar, S.G.; Ramakrishnan, A.; Vijayaraghavan, V. Real-time Surveillance based Crime Detection for Edge Devices. In VISIGRAPP; (4: VISAPP); SciTePress: Setúbal, Portugal, 2020; pp. 801–809. Available online: https://www.academia.edu/download/92679187/89901.pdf (accessed on 1 May 2025).
Figure 1. The surveillance system for license plate restoration and recognition for police. The license plate shown in the figure follows the standard Chinese format: it begins with a capital letter representing the province and city code, followed by six additional alphanumeric characters, making a total of seven characters.
Figure 2. Low-resolution and high-resolution license plate image pairs extracted from real video for the self-built dataset. The Chinese license plate shown in the figure follows the standard Chinese format: it begins with a capital letter representing the province and city code, followed by six additional alphanumeric characters, making a total of seven characters.
Figure 3. Workflow of the license plate super-resolution (SR) model using SwinFIR. During training, features are extracted from the super-resolved (SR) and high-resolution (HR) images using a pretrained neural network, and the perceptual loss between them is used to optimize the SwinFIR model. During testing, the output SR images are fed into a pretrained CRNN-based OCR model to evaluate SR model performance based on character recognition accuracy.
Figure 4. License plate restoration with the SwinFIR SR model: training and validation flow. During training, features are extracted from the super-resolved (SR) and high-resolution (HR) images using a pretrained neural network, and the perceptual loss between them is used to optimize the SwinFIR model. During testing, the output SR images are fed into a pretrained CRNN-based OCR model to evaluate SR model performance based on character recognition accuracy.
Figure 5. Self-built dataset of high-resolution (HR, top) and low-resolution (LR, bottom) image pairs. The Chinese license plate shown in the figure follows the standard Chinese format: it begins with a capital letter representing the province and city code, followed by six additional alphanumeric characters, making a total of seven characters.
Figure 6. SwinFIR architecture and its sub-modules for global feature learning. (a) Residual Swin Transformer Block (RSTB) for deep feature extraction. (b) Swin Transformer Layer (STL) for global attention modeling. (c) Spatial Frequency Block (SFB) for frequency-aware feature enhancement. The input LR image and output SR image are Chinese license plates. The Chinese license plate shown in the figure follows the standard Chinese format: it begins with a capital letter representing the province and city code, followed by six additional alphanumeric characters, making a total of seven characters.
Figure 7. Training the SwinFIR super-resolution model to transform low-resolution images into super-resolution images.
Figure 8. Illustration of perceptual loss computation using a pretrained neural network. Features are extracted from both high-resolution (HR) and super-resolved (SR) images at multiple layers of the pretrained network. The perceptual loss is calculated based on the difference between HR and SR feature representations, and is used to guide the training of the super-resolution (SR) model for improved visual and recognition performance.
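As an illustration of this hook-based feature comparison, the following PyTorch-style sketch wraps an arbitrary frozen backbone and sums weighted L1 distances between SR and HR features. The layer names, equal default weights, and the choice of an L1 distance are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerceptualLoss(nn.Module):
    """Compare SR and HR features taken from selected layers of a frozen backbone."""

    def __init__(self, backbone: nn.Module, layer_names, weights=None):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)                      # the loss network stays frozen
        self.layer_names = list(layer_names)
        self.weights = list(weights) if weights else [1.0] * len(self.layer_names)
        self._feats = {}
        for name, module in self.backbone.named_modules():
            if name in self.layer_names:
                module.register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(_module, _inputs, output):
            self._feats[name] = output                   # cache this layer's output
        return hook

    def _extract(self, x):
        self._feats = {}
        self.backbone(x)
        return [self._feats[n] for n in self.layer_names]

    def forward(self, sr, hr):
        loss = 0.0
        for w, f_sr, f_hr in zip(self.weights, self._extract(sr), self._extract(hr)):
            loss = loss + w * F.l1_loss(f_sr, f_hr.detach())
        return loss
```

For example, passing torchvision's pretrained VGG-16 together with layer names such as "features.8" and "features.15" yields a conventional VGG-style perceptual loss.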
Figure 9. Architecture of the pretrained multi-task loss function combining pixel-wise MSE loss and perceptual loss. The super-resolution (SR) model is trained by minimizing the combined loss, where the MSE loss measures the pixel-level difference between the super-resolved (SR) and high-resolution (HR) images, and the perceptual loss measures the feature-level difference between SR and HR images using feature maps extracted from a pretrained neural network. This joint loss encourages the model to produce images that are both visually accurate and structurally consistent with the ground truth.
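A minimal sketch of this combined objective follows; the weighting factor lambda_perc is a placeholder hyper-parameter rather than a value tuned in this work, and perceptual_loss can be any feature-level criterion such as the PerceptualLoss sketched above.

```python
import torch.nn.functional as F


def multi_task_loss(sr, hr, perceptual_loss, lambda_perc=0.1):
    """Pixel-wise MSE plus a weighted perceptual term (lambda_perc is a placeholder)."""
    pixel_loss = F.mse_loss(sr, hr)        # pixel-level fidelity
    perc_loss = perceptual_loss(sr, hr)    # feature-level similarity
    return pixel_loss + lambda_perc * perc_loss
```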
Figure 10. DISTS loss, combining texture-similarity and structure-similarity differences, used as the perceptual loss for SR and HR image pairs.
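For reference, a simplified, self-contained approximation of the DISTS idea is sketched below: for one pair of feature maps, a texture term compares channel-wise means and a structure term compares spatial covariance. The official DISTS formulation additionally learns per-channel weights over VGG stages, which this sketch omits.

```python
import torch


def dists_like_loss(f_sr, f_hr, c1=1e-6, c2=1e-6):
    """Texture (mean) and structure (covariance) comparison for one pair of feature maps."""
    mu_sr = f_sr.mean(dim=(2, 3))
    mu_hr = f_hr.mean(dim=(2, 3))
    var_sr = f_sr.var(dim=(2, 3), unbiased=False)
    var_hr = f_hr.var(dim=(2, 3), unbiased=False)
    cov = ((f_sr - mu_sr[..., None, None]) *
           (f_hr - mu_hr[..., None, None])).mean(dim=(2, 3))

    texture = (2 * mu_sr * mu_hr + c1) / (mu_sr ** 2 + mu_hr ** 2 + c1)
    structure = (2 * cov + c2) / (var_sr + var_hr + c2)
    return 1.0 - 0.5 * (texture + structure).mean()   # 1 - similarity acts as the loss
```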
Figure 11. VGG loss with perceptual loss and style loss from the VGG-19 network.
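A compact sketch of this loss, assuming torchvision's pretrained VGG-19 with feature indices approximating conv1_2, conv2_2, conv3_4, conv4_4, and conv5_4, is given below. Inputs are assumed to be ImageNet-normalized RGB tensors; the equal term weights are illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

_LAYER_IDS = [2, 7, 16, 25, 34]   # approx. conv1_2, conv2_2, conv3_4, conv4_4, conv5_4


def gram(feat):
    """Gram matrix of a feature map, normalized by its size (for the style term)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)


class VGGLoss(torch.nn.Module):
    def __init__(self, style_weight=1.0):
        super().__init__()
        self.vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.style_weight = style_weight

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in _LAYER_IDS:
                feats.append(x)
            if i == _LAYER_IDS[-1]:
                break
        return feats

    def forward(self, sr, hr):
        perc, style = 0.0, 0.0
        for f_sr, f_hr in zip(self._features(sr), self._features(hr)):
            perc = perc + F.l1_loss(f_sr, f_hr)            # perceptual term
            style = style + F.l1_loss(gram(f_sr), gram(f_hr))  # style term
        return perc + self.style_weight * style
```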
Figure 12. Architecture of the proposed SwinT perceptual loss. Features are extracted from a pretrained Swin Transformer (96 channels, 4 stages with {2, 2, 6, 2} blocks) and weighted by {0.1, 0.1, 1, 1}. Perceptual and style features are combined to form key features, and the SwinT loss is defined as their difference between SR and HR images. This loss is backpropagated to optimize the SwinFIR super-resolution model.
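The sketch below illustrates the stage-weighted feature comparison using torchvision's pretrained swin_t. The stage indices inside the features container are an assumption about torchvision's layout, and the style (Gram) term from the figure is omitted for brevity; this is not the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F
from torchvision.models import swin_t, Swin_T_Weights


class SwinTLoss(torch.nn.Module):
    def __init__(self, stage_weights=(0.1, 0.1, 1.0, 1.0)):
        super().__init__()
        self.net = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1).eval()
        for p in self.net.parameters():
            p.requires_grad_(False)
        self.stage_weights = stage_weights
        self._feats = []
        # features[1], [3], [5], [7] are assumed to be the four Swin stages
        for idx in (1, 3, 5, 7):
            self.net.features[idx].register_forward_hook(
                lambda _m, _i, out: self._feats.append(out))

    def _extract(self, x):
        self._feats = []
        self.net(x)
        return list(self._feats)

    def forward(self, sr, hr):
        loss = 0.0
        for w, f_sr, f_hr in zip(self.stage_weights,
                                 self._extract(sr), self._extract(hr)):
            loss = loss + w * F.l1_loss(f_sr, f_hr)   # weighted per-stage difference
        return loss
```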
Figure 13. Architecture of the CRNN perceptual loss. Features are extracted from a pretrained CRNN using selected layers (e.g., {relu1_4, relu2_2, relu3_2, relu4_2}) with equal weights. The L1 difference between HR and SR feature maps defines the CRNN loss, which is backpropagated to optimize the SwinFIR model. The CRNN is pretrained on CCPD and CRPD license plate datasets.
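The following sketch shows the structure of such a loss. The small convolutional front end defined here is only a stand-in so the example runs; in practice the feature maps come from the CRNN pretrained on CCPD and CRPD, tapped at the layers listed above with equal weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCRNNBackbone(nn.Module):
    """Stand-in for the convolutional part of the pretrained CRNN (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)

    def forward(self, x):
        f1 = F.relu(self.conv1(x))
        f2 = F.relu(self.conv2(F.max_pool2d(f1, 2)))
        f3 = F.relu(self.conv3(F.max_pool2d(f2, 2)))
        return [f1, f2, f3]                      # tapped feature maps


def crnn_perceptual_loss(backbone, sr, hr):
    loss = 0.0
    for f_sr, f_hr in zip(backbone(sr), backbone(hr)):
        loss = loss + F.l1_loss(f_sr, f_hr)      # equal layer weights
    return loss
```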
Figure 14. (a) Validation curves in orange of the SwinT perceptual loss, showing both perceptual and style loss components over training epochs. The combined loss decreases steadily, indicating improved feature reconstruction quality during training. (b) SSIM (Structural Similarity Index Measure) trend in orange for SwinT on the validation set. SSIM increases throughout training and peaks around 5000 epochs, suggesting that the model achieves optimal structural fidelity to high-resolution images at this point.
Figure 15. Data distribution of the self-built Chinese license plate dataset.
Figure 16. The test flow: SwinFIR super-resolved images are fed to the pretrained OCR recognizer to measure license plate recognition accuracy.
Figure 17. OCR accuracy of SR images produced by the trained SwinFIR model versus OCR accuracy of the original LR license plate images, across SR models trained with different perceptual losses.
Figure 18. Super-resolution image comparison with different perceptual losses. The Chinese license plate shown in the figure follows the standard Chinese format: it begins with a capital letter representing the province and city code, followed by six additional alphanumeric characters, making a total of seven characters. Each license plate image includes its OCR recognition result, with incorrectly recognized characters highlighted in red.
Figure 19. License plate character positions, numbered 0 to 6 from left to right. The Chinese license plate shown in the figure follows the standard Chinese format: it begins with a capital letter representing the province and city code, followed by six additional alphanumeric characters, making a total of seven characters.
Figure 20. (a) Character recognition accuracy at position 0. (b) Character recognition accuracy at position 1. (c) Character recognition accuracy at position 2. (d) Character recognition accuracy at position 3. (e) Character recognition accuracy at position 4. (f) Character recognition accuracy at position 5. (g) Character recognition accuracy at position 6.
Figure 21. Examples from the synthetic dataset of Chinese license plate. The Chinese license plate shown in the figure follows the standard Chinese format: it begins with a capital letter representing the province and city code, followed by six additional alphanumeric characters, making a total of seven characters. The ‘HR’ column displays high-resolution images that include perspective distortion, HSV adjustments, blue backgrounds, noise, and stains. The ‘LR’ column shows the corresponding low-resolution input images. The columns labeled ‘SR img VGG loss’ and ‘SR img SwinT + DISTS loss’ present the super-resolved outputs generated by SwinFIR, trained using VGG loss and SwinT + DISTS perceptual losses, respectively. Each license plate image includes its OCR recognition result, with incorrectly recognized characters highlighted in red.
Table 1. Comparison of public license plate datasets.

| Dataset | Number of Images | Country | Distance Variability | Tilt Angle Variability | Blur | Illumination Variability | Resolution |
|---|---|---|---|---|---|---|---|
| LSV-LP [21] | 1,175,390 | China | High | High | Yes | Yes | 1920 × 1080 |
| CCPD [22] | 290,000 | China | Medium | Medium | Yes | Yes | 720 × 1160 |
| SSIG-SegPlates [23] | 2000 | Brazil | Low | Low | No | Limited | 1920 × 1080 |
| UFPR-ALPR [24] | 4500 | Brazil | Medium | Low | Yes | Yes | 1920 × 1080 |
| RodoSol [25] | 20,000 | Brazil | Medium | Medium | Yes | Yes | 1920 × 1080 |
Table 2. The perceptual loss calculated from different pretrained networks for SwinFIR model training.

| Network Architecture | Layers for Feature Extraction | Losses | Network Weights from Pretrained Dataset |
|---|---|---|---|
| CNN [36] | last layer | Multi-task MSE loss [26] (MSE + perceptual loss) | SSIG-ALPR [26] |
| VGG-16 [13] | {conv1_2, conv2_2, conv3_3, conv4_3, conv5_3} | DISTS loss [16] (structural loss + texture loss) | KADID-10k [37] |
| VGG-19 | {conv1_2, conv2_2, conv3_4, conv4_4, conv5_4} | VGG loss [17] (perceptual loss + style loss) | ImageNet-10k [38] |
| Swin Transformer [13] | {layers.0.blocks.1, layers.1.blocks.1, layers.2.blocks.5, layers.3.blocks.1} | SwinT loss [19] (proposed; perceptual loss + style loss) | ImageNet-1k [39] |
| CRNN [14] | {conv3_2, conv4_1, conv4_2} | CRNN loss [40] (proposed; perceptual loss) | CCPD [20] and CRPD [22] |
Table 3. OCR accuracy tier differentiation with references.

| Metric Type | Description | Implication |
|---|---|---|
| 7/7 Accuracy | Strictest: all characters correct | Best for law enforcement, tolling, and legal identification. |
| 6/7 Accuracy | Tolerates one character error | Good for real-time monitoring and alerting where minor errors are acceptable. |
| 5/7 Accuracy | Still useful in noisy environments | Moderate-level tasks such as fleet tracking and logistics, where identity can be inferred with tolerable errors. |
| Average Accuracy | General OCR efficiency (across all characters and samples) | Overall system health indicator, used for model selection and benchmarking. |
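The tiers above can be computed from predicted and ground-truth seven-character plate strings, as in the following sketch (an illustration, not our evaluation script):

```python
def tier_accuracies(predictions, ground_truths):
    """Compute 7/7, 6/7, 5/7 plate-level accuracies and the average character accuracy."""
    n = len(ground_truths)
    hits = {7: 0, 6: 0, 5: 0}
    correct_chars, total_chars = 0, 0
    for pred, gt in zip(predictions, ground_truths):
        matches = sum(p == g for p, g in zip(pred, gt))
        correct_chars += matches
        total_chars += len(gt)
        for k in hits:
            if matches >= k:                 # a plate counts toward every tier it satisfies
                hits[k] += 1
    return {
        "7/7": hits[7] / n,
        "6/7": hits[6] / n,
        "5/7": hits[5] / n,
        "average": correct_chars / total_chars,
    }


# e.g. tier_accuracies(["AB12345"], ["AB12346"]) satisfies the 6/7 and 5/7 tiers but not 7/7
```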
Table 4. License plate OCR recognition accuracy across SR methods trained with different perceptual losses.

| Super-Resolution Model | Loss from Pretrained Model | Architecture Type | Function/Strength | OCR Accuracy (7/7) | OCR Accuracy (6/7) | OCR Accuracy (5/7) |
|---|---|---|---|---|---|---|
| Pixel Shuffle (Repro.) [16] | Multi-Task MSE (single loss) | MSE + OCR perceptual loss | Multi-task for license plate detection, character segmentation, and OCR proposed by Nascimento et al. (2023) [16] | 3.85% | 38.46% | 67.95% |
| SwinFIR | Multi-Task MSE (single loss) | MSE + OCR perceptual loss | Multi-task proposed by Nascimento et al. (2023) [16] | 19.23% | 42.31% | 66.67% |
| SwinFIR | CRNN (single loss) | Recurrent neural network (RNN) | Excels at sequence learning, beneficial for character continuity | 25.64% | 61.54% | 82.05% |
| SwinFIR | SwinT (single loss) | Transformer (Vision Transformer) | Leverages global attention, ideal for holistic pattern enhancement | 28.21% | 61.54% | 83.33% |
| SwinFIR | DISTS (single loss) | CNN-based perceptual loss | Emphasizes visual similarity using deep features, enhancing human-perceived quality | 32.05% | 65.38% | 84.62% |
| SwinFIR | VGG (single loss) | CNN-based perceptual loss | Captures localized textures, aiding character shape restoration | 33.33% | 73.08% | 88.46% |
| SwinFIR | CRNN + VGG (ensemble loss) | CNN + RNN (ensemble) | Combines spatial feature extraction (VGG) with sequence modeling (CRNN); improves character-level consistency | 39.74% | 69.23% | 89.74% |
| SwinFIR | SwinT + DISTS (ensemble loss) | Transformer + CNN loss (ensemble) | Uses global attention (Swin Transformer) with deep perceptual similarity (DISTS); excels in semantic restoration | 47.44% | 75.64% | 89.74% |
Table 5. Effect of loss models on OCR performance in license plate SR.

| Loss Model for Feature Extraction | OCR Average Accuracy (R_i) | R_SR − R_LR | Avg. OCR Characters Recognized |
|---|---|---|---|
| LR | 75.57% | 0.00% | 5.29 |
| SwinFIR original loss: Charbonnier loss | 68.68% | −6.89% | 4.81 |
| Single perceptual loss: Multi-Task MSE | 70.14% | −5.43% | 4.91 |
| Single perceptual loss: CRNN | 78.14% | 2.57% | 5.47 |
| Single perceptual loss: SwinT | 79.57% | 4.00% | 5.57 |
| Single perceptual loss: DISTS | 80.57% | 5.00% | 5.64 |
| Single perceptual loss: VGG | 82.57% | 7.00% | 5.78 |
| Ensemble perceptual losses: Multi-Task MSE + SwinT | 80.14% | 4.57% | 5.61 |
| Ensemble perceptual losses: DISTS + VGG | 80.14% | 4.57% | 5.61 |
| Ensemble perceptual losses: Multi-Task MSE + CRNN | 81.86% | 6.29% | 5.73 |
| Ensemble perceptual losses: SwinT + VGG | 81.86% | 6.29% | 5.73 |
| Ensemble perceptual losses: CRNN + DISTS | 82.86% | 7.29% | 5.80 |
| Ensemble perceptual losses: SwinT + CRNN | 82.86% | 7.29% | 5.80 |
| Ensemble perceptual losses: CRNN + VGG | 83.43% | 7.86% | 5.84 |
| Ensemble perceptual losses: SwinT + DISTS | 85.14% | 9.57% | 5.96 |
| HR | 99.00% | 23.43% | 6.93 |
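As a quick check of the last column, the average number of correctly recognized characters is simply the average character accuracy multiplied by the seven characters per plate:

```python
# Worked check of the "Avg. OCR Characters Recognized" column in Table 5.
for name, acc in [("LR", 0.7557), ("VGG", 0.8257), ("SwinT + DISTS", 0.8514)]:
    print(f"{name}: {acc * 7:.2f} characters per plate")
# LR: 5.29, VGG: 5.78, SwinT + DISTS: 5.96 (matching the table)
```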
Table 6. OCR performance comparison of HR, LR, SR with VGG loss, and SR with SwinT + DISTS loss on 59 test license plate images.

| | GT | HR | LR | SR (VGG loss) | SR (SwinT + DISTS loss) |
|---|---|---|---|---|---|
| Correctly recognized characters (total) | 413 | 389 | 300 | 363 | 364 |
| Correctly recognized characters per license plate | 7 | 6.59 | 5.08 | 6.15 | 6.17 |
| OCR average accuracy (%) | - | 94.19% | 72.64% | 87.89% | 88.14% |
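The per-plate and percentage figures in Table 6 follow directly from the character counts over the 59 test plates (59 × 7 = 413 ground-truth characters), as the short check below confirms:

```python
# Worked check of Table 6 for the 59 test plates (7 characters each).
total_chars = 59 * 7                         # 413 ground-truth characters
for name, correct in [("HR", 389), ("LR", 300),
                      ("SR, VGG loss", 363), ("SR, SwinT + DISTS loss", 364)]:
    print(f"{name}: {correct / 59:.2f} chars/plate, {correct / total_chars:.2%} accuracy")
# HR: 6.59, 94.19%; LR: 5.08, 72.64%; VGG: 6.15, 87.89%; SwinT + DISTS: 6.17, 88.14%
```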