4.1. Dataset and Experimental Settings
This section describes the datasets, experimental environments (hardware and software), and evaluation metrics used to assess the proposed Korean license plate recognition system. All experiments are conducted using a two-stage pipeline in which license plate detection and recognition are explicitly separated.
Performance is evaluated not only in terms of recognition accuracy but also with respect to processing speed, in order to assess the practical applicability of each configuration in real-world ALPR deployments.
4.1.1. Datasets
To evaluate recognition performance under diverse imaging conditions and license plate structures, three datasets are used in this study: UFPR-ALPR [24], RodoSol-ALPR [25], and the AI-HUB Vehicle and License Plate Recognition Dataset [26]. These datasets differ in acquisition environments, camera configurations, and license plate formats, enabling assessment of both cross-domain generalization and Korean-specific recognition characteristics.
UFPR-ALPR Dataset
UFPR-ALPR is a public ALPR dataset collected in real driving scenarios where both vehicles and cameras are in motion, resulting in realistic capture conditions. The dataset contains 4500 images from 150 vehicles, with annotations covering more than 30,000 license plate characters. Images are captured using three cameras (GoPro Hero4 Silver, Huawei P9 Lite, and iPhone 7 Plus) and provided in PNG format. Each camera contributes 1500 images, allowing analysis of domain shifts caused by camera-specific characteristics.
The dataset includes gray-license-plate cars (900 images), red-license-plate cars (300 images), and gray-license-plate motorcycles (300 images). Data are split into training, validation, and test sets with ratios of 40%, 20%, and 40%, respectively. Motion blur, illumination variation, and viewpoint changes make UFPR-ALPR a challenging benchmark, particularly for OCR under degraded conditions.
RodoSol-ALPR Dataset
RodoSol-ALPR is a large-scale dataset collected using fixed cameras at toll booths along the ES-060 highway in Espírito Santo, Brazil. It consists of 20,000 images covering multiple vehicle types, including cars, motorcycles, buses, and trucks.
All images include both the Brazilian legacy and Mercosur license plate standards. The dataset is split into training, validation, and test sets at ratios of 40%, 20%, and 40%, respectively, with annotations provided as the four corner coordinates of each license plate.
In this study, RodoSol-ALPR is used exclusively as a cross-domain generalization benchmark. Given its Latin-based alphanumeric license plate structure, it is not used for Korean-specific error analysis, which is instead conducted primarily on the AI-HUB dataset.
AI-HUB Vehicle and License Plate Recognition Dataset
The AI-HUB dataset is a large-scale video-based dataset collected from real-world CCTV environments in South Korea. It comprises over 500 h of video captured at 75 locations and includes approximately 500,000 vehicle images and 100,000 license plate images with JSON-based annotations.
This dataset reflects characteristics unique to Korean license plates, including mixed digit–Korean–digit structures, multiple layout variants, color-based categories, and reflective distortions caused by plate materials and illumination conditions. These properties make the dataset particularly suitable for evaluating Korean ALPR systems.
In this study, the AI-HUB dataset serves as the primary benchmark for analyzing Korean license plate–specific recognition characteristics, such as prefix handling, structural consistency, and region-dependent layout variations. Accordingly, all Korean-specific error analyses reported in this work are based primarily on experiments conducted using the AI-HUB dataset.
4.1.2. Hardware and Software Environment
Experiments are conducted in both server-class GPU environments and embedded device environments to evaluate performance across diverse deployment scenarios. The server setup uses an NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA), while embedded platforms include the Jetson Nano (NVIDIA Corporation, Santa Clara, CA, USA) and Raspberry Pi 5 (Raspberry Pi Ltd., Cambridge, UK).
License plate detection models are implemented using Ultralytics YOLOv12 within Docker-based environments. Recognition models are trained and evaluated in Docker containers based on PyTorch 2.1.0 (Meta AI, Menlo Park, CA, USA).
Software versions and inference backends vary across platforms to reflect realistic deployment conditions in practical ALPR systems.
4.1.3. Evaluation Metrics
Performance is evaluated separately for the detection and recognition stages. In addition to accuracy-related metrics, processing speed measured in frames per second (FPS) is reported to assess computational efficiency.
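As a concrete illustration, per-model FPS can be estimated with a simple wall-clock harness; `infer` below stands in for any detection or recognition call, and the function name and warm-up count are our own choices rather than details from the paper:

```python
import time

def measure_fps(infer, frames, warmup=5):
    """Average frames per second of `infer` over `frames`, after a short
    warm-up so one-time initialization does not skew the estimate."""
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```

Measuring with a batch size of one, as here, matches the per-frame processing pattern of a live ALPR pipeline more closely than batched throughput benchmarks.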
Detection Metrics
Detection performance is evaluated using precision, recall, and F1-score. A detection is considered correct if the Intersection over Union (IoU) between the predicted and ground-truth bounding boxes exceeds a predefined threshold. FPS is reported to characterize detection efficiency under different hardware environments.
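For reference, a minimal sketch of the IoU check and the precision/recall/F1 computation follows; the corner-coordinate box format and the counting of matches into TP/FP/FN are assumptions on our part, not the paper's exact evaluation code:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false
    negative counts, guarding against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

A detection whose `iou` with a ground-truth box exceeds the chosen threshold (commonly 0.5) counts as a true positive; unmatched predictions and unmatched ground-truth boxes count as false positives and false negatives, respectively.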
Recognition Metrics
Recognition performance is evaluated using edit distance (ED), character accuracy (CA), and character error rate (CER). Recognition FPS is also reported to assess the real-time feasibility of each OCR model.
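One common way to compute these metrics is via the Levenshtein distance; the sketch below assumes CER is the edit distance normalized by ground-truth length and CA is 1 − CER, which may differ from the authors' exact normalization:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(gt, pred):
    """Character error rate: edit distance over ground-truth length."""
    return edit_distance(gt, pred) / max(len(gt), 1)

def char_accuracy(gt, pred):
    """Character accuracy under the convention CA = 1 - CER."""
    return 1.0 - cer(gt, pred)
```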
End-to-end latency is discussed at a conceptual level by decomposing the ALPR pipeline into detection, ROI cropping and normalization, tracking, OCR inference, and post-processing stages. In this study, detection and recognition stages are evaluated independently to enable controlled comparison of OCR architectures under identical detection results. Consequently, direct end-to-end latency measurements of a fully integrated runtime pipeline are not reported, and this is treated as a limitation of the experimental scope.
4.3. License Plate Detection Results
This subsection evaluates license plate detection performance using nine YOLOv12-based models across different datasets and hardware platforms.
4.3.1. Detection Performance on RTX 2080 Ti
Detection performance is first evaluated using the original PyTorch models on an RTX 2080 Ti GPU. As shown in Figure 5, models trained on the RodoSol-ALPR and AI-HUB datasets (train18, train20, train21, train23, train24, and train26) achieve consistently high detection accuracy, with precision, recall, and F1-score values exceeding 0.998 across all configurations. The minimal gap between precision and recall indicates stable detection behavior with few false positives or missed detections.
In contrast, models trained on the UFPR-ALPR dataset (train19, train22, and train25) exhibit lower detection performance, with F1-scores ranging from approximately 0.89 to 0.92. This degradation can be attributed to the challenging characteristics of the UFPR dataset, including large viewpoint variations, illumination changes, and motion blur, which aligns with observations reported in previous studies.
While detection accuracy varies only marginally across model sizes, inference speed differs substantially. YOLOv12-n-based models (train18–train20) achieve real-time performance exceeding 60 FPS, whereas YOLOv12-s and YOLOv12-m models (train21–train26) show significantly lower speeds of approximately 6–7 FPS. Based on this trade-off, YOLOv12-n-based models (train18, train19, and train20) are selected as the license plate detectors in the proposed system, as they offer a favorable balance between detection accuracy and computational efficiency.
4.3.2. TensorRT-Based Optimization
Figure 6 illustrates the performance changes observed when the detection models are converted to TensorRT engines with different precision configurations. On the RTX 2080 Ti, FP16 TensorRT inference maintains precision, recall, and F1-score values comparable to those of the original PyTorch models, while substantially improving inference speed. For instance, models trained on the AI-HUB dataset (train20, train23, and train26) achieve FPS values of up to approximately 190, satisfying real-time deployment requirements.
In contrast, INT8 quantization leads to notable drops in precision for several models and datasets, including train18, train21, and train24. This degradation indicates that aggressive quantization can adversely affect boundary-sensitive tasks such as license plate detection. These results suggest that while FP16 TensorRT optimization provides a reliable speed–accuracy trade-off, the use of INT8 quantization requires careful calibration and dataset-specific validation rather than indiscriminate application.
4.3.3. TensorFlow Lite Inference Performance
TensorFlow Lite-based inference is evaluated using FP16 (TFLite16) and FP32 (TFLite32) configurations with different runtime options, as shown in Figure 7. On the RTX 2080 Ti, TensorFlow Lite models maintain precision, recall, and F1-score values comparable to those of the original PyTorch models, indicating that detection accuracy is largely preserved.
However, inference speed achieved with TensorFlow Lite is consistently lower than that of TensorRT-based inference. In addition, FPS degradation becomes more pronounced as model size increases, reflecting structural limitations in fully exploiting GPU acceleration. These results suggest that while TensorFlow Lite is effective for maintaining detection accuracy, it is less suitable than TensorRT for high-throughput GPU-based deployment.
4.3.4. Detection Performance on Jetson Nano
Detection performance on the Jetson Nano platform is evaluated using TensorRT-optimized models that exhibit high accuracy on the RTX 2080 Ti. As shown in Figure 8, precision, recall, and F1-score remain largely consistent across different TensorRT configurations (FP16, FP16 + INT8, INT8, and original), with variations limited to the third decimal place. This indicates that detection accuracy is dominated by dataset characteristics rather than hardware-level optimization.
Inference speed varies across configurations, with FP16 generally achieving the highest FPS on Jetson Nano. However, increased throughput does not lead to noticeable improvements in detection accuracy, confirming a speed–accuracy trade-off under embedded constraints. Notably, for the UFPR dataset, F1-scores remain relatively low regardless of FPS, further emphasizing the dominant impact of dataset difficulty on detection performance.
4.3.5. Detection Performance on Raspberry Pi 5
Detection performance on the Raspberry Pi 5 platform is evaluated using TensorFlow Lite-based inference, as shown in Figure 9. Precision, recall, and F1-score remain nearly identical across the FP16 and original configurations, indicating that inference option changes have minimal impact on detection accuracy. In contrast, inference speed is slightly higher under the original configuration, suggesting a modest advantage in throughput.
Across datasets, models evaluated on the RodoSol-ALPR and AI-HUB datasets maintain highly stable detection performance with F1-scores exceeding 0.998, whereas the UFPR-ALPR dataset yields lower F1-scores in the range of approximately 0.88–0.92. This trend is consistent with results observed on the GPU and Jetson Nano platforms, reinforcing that dataset difficulty is the dominant factor influencing detection performance across hardware environments.
Overall, in low-power embedded settings such as the Raspberry Pi 5, inference configuration selection should prioritize processing speed over marginal accuracy differences. Based on the experimental results, the original configuration provides the most practical balance between detection accuracy and FPS for real-world deployment.
4.4. License Plate Recognition Results
This section presents a quantitative evaluation of license plate recognition (OCR) performance following the detection stage. Two OCR architectures are evaluated: (1) a CNN + Attention-LSTM-based model (Model 1, v4) and (2) a MobileNetV3 + Transformer decoder-based model (Model 2, v7). Each model is trained on three datasets (AI-HUB, RodoSol-ALPR, and UFPR-ALPR) and evaluated on a high-performance GPU (RTX 2080 Ti) as well as embedded platforms. Performance is measured using edit distance (ED), character accuracy (CA), character error rate (CER), and frames per second (FPS).
4.4.1. License Plate Recognition Performance on RTX 2080 Ti
CNN + Attention-LSTM-Based OCR (Model 1, v4)
Table 4 summarizes the recognition performance of Model 1 on the RTX 2080 Ti. The model trained on the AI-HUB dataset (exp000) achieves the best overall performance, with an average ED of 0.0583, CA of 99.12%, and CER of 0.76%, while maintaining a high inference speed exceeding 200 FPS. These results demonstrate the effectiveness of large-scale, domain-specific data for Korean license plate recognition.
The model trained on the RodoSol-ALPR dataset (exp001) shows slightly reduced accuracy compared to AI-HUB, achieving a CA of 96.79% and CER of 3.21%, while maintaining an inference speed of approximately 170 FPS. This performance degradation is attributed to increased variability in imaging conditions and license plate appearance.
In contrast, the model trained on the UFPR-ALPR dataset (exp002) exhibits severely degraded recognition accuracy, with a CA of 24.18% and CER of 75.42%. Despite high FPS, the recognition performance is insufficient for practical use. This consistent degradation across runs indicates a dataset-imposed performance ceiling caused by extreme perspective distortion, small character sizes, and low effective resolution, rather than unstable optimization or stochastic training effects.
The large FPS gap observed between Model 1 and Model 2 on the RTX 2080 Ti is primarily attributed to differences in decoder behavior. Although Transformer decoders enable parallel attention computation, autoregressive decoding with a batch size of one still requires sequential token generation. In contrast, the simpler Attention-LSTM decoder incurs lower per-step overhead, resulting in substantially higher practical throughput.
4.4.2. License Plate Recognition Performance on Jetson Nano
Jetson Nano represents a resource-constrained embedded environment in which processing speed becomes a critical factor for real-time OCR deployment.
Table 6 summarizes the recognition performance of both OCR models on this platform.
Under this setting, Model 1 satisfies real-time requirements. The AI-HUB–trained model (exp000) achieves a CA of 99.10%, a CER of 0.77%, and an inference speed of approximately 22 FPS. The RodoSol-trained model (exp001) records the highest throughput, exceeding 23 FPS, while maintaining sufficient accuracy for alphanumeric license plate recognition. In contrast, although the UFPR-trained model (exp002) attains comparable FPS, its recognition accuracy remains extremely low, indicating that dataset difficulty dominates performance under severe imaging and tracking conditions.
Model 2 exhibits substantial performance degradation on Jetson Nano. As shown in Table 6b, inference speed drops to below 5 FPS across all datasets, with the UFPR-trained model falling below 2 FPS. These results indicate that, under the current autoregressive decoding setup, the Transformer-based v7 model does not meet real-time processing requirements on low-power embedded platforms. Consequently, its practical applicability in such environments is limited unless additional optimization strategies, such as decoder simplification, hardware-specific acceleration, or non-autoregressive inference, are employed.
4.4.3. License Plate Recognition Performance on Raspberry Pi 5
Raspberry Pi 5 provides slightly higher computational capability than Jetson Nano but lacks dedicated GPU acceleration.
Table 7 summarizes the recognition performance of both OCR models on this platform.
Model 1 continues to exhibit stable recognition performance. As shown in Table 7a, the AI-HUB–trained model (exp000) achieves a CA of 99.10% with an inference speed of approximately 16 FPS, which is sufficient for real-time OCR in low-power environments. Similar trends are observed on the RodoSol dataset, while recognition accuracy on the UFPR dataset remains consistently low, reaffirming that dataset difficulty dominates performance under severely degraded imaging conditions.
Model 2 again demonstrates clear throughput limitations on Raspberry Pi 5. According to Table 7b, the AI-HUB–trained model achieves approximately 5 FPS, while the RodoSol-trained model reaches around 9 FPS, approaching but not consistently meeting real-time requirements. Performance on the UFPR dataset remains inadequate in both accuracy and speed. These results confirm that Transformer-based decoders impose substantial computational overhead on embedded platforms without GPU acceleration, limiting their practical applicability under real-time constraints.
4.4.4. Overall Analysis and Discussion
Although FPS values for detection and recognition modules are reported individually, they do not directly translate into end-to-end system throughput. In practical ALPR pipelines, overall latency is determined by the slowest module due to pipeline synchronization, memory transfers, and sequential execution constraints.
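This distinction can be made concrete with per-stage FPS figures. The sketch below contrasts strictly sequential execution, where per-frame latencies add, with an idealized fully pipelined execution, where the slowest stage bounds throughput; the stage values in the usage note are illustrative, not measured end-to-end results from this study:

```python
def sequential_fps(stage_fps):
    """Strictly sequential pipeline: per-frame latency is the sum of
    per-stage latencies, so combined FPS is their harmonic combination."""
    return 1.0 / sum(1.0 / f for f in stage_fps)

def pipelined_fps(stage_fps):
    """Idealized fully overlapped pipeline: throughput is bounded by
    the slowest stage."""
    return min(stage_fps)
```

For example, combining a 60 FPS detector with a 22 FPS OCR stage gives `sequential_fps([60, 22])` of roughly 16 FPS, while even perfect pipelining cannot exceed 22 FPS; real systems with synchronization and memory-transfer overheads fall between these bounds or below them.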
From a computational perspective, this observation is consistent with the model complexity analysis in Section 3.5. Despite having fewer parameters, the Transformer-based OCR model incurs higher FLOPs and memory usage, resulting in reduced practical efficiency. This discrepancy highlights that parameter count alone is an insufficient indicator of real-time suitability in deployment-oriented scenarios.
Results on the UFPR-ALPR dataset further clarify conditions under which OCR architecture choice becomes largely ineffective. When (1) the detected license plate region contains an insufficient number of effective pixels per character, (2) severe perspective distortion disrupts spatial alignment, and (3) tracking-based ROI propagation introduces cumulative spatial drift, both LSTM-based and Transformer-based decoders fail to recover reliable character sequences. Under such conditions, upstream feature degradation dominates overall performance, rendering decoder-level optimization insufficient.
Finally, the low variance observed across repeated runs indicates narrow confidence intervals for the reported mean recognition accuracies. As a result, the relative performance differences discussed in this section remain statistically meaningful, even without formal hypothesis testing.
4.5. Controlled Ablation Results with a Fixed ResNet-18 Backbone
Table 8 summarizes the OCR performance obtained under the controlled ablation setting, where the CNN backbone is fixed to ResNet-18 and only the sequence decoder is varied. For each dataset, the best-performing configuration among multiple training runs is reported.
All ablation configurations are trained three times using different random seeds. Across all datasets, the standard deviation of CA remains below ±0.5%, indicating that the observed performance trends are robust and not sensitive to stochastic training effects.
On the AI-HUB dataset, recognition accuracy is substantially lower than that of the primary OCR configurations, highlighting the importance of high-quality feature representations for Korean license plate recognition. In contrast, on the RodoSol-ALPR dataset, the ablation model achieves performance comparable to the main Attention-LSTM-based model, suggesting that decoder choice has a limited impact under relatively stable imaging conditions.
For the UFPR-ALPR dataset, recognition accuracy remains low despite the fixed backbone, confirming that performance is dominated by tracking-induced ROI instability and severe image degradation. These results demonstrate that in tracking-based scenarios, recognition failures are driven primarily by upstream errors in detection and ROI propagation, rather than by differences in sequence decoder architecture.
It should be noted that the term “tracking-induced ROI instability” used in this analysis does not refer to artifacts of a specific tracking algorithm. Rather, it denotes a general class of errors arising from sequential ROI propagation, including spatial jitter, gradual drift, and frozen region effects, which commonly occur in practical ALPR pipelines that rely on frame-to-frame ROI association.
Implications of the Controlled Ablation Study
The controlled ablation results provide critical context for interpreting the main comparative findings. When the CNN backbone is fixed, recognition performance still varies substantially across datasets, indicating that observed differences cannot be attributed to sequence decoder choice alone. Notably, severe performance degradation on tracking-based datasets persists regardless of decoder design, underscoring the dominant influence of ROI stability and detection quality.
The substantial performance drop observed on the AI-HUB dataset under the controlled ablation setting should not be interpreted as evidence of bias or inadequacy in the sequence decoder itself. Rather, this degradation highlights the strong sensitivity of Korean license plate recognition to feature representation quality. By constraining the backbone to a fixed ResNet-18 configuration, the ablation intentionally limits feature capacity, thereby exposing an upper-bound proxy for decoder performance under suboptimal feature conditions.
In this sense, the controlled ablation is designed to evaluate decoder robustness under shared and restricted representational constraints, rather than to reflect optimal end-to-end OCR performance. The results therefore indicate that decoder-level improvements alone cannot compensate for insufficient feature quality, particularly in complex, large-scale datasets such as AI-HUB.
These findings indicate that the benefits of advanced sequence modeling, such as Transformer-based decoding, are conditional on reliable feature extraction and stable ROI generation. As a result, improving OCR robustness in real-world traffic systems requires joint optimization of detection, tracking, and recognition stages rather than isolated refinement of the OCR decoder.
Importantly, the UFPR-ALPR dataset represents a class of realistic deployment scenarios characterized by mobile cameras, long-range capture, and tracking-based recognition. In such conditions, system-level degradation overwhelms decoder-level differences. Therefore, the limited differentiation between OCR architectures under UFPR conditions should be interpreted as evidence that recognition robustness must be addressed upstream, particularly at the detection and tracking stages.
4.6. Error Analysis of License Plate Recognition
Building on the failure patterns identified in the preceding analysis, this section presents a quantitative breakdown of error types to substantiate the observed qualitative trends and to clarify their dataset-dependent behavior. In this context, “tracking-induced” is used as a system-level descriptor of ROI propagation behavior over time, rather than as an attribution to any particular tracking implementation. Similar degradation patterns have been observed across different datasets and OCR architectures, suggesting that the identified failure modes are inherent to sequential ROI processing under unstable imaging conditions.
4.6.1. Quantitative Analysis of Tracking-Induced Failures
The error analysis is conducted with explicit consideration of dataset-specific characteristics. Korean license plate–specific error categories, including prefix confusion and structural misalignment in mixed Korean–digit strings, are analyzed primarily using the AI-HUB dataset, which reflects real-world domestic traffic conditions.
Results from the RodoSol-ALPR dataset are treated separately as a cross-domain generalization case for Latin-based license plates under relatively stable imaging conditions. In contrast, the UFPR-ALPR dataset is used to analyze tracking-induced failure modes that arise from severe motion, perspective distortion, and ROI propagation errors, which are largely independent of language or plate format.
As shown in Table 9, the character error rate increases monotonically with increasing ROI instability. This relationship is consistently observed across both v4 and v7 models, indicating that tracking-induced ROI degradation constitutes a dominant performance bottleneck that cannot be resolved through decoder architecture alone.
Beyond general accuracy degradation, tracking-based datasets exhibit a distinct failure mode characterized by repeated identical OCR predictions across consecutive frames. This behavior reflects frozen or misaligned ROIs, in which the recognition model repeatedly processes an incorrect or nearly static input region, leading to persistent recognition errors.
Table 10 shows that repeated identical predictions occur with high frequency for both v4 and v7 models. This pattern indicates that a large portion of recognition failures originates from upstream ROI propagation errors rather than intrinsic limitations of the OCR architectures. The similar failure frequencies observed across models further support the conclusion that tracking noise dominates recognition performance in sequential datasets.
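Frozen-ROI failures of this kind can be flagged by counting runs of identical consecutive predictions; the sketch below is our own illustration, and the minimum run-length threshold is an assumption rather than the criterion used in the paper:

```python
from itertools import groupby

def frozen_runs(predictions, min_run=3):
    """Return (predicted_string, run_length) for every run of identical
    consecutive OCR outputs at least `min_run` frames long."""
    runs = [(value, len(list(group))) for value, group in groupby(predictions)]
    return [(value, n) for value, n in runs if n >= min_run]
```

Runs of identical wrong predictions across many frames point to a frozen or drifting ROI upstream, whereas isolated single-frame errors are more likely decoder-level confusions.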
4.6.2. Definition of Error Types
This subsection defines representative recognition error categories observed in the license plate recognition experiments. By analyzing typical failure cases, we examine how error patterns vary according to dataset characteristics and OCR model structures (v4 and v7). Error definitions are shared across all datasets (exp000, exp001, and exp002), while their frequencies and manifestations differ by dataset, as illustrated in Figure 10.
To enable systematic analysis, recognition errors are categorized into six types. Korean character confusion refers to misrecognition between visually similar Korean characters (e.g., “deo” vs. “meo”, “du” vs. “nu”, “ju” vs. “su”, “ja” vs. “cha”, “ra” vs. “gu”). These errors typically occur under low-resolution or blurred conditions, where fine-grained stroke information is lost during feature extraction.
Alphabet confusion denotes misrecognition among uppercase English letters with similar visual shapes, such as “E” vs. “F”, “Q” vs. “G”, “Z” vs. “S”, “H” vs. “S”, and “K” vs. “Y”. These errors are more common in license plates with low contrast or imbalanced character distributions.
Digit confusion arises from visually similar digits (e.g., “9” vs. “0”, “1” vs. “I”, “6” vs. “8”, “4” vs. “1”) and is often caused by font variations, strong reflections, or motion blur.
Prefix errors refer to omission, substitution, or over-recognition of regional prefixes (e.g., Gyeonggi, Incheon). For example, “37ba5016” may be misrecognized as “Gyeonggi37ba50”, or “GyeonggiBucheonja0542” as “GyeonggiBucheoncha0542”. Such errors are primarily attributed to variable string lengths and misalignment between detection and recognition stages.
Structural errors represent severe failures in which the overall string structure collapses (e.g., “MLS5511” recognized as “AZH5611”). These errors occur frequently in tracking-based datasets due to accumulated ROI misalignment or incorrect frame propagation.
Finally, annotation errors correspond to cases where the ground-truth labels are incorrect. These are treated as dataset quality issues rather than model limitations and are analyzed separately from recognition performance.
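The taxonomy above can be approximated programmatically. The heuristic below is our own rough sketch: the 0.5 substitution-ratio threshold and the length-based prefix test are assumptions, and annotation errors are excluded since identifying them requires auditing the labels themselves:

```python
def classify_error(gt, pred):
    """Assign a ground-truth/prediction pair to a rough error category
    using simple string heuristics; returns 'correct' on exact match."""
    if gt == pred:
        return "correct"
    if len(gt) != len(pred):
        # length changes typically stem from prefix omission/over-recognition
        return "prefix_error"
    subs = [(g, p) for g, p in zip(gt, pred) if g != p]
    if len(subs) / len(gt) > 0.5:
        return "structural_error"   # most positions wrong: structure collapsed
    g, _ = subs[0]
    if g.isdigit():
        return "digit_confusion"
    if g.isascii() and g.isalpha():
        return "alphabet_confusion"
    return "korean_confusion"       # non-ASCII character, e.g. Hangul
```

For example, the structural failure "MLS5511" → "AZH5611" from the text trips the substitution-ratio test, while the prefix example "37ba5016" → "Gyeonggi37ba50" is caught by the length check.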
4.6.3. Quantitative Breakdown of Error Types
Table 11 presents a quantitative summary of recognition error distributions across datasets and OCR models. The results confirm that the failure patterns illustrated in Figure 10 reflect consistent dataset-dependent trends rather than isolated examples.
On the AI-HUB dataset, recognition errors are dominated by Korean character confusions. This tendency is more pronounced for the v4 model, while the v7 model exhibits a reduced proportion of Korean-specific errors, indicating improved context modeling for mixed Korean–digit sequences. Other error types, including digit and structural errors, occur with relatively low frequency.
For the RodoSol-ALPR dataset, errors are primarily attributed to alphabet and digit confusions, which is consistent with its Latin-based alphanumeric license plate structure. Structural errors remain infrequent for both models, reflecting the stable imaging conditions and limited reliance on long-term tracking.
In contrast, the UFPR-ALPR dataset exhibits a fundamentally different error profile. For both v4 and v7 models, structural errors overwhelmingly dominate, while character-level confusions occur infrequently. This indicates that recognition failures on UFPR are driven mainly by tracking-induced ROI degradation and accumulated spatial misalignment, rather than by limitations of the OCR decoder architecture.
It should be emphasized that Korean character error categories are only applicable to datasets containing Korean characters, namely AI-HUB. Accordingly, results from the RodoSol-ALPR dataset are interpreted exclusively as a cross-domain generalization case for Latin-based license plates and are not used to infer Korean-specific recognition behavior.
4.6.4. Error Characteristics of the CNN + Attention-LSTM-Based Model (v4)
The CNN + Attention-LSTM-based OCR model (v4) exhibits distinct error characteristics depending on dataset properties. On the AI-HUB dataset (exp000), Korean character confusion is the dominant failure mode. Errors frequently occur between characters with similar stroke-level structures, resulting in partial misrecognition while preserving the overall string layout. This behavior suggests that v4 maintains stable global sequence modeling but remains sensitive to fine-grained visual ambiguities in Korean characters.
On the RodoSol-ALPR dataset (exp001), error patterns shift toward alphabet and digit confusions. Misrecognitions typically involve visually similar Latin characters or digits, reflecting limited generalization when transitioning from Korean-centered training distributions to purely alphanumeric license plate formats.
In contrast, the UFPR-ALPR dataset (exp002) is dominated by structural errors. Because UFPR consists of tracking-based sequential frames, accumulated ROI misalignment and temporal drift significantly degrade recognition quality. Under these conditions, the v4 model fails to preserve the overall string structure, indicating high sensitivity to tracking-induced noise despite its stability in single-frame recognition scenarios.
4.6.5. Error Characteristics of the MobileNetV3 + Transformer Decoder-Based Model (v7)
The MobileNetV3 + Transformer decoder-based OCR model (v7) generally exhibits improved global structural consistency compared to the Attention-LSTM-based v4 model, although dataset-dependent error patterns remain evident. On the AI-HUB dataset (exp000), the frequency of Korean character confusion is reduced relative to v4, indicating that Transformer-based sequence modeling more effectively captures long-range contextual dependencies. In particular, prefix-related errors are substantially mitigated, suggesting stronger preservation of overall license plate structure.
On the RodoSol-ALPR dataset (exp001), the v7 model largely maintains correct string-level structure, with most errors confined to fine-grained alphabet or digit confusions. These results indicate that while global context modeling is effective, discriminating between visually similar characters remains challenging under purely visual ambiguity.
In contrast, on the UFPR-ALPR dataset (exp002), severe structural failures persist and closely resemble those observed in v4. Repeated identical predictions across consecutive frames are also frequently observed. As these failure modes appear with comparable frequency in both architectures, they are attributed primarily to tracking-induced ROI instability rather than limitations of the Transformer-based decoder itself.
4.6.6. Comparative Discussion: v4 vs. v7
A direct comparison between the two OCR architectures highlights clear but conditional performance differences. The v7 model demonstrates improved recognition of Korean characters and more reliable handling of regional prefixes compared to v4. These gains are attributed to the Transformer decoder’s ability to model longer-range dependencies and preserve global string structure, thereby reducing the likelihood that local misrecognitions escalate into full structural collapse.
In contrast, digit recognition stability shows no substantial difference between the two models. Moreover, both v4 and v7 exhibit severe vulnerability to structural errors on tracking-based datasets such as exp002, where accumulated ROI instability dominates recognition performance. Under these conditions, architectural differences in sequence modeling provide limited benefit.
Therefore, improvements observed on the AI-HUB dataset should be interpreted as evidence of enhanced modeling of Korean-specific linguistic and structural characteristics, rather than a universal advantage of Transformer-based decoding. Conversely, results on the RodoSol-ALPR dataset primarily reflect the generalization capability of OCR architectures to Latin-based license plate formats under relatively stable imaging conditions.
4.6.7. Summary
In summary, the v7 model provides measurable improvements over v4 in Korean character recognition and regional prefix handling. These gains confirm the effectiveness of Transformer-based decoding in modeling Korean-specific structural dependencies. However, both models remain highly susceptible to structural collapse in tracking-based datasets, where recognition failures are dominated by accumulated errors from detection and ROI propagation rather than decoder design.
The quantitative error analysis further shows that Transformer-based decoding primarily reduces Korean character and prefix-related errors, while having limited impact on structural failures induced by unstable tracking. These findings indicate that overall license plate recognition performance is governed by system-level robustness rather than OCR architecture alone. Consequently, future research should prioritize integrated optimization across detection, tracking, and recognition stages instead of isolated improvements to OCR models.
4.7. Mitigation Experiment for Tracking-Induced Errors
To examine whether simple temporal strategies can mitigate tracking-induced OCR failures, a lightweight mitigation experiment based on temporal majority voting is conducted. For each tracked license plate sequence, OCR predictions within a fixed temporal window of consecutive frames are aggregated, and the most frequently predicted character at each position is selected as the final output. The window size was chosen empirically as the smallest value that consistently yielded stable accuracy improvements across tracking-based datasets while avoiding excessive temporal smoothing and additional latency. This approach introduces negligible computational overhead and requires no modification to the OCR architecture.
More broadly, the observed tracking-induced failures should be interpreted as consequences of sequential ROI propagation under noisy localization conditions, rather than as deficiencies of a particular tracking algorithm. This interpretation is consistent with the comparable trends observed across different OCR architectures and mitigation strategies.
From a temporal perspective, majority voting over a window of N frames introduces an inherent delay of approximately ⌊N/2⌋ frames, which corresponds to two frames for the configuration used in this study. For this reason, temporal voting is evaluated as an optional post-processing step rather than as a mandatory component of a real-time ALPR pipeline.
All mitigation experiments in this section are conducted under an offline evaluation setting, with the goal of analyzing the relative robustness trends of different OCR architectures under tracking-based conditions. End-to-end latency optimization and strict real-time constraints are therefore outside the scope of this analysis and are treated as system-level design considerations.
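The per-position voting scheme described above can be sketched as follows. The non-overlapping window partitioning and the length-alignment guard (discarding strings whose length deviates from the window majority, so that character positions line up) are illustrative assumptions, not the exact implementation used in the experiments.

```python
from collections import Counter

def temporal_majority_vote(predictions, window=5):
    """Aggregate per-frame OCR strings by per-position majority voting.

    predictions: predicted plate strings for one tracked plate, in frame
    order. Within each window, strings whose length differs from the
    window's majority length are discarded (a simple guard against
    structural errors), then each character position is voted on
    independently. Returns one voted string per window.
    """
    voted = []
    for start in range(0, len(predictions), window):
        chunk = predictions[start:start + window]
        # Keep only strings of the most common length in this window.
        target_len = Counter(len(p) for p in chunk).most_common(1)[0][0]
        aligned = [p for p in chunk if len(p) == target_len]
        # Majority vote at each character position.
        voted.append("".join(
            Counter(chars).most_common(1)[0][0]
            for chars in zip(*aligned)
        ))
    return voted
```

For example, a window containing two correct reads and one single-character confusion recovers the correct string, while a structurally corrupted frame (wrong length) is simply excluded from the vote.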
Table 12 summarizes the effect of temporal majority voting on tracking-based OCR performance using the UFPR-ALPR dataset. For both v4 and v7 models, the application of temporal voting consistently improves character accuracy (CA), reduces character error rate (CER), and alleviates severe structural errors caused by tracking instability. Importantly, the relative improvement provided by temporal voting is comparable for both architectures, indicating that tracking-induced failures are largely independent of the sequence decoder design. These results further support the conclusion that structural recognition errors in sequential datasets are primarily driven by upstream ROI instability, and can be partially mitigated through simple temporal aggregation without altering the OCR model itself.
Crucially, the application of temporal majority voting does not confer a Transformer-specific advantage. Both the CNN + Attention-LSTM model (v4) and the MobileNetV3 + Transformer model (v7) exhibit similar relative performance gains under temporal aggregation, and the fundamental performance gap between architectures remains unchanged. This confirms that the main conclusion of this study holds: under tracking-based sequential recognition, OCR performance is governed primarily by ROI stability and upstream system behavior rather than by the choice of sequence decoder.
4.8. Detailed Analysis of Transformer Decoder Parameters (Model 2)
This subsection analyzes how key structural parameters of the Transformer decoder—namely the number of decoder layers (num_layers) and attention heads (nhead)—affect recognition accuracy and processing speed in the MobileNetV3-based OCR model (v7). Although Transformer-based architectures provide strong sequence modeling capability through self-attention, their computational cost increases rapidly as decoder depth and width grow. Therefore, understanding the accuracy–efficiency trade-off induced by these parameters is essential for practical deployment.
In this analysis, only num_layers and nhead are systematically varied, while all other Transformer components, including feed-forward network dimensions, dropout rates, and decoding strategy, are kept fixed. This design choice is intentional, aiming to isolate the impact of decoder depth and attention parallelism on OCR performance and inference speed. Expanding the parameter sweep to additional hyperparameters would substantially enlarge the search space and obscure causal attribution of performance changes.
Given the short and bounded sequence length of license plate strings, decoder depth and multi-head attention were identified as the dominant architectural factors governing the accuracy–efficiency trade-off. Accordingly, this focused analysis provides a clear and interpretable assessment of how Transformer decoder capacity influences practical OCR performance under real-time constraints.
4.8.1. Experimental Setup
All experiments were conducted using the AI-HUB Korean license plate dataset under identical training conditions and data splits. Only the Transformer decoder parameters were varied, while the CNN feature extractor was fixed to MobileNetV3-Large. The embedding dimension and maximum sequence length were kept constant across all configurations.
The embedding dimension was selected based on a trade-off between representational capacity and computational efficiency. In preliminary experiments, larger embedding dimensions caused a disproportionate increase in memory footprint and decoder-side latency due to the quadratic complexity of self-attention, which was particularly problematic in embedded environments. Conversely, smaller embedding dimensions consistently exhibited underfitting behavior, including unstable convergence and degraded character accuracy. Accordingly, the adopted embedding dimension is the smallest configuration that reliably avoids underfitting while remaining feasible for real-time embedded inference.
Evaluation metrics include edit distance (ED), character accuracy (CA), character error rate (CER), and processing speed measured in frames per second (FPS).
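These metrics can be computed as follows. Defining CA as one minus the mean length-normalized edit distance is one common convention and is assumed here for illustration; the paper's exact normalization may differ.

```python
def edit_distance(pred, gt):
    """Levenshtein distance between predicted and ground-truth strings."""
    prev = list(range(len(gt) + 1))
    for i, p in enumerate(pred, 1):
        curr = [i]
        for j, g in enumerate(gt, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (p != g)))  # substitution
        prev = curr
    return prev[-1]

def cer(pred, gt):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(pred, gt) / max(len(gt), 1)

def char_accuracy(preds, gts):
    """Dataset-level CA, assumed here as 1 - mean CER."""
    return 1.0 - sum(cer(p, g) for p, g in zip(preds, gts)) / len(gts)
```

Under this convention, CA and CER are complementary, while the average (unnormalized) edit distance is reported separately as ED.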
Six Transformer decoder configurations were evaluated. The number of decoder layers and attention heads (num_layers/nhead) was increased jointly (1/1, 2/2, 4/4, 7/7, and 14/14), and a baseline configuration (6/7) was included, in order to analyze the accuracy–efficiency trade-off as decoder capacity increases.
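Such a joint layer/head sweep could be set up with PyTorch's built-in decoder as sketched below. The model dimension, feed-forward size, and dropout shown here are illustrative assumptions, not the paper's settings; the only hard constraint is that the model dimension be divisible by every head count in the sweep.

```python
import torch.nn as nn

# Joint (num_layers, nhead) sweep plus the 6/7 baseline.
CONFIGS = [(1, 1), (2, 2), (4, 4), (6, 7), (7, 7), (14, 14)]
D_MODEL = 252  # illustrative choice, divisible by 1, 2, 4, 7, and 14

def build_decoder(num_layers, nhead, d_model=D_MODEL):
    """Build a Transformer decoder stack with fixed non-swept settings."""
    layer = nn.TransformerDecoderLayer(
        d_model=d_model,
        nhead=nhead,
        dim_feedforward=4 * d_model,  # kept fixed across the sweep
        dropout=0.1,                  # kept fixed across the sweep
        batch_first=True,
    )
    return nn.TransformerDecoder(layer, num_layers=num_layers)

decoders = {cfg: build_decoder(*cfg) for cfg in CONFIGS}
```

Each decoder consumes the CNN feature map (projected to the model dimension) as memory and decodes the character sequence autoregressively; only the stack depth and attention width differ between configurations.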
4.8.2. Recognition Accuracy Analysis
Table 13 summarizes the recognition performance obtained with different Transformer decoder configurations. Overall, recognition accuracy is strongly influenced by decoder depth and width, but the relationship is non-monotonic.
The shallowest configuration (1/1) achieves an average edit distance (ED) of 0.0848 and a character accuracy (CA) of 98.74%, indicating limited representational capacity for modeling character dependencies. The 2/2 configuration does not improve performance and instead exhibits a slight degradation (ED of 0.0949 and CA of 98.61%), suggesting that shallow Transformer structures are insufficient to reliably capture contextual information in license plate sequences.
Clear performance gains emerge once the decoder depth increases to four layers. The 4/4 configuration reduces ED to 0.0794 and improves CA to 98.80%, indicating that moderate depth is necessary to effectively model both local and global dependencies. The best overall recognition accuracy is achieved with the 7/7 configuration, which attains an ED of 0.0775 and a CA of 98.84%.
Further increasing model complexity to the 14/14 configuration does not yield meaningful accuracy improvements. This saturation effect suggests that, beyond a certain depth, additional Transformer layers provide diminishing returns for license plate recognition, where sequence lengths are short and structural variability is limited.
4.8.3. Processing Speed (FPS) Analysis
Processing speed exhibits a strong inverse relationship with Transformer decoder complexity. The shallow configurations (1/1 and 2/2) achieve the highest throughput, reaching approximately 67 and 49 FPS, respectively, which is sufficient for high-throughput real-time OCR applications.
As decoder depth and attention width increase, FPS decreases rapidly. The 4/4 configuration drops to approximately 32 FPS, while the 6/7 and 7/7 settings further reduce throughput to around 23 and 21 FPS. The most complex configuration (14/14) achieves only about 11 FPS, falling well below practical real-time requirements for license plate recognition.
This degradation is primarily caused by the quadratic computational complexity of self-attention and the increased memory traffic introduced by deeper decoder stacks. Under the autoregressive decoding setting with batch size one, these costs dominate runtime behavior, effectively limiting the benefits of architectural parallelism.
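Throughput under these conditions can be measured with a simple timing loop over single frames. This is a minimal sketch assuming a generic inference callable standing in for the full OCR forward pass (including autoregressive decoding); warm-up iterations are excluded so that one-off initialization costs do not bias the estimate.

```python
import time

def measure_fps(infer_fn, inputs, warmup=10):
    """Average throughput (frames/s) of a batch-size-one inference loop.

    infer_fn : callable taking one frame (hypothetical stand-in for the
               detector + OCR pipeline's per-frame inference).
    inputs   : iterable of frames; the first `warmup` are timed-out.
    """
    for x in inputs[:warmup]:
        infer_fn(x)  # warm-up, excluded from timing
    start = time.perf_counter()
    for x in inputs[warmup:]:
        infer_fn(x)
    elapsed = time.perf_counter() - start
    return (len(inputs) - warmup) / elapsed
```

Reported FPS figures of this kind depend on the hardware and on whether timing covers the decoder alone or the end-to-end pipeline, so comparisons are only meaningful under a fixed measurement protocol.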
4.8.4. Trade-Off Analysis and Optimal Configuration
Jointly considering recognition accuracy and processing speed, a clear accuracy–efficiency trade-off is observed. Shallow configurations (1/1 and 2/2) achieve high FPS but exhibit insufficient recognition accuracy for reliable license plate recognition. In contrast, very deep configurations (e.g., 14/14) provide only marginal accuracy improvements while incurring substantial computational overhead, resulting in severely reduced throughput.
Among the evaluated settings, the 7/7 configuration offers the most balanced compromise between accuracy and efficiency. It achieves approximately 98.8% CA with an average edit distance of 0.077 while maintaining an inference speed of around 20 FPS on GPU platforms. Accordingly, this configuration is adopted as the representative setting for the Transformer-based OCR model in subsequent experiments, as it provides stable recognition performance without excessive computational cost.
4.8.5. Discussion
This analysis shows that increasing the structural complexity of Transformer-based OCR models does not necessarily lead to improved performance. For Korean license plate recognition, where character sequences are short and follow well-defined structural patterns, moderately deep Transformer decoders achieve the most favorable balance between accuracy and efficiency. Beyond this point, additional layers mainly increase computational cost without providing commensurate gains in recognition accuracy.
Importantly, the Transformer decoder in Model 2 was intentionally designed under strict computational and structural constraints rather than optimized for maximal representational capacity. The goal of this study is therefore not to demonstrate the theoretical superiority of Transformer architectures, but to assess their practical effectiveness and limitations in realistic, tracking-based license plate recognition systems, particularly under embedded deployment constraints.