4.1. Data Preprocessing
The original vehicle character dataset contained 7967 images. To enhance model robustness and increase training diversity, data augmentation was performed by randomly selecting 30% of the images for linear brightness adjustment and 10% for mild Gaussian blur, simulating illumination variation and defocus effects. After augmentation, the dataset was expanded to 13,544 images.
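As a rough sketch of this augmentation step (assuming OpenCV; the brightness gain/offset ranges and the blur kernel are illustrative assumptions, not the settings used in the paper):

```python
import random
import cv2

def augment_dataset(images, brightness_frac=0.30, blur_frac=0.10, seed=0):
    """Return the original images plus augmented copies: linear brightness
    adjustment on a random 30% subset and mild Gaussian blur on a random 10% subset."""
    rng = random.Random(seed)
    augmented = list(images)

    for i in rng.sample(range(len(images)), int(brightness_frac * len(images))):
        alpha = rng.uniform(0.7, 1.3)   # assumed linear gain range
        beta = rng.uniform(-20, 20)     # assumed brightness offset range
        augmented.append(cv2.convertScaleAbs(images[i], alpha=alpha, beta=beta))

    for i in rng.sample(range(len(images)), int(blur_frac * len(images))):
        augmented.append(cv2.GaussianBlur(images[i], (5, 5), 1.0))  # mild defocus-like blur

    return augmented
```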
Before augmentation, all images were preprocessed using a YOLOv8n cropping stage to remove irrelevant background and retain only the VIN character region. Only these cropped ROI images were used for both pre-training and fine-tuning of the CharSAM-MAE model. It should be emphasized that the 50 raw annotated images used to train YOLOv8n were not included in this dataset and were discarded immediately after training the detector, ensuring no interaction between YOLO-based preprocessing and the subsequent self-supervised learning process.
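For context, a minimal sketch of such a YOLOv8n cropping stage is shown below, assuming the ultralytics API; the weight file name and the highest-confidence box selection are illustrative assumptions rather than the authors' exact pipeline.

```python
import cv2
from ultralytics import YOLO

detector = YOLO("vin_yolov8n.pt")  # hypothetical weights trained on the 50 annotated images

def crop_vin_roi(image_path):
    """Detect the VIN character region and return the cropped ROI (or None if no box is found)."""
    result = detector(image_path)[0]
    if len(result.boxes) == 0:
        return None
    best = int(result.boxes.conf.argmax())                    # keep the highest-confidence box
    x1, y1, x2, y2 = map(int, result.boxes.xyxy[best].tolist())
    image = cv2.imread(image_path)
    return image[y1:y2, x1:x2]
```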
Two distinct data partitions were used during model training. In the self-supervised pre-training stage, all 13,544 unlabeled ROI images were utilized to train the structure-aware MAE for generalizable representation learning. During the fine-tuning stage, only 5% of the original labeled data (approximately 400 images) were employed for supervised adaptation [30], further divided into training (70%), validation (20%), and test (10%) subsets. This strategy significantly reduces dependence on labeled data while maintaining strong generalization capability, as summarized in Table 1.
To enhance the quality of vehicle character images and improve recognition performance, a multi-stage image preprocessing pipeline was designed. First, the original RGB images were converted to grayscale and normalized to ensure uniform pixel distribution. Then, contrast-limited adaptive histogram equalization (CLAHE) was applied at multiple scales (8 × 8, 16 × 16, and 32 × 32 grid sizes) to simultaneously enhance local character details and global contrast. The enhanced results from different scales were fused with weighted averaging, followed by bilateral filtering to preserve edges while smoothing. In addition, a combination of detail enhancement and non-local means denoising was employed to suppress noise and highlight character contours and textures. This preprocessing significantly improved the clarity and signal-to-noise ratio of character regions, providing a solid foundation for downstream deep feature extraction. The full preprocessing pipeline is illustrated in Figure 10.
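A minimal OpenCV sketch of this enhancement pipeline is given below; the clip limits, fusion weights (equal weighting is assumed), and filter parameters are illustrative, and the detail-enhancement step is omitted for brevity.

```python
import cv2
import numpy as np

def enhance_vin_image(bgr):
    """Sketch: grayscale + normalization, multi-scale CLAHE fusion,
    bilateral filtering, and non-local means denoising."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)

    # CLAHE at three grid sizes, fused by (assumed) equal weights
    scales = [(8, 8), (16, 16), (32, 32)]
    enhanced = [cv2.createCLAHE(clipLimit=2.0, tileGridSize=s).apply(gray) for s in scales]
    fused = np.clip(np.mean(np.stack(enhanced, axis=0), axis=0), 0, 255).astype(np.uint8)

    # edge-preserving smoothing, then denoising while keeping character strokes
    smoothed = cv2.bilateralFilter(fused, d=9, sigmaColor=75, sigmaSpace=75)
    denoised = cv2.fastNlMeansDenoising(smoothed, h=10,
                                        templateWindowSize=7, searchWindowSize=21)
    return denoised
```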
Given that most raw images had a resolution of 1300 × 300, black padding was applied to maintain the original structure while resizing them to a square format. All images were finally rescaled to 256 × 256 pixels to standardize input dimensions, ensuring centered character regions without distortion and improving training efficiency.
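The padding-and-resizing step can be sketched as follows (black constant borders, then area interpolation; parameter names are illustrative):

```python
import cv2

def pad_and_resize(img, size=256):
    """Pad the wide ROI (e.g., 1300 x 300) to a square with black borders,
    keeping characters centered and undistorted, then resize to size x size."""
    h, w = img.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    squared = cv2.copyMakeBorder(img, top, side - h - top, left, side - w - left,
                                 cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(squared, (size, size), interpolation=cv2.INTER_AREA)
```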
4.2. Experimental Setup
The proposed method first performs self-supervised pretraining on the full unlabeled dataset using the structure-aware MAE to learn key structural features of vehicle characters. Then, 5% of the labeled samples are used for supervised fine-tuning, with a 70%-20%-10% split for training, validation, and testing, respectively. During fine-tuning, high-level features are extracted via the pretrained MAE encoder, further enhanced by a multi-head self-attention module, and decoded by an attention-integrated CTC decoder for character sequence prediction. The model is optimized using the AdamW optimizer, with class weights incorporated into the loss function to address class imbalance.
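As a rough illustration of this fine-tuning head, the PyTorch sketch below refines encoder features with multi-head self-attention and projects them to per-frame logits for CTC training. The feature dimension, vocabulary size, and optimizer settings are assumptions, and the paper's class weighting is only noted in a comment because torch.nn.CTCLoss exposes no per-class weight argument.

```python
import torch
import torch.nn as nn

class AttentionCTCHead(nn.Module):
    """Sketch of the fine-tuning head: features from the pretrained MAE encoder
    are refined by multi-head self-attention and projected to per-frame logits
    for CTC training. Dimensions and vocabulary size are assumptions."""
    def __init__(self, d_model=768, num_heads=8, num_classes=37):  # assumed 36-symbol vocab + CTC blank
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model, num_classes)

    def forward(self, feats):                      # feats: (B, T, d_model) encoder tokens
        ctx, _ = self.attn(feats, feats, feats)    # contextual refinement via self-attention
        logits = self.proj(self.norm(feats + ctx))
        return logits.log_softmax(dim=-1)          # (B, T, num_classes) log-probs for CTC

head = AttentionCTCHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4, weight_decay=0.05)  # assumed settings
criterion = nn.CTCLoss(blank=0, zero_infinity=True)
# Note: nn.CTCLoss has no per-class weight argument, so the paper's class-weighted
# loss would require an additional custom term; it is not reproduced in this sketch.
```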
Model performance is evaluated using multiple metrics: character-level accuracy (CharAcc), character error rate (CER), and sequence-level accuracy (SeqAcc), providing a comprehensive assessment of recognition quality. The corresponding metric formulas are defined as follows:
$$\mathrm{CharAcc} = \frac{N_c}{N} \times 100\%, \qquad \mathrm{CER} = \frac{S + I + D}{L} \times 100\%, \qquad \mathrm{SeqAcc} = \frac{N_{\mathrm{seq}}^{\,\mathrm{correct}}}{N_{\mathrm{seq}}} \times 100\%,$$
where $N_c$ denotes the number of correctly predicted characters, $N$ represents the total number of characters, $S$, $I$, and $D$ indicate the number of character substitutions, insertions, and deletions, respectively, $L$ denotes the length of the reference character sequence, and $N_{\mathrm{seq}}^{\,\mathrm{correct}}$ and $N_{\mathrm{seq}}$ denote the number of VIN sequences recognized without any character error and the total number of sequences, respectively.
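A minimal reference implementation of these metrics under the reconstructed formulas is given below; the function names are illustrative, and substitution/insertion/deletion counts are obtained with a standard edit-distance dynamic program.

```python
def edit_ops(ref, hyp):
    """Count substitutions (S), insertions (I), and deletions (D) needed to turn
    the reference string into the hypothesis, via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = [[(0, 0, 0, 0)] * (n + 1) for _ in range(m + 1)]   # (cost, S, I, D)
    for i in range(1, m + 1):
        dp[i][0] = (i, 0, 0, i)                             # delete all reference chars
    for j in range(1, n + 1):
        dp[0][j] = (j, 0, j, 0)                             # insert all hypothesis chars
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                c_sub = dp[i - 1][j - 1]
                c_ins = dp[i][j - 1]
                c_del = dp[i - 1][j]
                dp[i][j] = min(
                    (c_sub[0] + 1, c_sub[1] + 1, c_sub[2], c_sub[3]),   # substitution
                    (c_ins[0] + 1, c_ins[1], c_ins[2] + 1, c_ins[3]),   # insertion
                    (c_del[0] + 1, c_del[1], c_del[2], c_del[3] + 1),   # deletion
                )
    return dp[m][n][1:]                                     # (S, I, D)

def vin_metrics(refs, hyps):
    """CharAcc, CER, and SeqAcc over paired reference/hypothesis VIN strings."""
    total_chars = sum(len(r) for r in refs)
    subs = ins = dels = correct_chars = correct_seqs = 0
    for ref, hyp in zip(refs, hyps):
        s, i, d = edit_ops(ref, hyp)
        subs, ins, dels = subs + s, ins + i, dels + d
        correct_chars += len(ref) - s - d                   # chars neither substituted nor deleted
        correct_seqs += int(ref == hyp)
    char_acc = correct_chars / total_chars
    cer = (subs + ins + dels) / total_chars
    seq_acc = correct_seqs / len(refs)
    return char_acc, cer, seq_acc
```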
Considering the practical industrial production workflow, the VIN acquisition cycle is approximately 60 s per vehicle, during which image capturing and manufacturing data synchronization dominate total processing time. In contrast, the inference stage of the proposed method accounts for only a negligible fraction of this period and does not impose any delay on the overall pipeline. Therefore, model efficiency is not a limiting factor for deployment, and the recognition module fully meets real-time operational requirements under factory conditions.
4.3. Comparison with Baselines
To validate the effectiveness of CharSAM-MAE in the VIN code recognition task, we conducted two groups of comparative experiments: one comparing the performance of different models on VIN recognition, and another evaluating the impact of data augmentation on model performance.
As shown in the experimental results, CharSAM-MAE clearly outperforms both CRNN and PaddleOCR v5 in terms of CharAcc and SeqAcc. Specifically, CharSAM-MAE achieves a CharAcc of 99.25%, compared to 98.61% for CRNN and 98.92% for PaddleOCR v5 [31]. The improvement in SeqAcc is even more pronounced, reaching 88.6%, while CRNN and PaddleOCR v5 obtain 79.6% and 85.2%, respectively. Regarding the CER, CharSAM-MAE records the lowest value at 0.85%, further confirming its superior recognition capability.
The results demonstrate that CharSAM-MAE exhibits a clear advantage in recognizing complete VIN sequences. The significant improvement in SeqAcc indicates that the structure-aware masking strategy effectively mitigates cumulative errors caused by sparse character sequences, enabling more reliable recognition in industrial continuous-shooting scenarios. To ensure that these improvements are not due to random fluctuations, all experiments were repeated under identical deployment settings, and the variance of CharAcc, SeqAcc, and CER across repeated evaluations remained within ±0.05%. This level of stability is considerably below the performance gains observed in
Table 2, confirming the robustness and statistical reliability of the proposed method.
4.4. Ablation Study
To comprehensively assess the contribution of each component within the proposed CharSAM-MAE framework to VIN recognition, multiple groups of ablation experiments were performed. In addition, training convergence behavior was evaluated using loss and accuracy curve analysis.
Furthermore, to investigate the influence of contrast-enhanced preprocessing on recognition robustness, the original VIN images were processed using contrast enhancement and subsequently evaluated on CharSAM-MAE without modifying any other parameters. As shown in
Table 3, this preprocessing step led to a substantial improvement in recognition performance: character accuracy increased from 87.46% to 99.25%, sequence accuracy rose from 11.72% to 88.3%, and CER was reduced from 12.54% to 0.85%. These results indicate that contrast enhancement effectively strengthens visual feature extraction on sparse character input, particularly for complete VIN sequence recognition.
As shown in
Table 4, both the Structure-Aware Masking strategy and the Region-Weighted Reconstruction Loss independently contribute to performance improvement over the baseline MAE configuration, increasing the sequence accuracy to 86.1% and 86.4%, respectively. When the two components are jointly applied, the sequence accuracy further improves to 88.6%, indicating a clear synergistic enhancement rather than a simple additive effect. This confirms that the observed performance gain arises from the intrinsic effectiveness of the proposed architectural design, rather than from model scaling or hyperparameter tuning alone.
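For clarity, one plausible form of such a region-weighted reconstruction objective is sketched below: per-patch MSE is weighted more heavily on character-region patches than on background. The weight values and tensor shapes are purely illustrative assumptions, not the authors' exact loss.

```python
import torch

def region_weighted_recon_loss(pred, target, char_mask, w_char=2.0, w_bg=1.0):
    """Sketch of a region-weighted reconstruction loss.
    pred, target: (B, N, D) reconstructed and original patch pixels;
    char_mask: (B, N) with 1 for character-region patches, 0 for background."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)          # (B, N) per-patch MSE
    weights = torch.where(char_mask.bool(),
                          torch.full_like(per_patch, w_char),
                          torch.full_like(per_patch, w_bg))
    return (weights * per_patch).sum() / weights.sum()
```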
To further quantify the effectiveness of the proposed attention-enhanced decoding scheme, an additional comparison was conducted between the standard CTC and A-CTC decoders while all other components were kept unchanged. As summarized in
Table 5, the introduction of multi-head self-attention leads to a consistent improvement in recognition performance, particularly on long VIN sequences.
This confirms that the attention mechanism effectively improves contextual dependency modeling while maintaining the alignment flexibility of the CTC paradigm.
The experiments also examined the effects of model scale, character retention ratio, and loss penalty strategy, as illustrated in
Figure 11. Specifically, CharSAM-MAE was tested in three configurations—base, large, and huge—with parameter sizes of 110.92 M, 328.35 M, and 655.90 M, respectively. Results show that as the model capacity increases, the recognition performance consistently improves: the base model achieves a CharAcc of 99.25% and a SeqAcc of 88.6%, the large model reaches 99.43% and 91.2%, and the huge model further attains 99.52% and 92.6%.
It can be observed that enlarging the model scale improves recognition accuracy but also leads to a substantial increase in parameters, requiring a trade-off between performance and computational cost for industrial deployment.
To examine the adaptability of the structure-aware masking strategy to sparse character images, different character retention ratios were explored during pre-training, as shown in
Table 6. The results show that a 25%/75% retention ratio achieves the best balance, giving a CharAcc of 99.25% and a SeqAcc of 88.6%. Higher or lower ratios cause a decrease in SeqAcc—for example, 30%/70% results in 88.2%, while 15%/85% drops to 87.0%.
This experiment investigates the impact of different loss penalty schemes (no penalty, and penalty factors of 2, 3, and 4) on model performance, as shown in Table 7. The results indicate that a penalty factor of 3 yields the best results, achieving a CharAcc of 99.25% and a SeqAcc of 88.6%, significantly outperforming the no-penalty setting (SeqAcc of 82.6%). Meanwhile, the CER decreases from 1.10% to 0.85%, demonstrating the effectiveness of the penalty strategy in improving recognition robustness.
Comprehensive comparison and ablation experiments demonstrate that the proposed CharSAM-MAE consistently outperforms existing industrially deployable approaches for VIN character recognition, exhibiting superior robustness under low-contrast conditions and in the presence of complex imaging artifacts. The ablation findings further confirm that both the character-region retention ratio and the region-weighted reconstruction strategy are essential to improve the quality of the feature reconstruction and maintain sequence integrity.
To complement the quantitative analysis, a qualitative comparison is provided to visually illustrate the effectiveness of the proposed modules. As shown in
Figure 12, both PaddleOCR v5 and the original MAE framework suffer character-level recognition errors (highlighted in red), particularly in cases involving weak contrast or character adhesion. In contrast, CharSAM-MAE demonstrates improved structural consistency and significantly reduces misrecognition across all illustrated samples.
It should be noted that, in the third example, all models—including ours—produce duplicated leading characters. This behavior is attributed to motion-induced streaking and optical defocus in the source image, which distorts structural boundaries and impairs separability between adjacent characters at the feature level. Although such degradation lies beyond the modeling scope of self-supervised pre-training, CharSAM-MAE still produces results that are most closely aligned with the ground truth. This observation indicates that the proposed structure-aware masking strategy and the attention-enhanced decoding mechanism substantially improve resilience against extreme imaging degradation, while the remaining inconsistency mainly originates from inherent limitations in the acquisition process rather than from the recognition architecture itself.