1. Introduction
With the accelerating digitization of calligraphy stone inscriptions, the demand for their automatic identification has become increasingly prominent. However, traditional computer vision methods (such as SIFT and HOG features [1]) struggle to capture deep style features, especially in single-channel rubbing images, because of the uniformity of ink color, the diversity of brush-tip morphology, and the complexity of paper texture. The development of a high-precision classification system is therefore of urgent practical significance for the protection of calligraphic heritage.
Within the broader research context, deep learning has become the driving force behind research on calligraphy. It has shown excellent performance in key tasks such as calligraphy text detection, character recognition and classification, style analysis, and aesthetic evaluation (see
Table 1).
In the direction of calligraphy text detection, several studies have proposed solutions to the problem of text extraction in complex scenarios. Mask TextSpotter v3 [
2], proposed by Peking University in 2021, achieves an F1 value of 86.3% in curved text detection by fusing instance segmentation and semantic segmentation, but such methods suffer a marked drop in performance when facing calligraphy text with severe occlusion, blurring, complex backgrounds, or highly divergent styles. For example, rubbings of calligraphy stone inscriptions may contain disturbances such as fading, staining, and cracks. Point Gathering Network (PGNet) [
3], developed by the Institute of Automation of the Chinese Academy of Sciences at the same time, significantly improves the positioning accuracy of scenes such as street view signboards with the help of pixel-level geometric modeling. In addition, Huang et al. [
4] replaced the convolutional backbones of the Connectionist Text Proposal Network (CTPN) and Efficient and Accurate Scene Text Detection (EAST) with ResNet (Residual Network), which offers better feature extraction sensitivity; the results indicate that the modified EAST with ResNet101 achieves the highest accuracy as the depth and width of ResNet increase. Its text segmentation accuracy on ICDAR 2015 is 83.4%, 7% higher than the original PVANET-EAST, its text detection accuracy on untrained scanned documents is 83.9%, and it reaches 86.3% when applied to self-collected Chinese calligraphy. However, general-purpose networks such as ResNet are not designed for calligraphy images and may not optimally capture the subtle structure and stroke characteristics of calligraphic characters. Based on a dual-generator framework of deep adversarial networks, Wang et al. [
5] proposed a method for robotic calligraphy reproduction. In this model, an encoder-decoder module is employed as one generator for style learning, while a robotic arm serves as the other generator for motion learning. These two generators work together to optimize the network and obtain optimal robotic calligraphy works. The experimental results demonstrate that this method is extremely efficient and applicable in robotic calligraphy, achieving state-of-the-art performance. Specifically, the average structural similarity index reaches 75.91%, the coverage rate is 70.25%, and the intersection over union (IoU) stands at 80.68%. Wang et al. [
6] proposed a new detector named EMANet. They first replaced the core feature extraction module of EfficientNet-B3 with an improved module (MBConv++) that better captures interdependencies between channels, allowing the network to focus on key features across the dataset. They also designed a Multi-Dimensional Scale Fusion (MDSF) module that effectively enhances the scale robustness of the network, and developed a mobile ancient-text digitization application suitable for the Internet of Things (IoT). Experimental results show that EMANet achieves an F-measure of 94.9% on a handwritten ancient book dataset, demonstrating its effectiveness. Qi et al. [
7] proposed an advanced deep learning framework for detecting ancient Chinese characters in complex scenarios, with the goal of improving detection accuracy. The framework introduces an Ancient Character Haar Wavelet Transform Downsampling Block (ACHaar), which effectively reduces the spatial resolution of feature maps while preserving key features of ancient characters. It also incorporates a Glyph Focus Module (GFM), which uses an attention mechanism to enhance the processing of deep semantic information and generates feature maps of ancient characters that emphasize horizontal and vertical features through a four-path parallel strategy. A dedicated dataset was constructed to train and validate the model. Experimental results show that this method outperforms previous text detection methods on the HUSAM-SinoCDCS dataset: precision is improved by 1.36% to 92.84%, recall by 2.24% to 85.61%, and F1-score by 1.84% to 89.08%.
In the field of calligraphy recognition and classification, the research focus is on algorithm improvement and model adaptation to improve recognition robustness in complex scenarios. Huang et al. [
8] trained and tested Convolutional Neural Networks (CNNs) on the four main calligraphy fonts of the Tang Dynasty (690–907 AD) and achieved classification accuracies of 89.5–96.2%, highlighting the value of CNNs in font and art style classification. However, the scope of their testing is still limited to a specific dynasty and set of fonts. Wang et al. [
9] proposed a calligraphy recognition method based on an improved EfficientNet, which efficiently extracts individual calligraphy characters and expands the training set through a character-box algorithm, and introduces the Convolutional Block Attention Module (CBAM) during training to strengthen the network's attention mechanism together with a Self-Attention module; however, its structure is relatively complex, which may increase the computational cost of training and inference. In addition, attention mechanisms such as BAM and CBAM may not fully account for spatial information at different scales in calligraphy characters and fail to exploit contextual information from different receptive fields to enrich the feature space. The Hierarchical Universal Network (HUNet) designed by Wang et al. [
10] achieves efficient recognition of different character types through a parameter-sharing architecture and multi-stage feature extraction and fusion, delivering better recognition performance than other evaluated efficient models while maintaining high computational throughput. After training on MTACCR, HUNet's Top-1 accuracy on the MACR test set improved by 9.64%. However, the computational overhead of its parameter-sharing architecture and multi-stage feature fusion still needs to be considered. For special scenarios, Han et al. [
11] proposed a crowdsourcing-based expert identification method to restore severely degraded calligraphy characters through a novel character similarity reasoning (CSI) algorithm and developed LanT, a crowdsourced repair task system; experiments confirmed that CSI can reliably identify experts, but reliance on crowdsourcing and expert identification can be costly and difficult to scale. Zerdoumi et al. [
12] focused on offline Arabic text separation and performed well in distinguishing Arabic scripts; using Osmani and Arabic scripts based on verses 84 and 117, the method achieved accuracies above 98.14% and 90.16%, respectively. In addition, to address problems specific to calligraphy characters, Gao et al. [
13] proposed a Faster R-CNN-based method, framed as a multi-scale, small-object detection problem, to segment components of incoherent handwritten calligraphy characters that partially adhere to or overlap one another. Experimental results demonstrate that the method segments adhering and overlapping components accurately; these components can also be retrieved accurately in a retrieval system, with a mean Average Precision of 95.7% for the top 30 retrieval results. The high retrieval accuracy indirectly reflects good segmentation quality, further confirming the effectiveness of the proposed method. Chen et al. [
14] constructed an improved algorithm, YOLOv7-PDM, for regular-script calligraphy recognition; after training and testing on a self-built single-character dataset of square characters, the method reached an accuracy of 94.19% in orthographic feature detection. However, because it was trained and tested on self-built, style-specific datasets, the model may generalize poorly to other styles, to works by different calligraphers, or to different periods. Ye et al. [
15] proposed two innovative recognition methods: one integrates feature sequence extraction, sequence modeling, and transcription within a unified architecture, while the other aims to break through the limitations of traditional algorithms. Both methods achieve a recognition accuracy of 84.70%, and the latter shows higher robustness in multi-style Chinese character recognition. Yan et al. [
16] proposed a novel one-shot learning model called SMFNet for Chinese character font recognition. This model was trained and tested on the XIKE-CFS Chinese font style dataset. The experimental results indicate that SMFNet achieved a recognition accuracy of 97.50% on the XIKE-CFS dataset; additionally, it supports one-shot recognition, with an accuracy of 92.84% for this specific task. Zhang et al. [
17] proposed an end-to-end Chinese character generation model based on dense blocks and capsule networks. The generator of this model adopts a self-attention mechanism and densely connected blocks, while the discriminator is composed of a capsule network and a fully connected network. To verify the effectiveness of the model, they conducted comparative experiments and ablation experiments on the Yan Zhenqing Zhengwen dataset, Deng Shiru Wenshuwen dataset, and Wang Xizhi Yunxing Wenshu dataset. The experimental results show that compared with the comparison models, this model increases the SSIM by an average of 0.07, decreases the MSE by an average of 1.95, and increases the PSNR by 0.92.
In terms of style analysis and aesthetic evaluation, Sun et al. [
18] proposed SRAFE, a Siamese regression aesthetic fusion method that combines calligraphy aesthetics with deep learning. To further enhance calligraphy evaluation, SRAFE fuses deep features from the Siamese regression network (SRN) with hand-designed aesthetic features. Ablation experiments show that SRAFE improves Mean Absolute Error (MAE) by 0.170 and the Pearson correlation coefficient (PCC) by 0.041 compared with the SRN alone, and improves MAE by 0.323 and PCC by 0.261 compared with using hand-designed features alone. Yoshida et al. [
19] used the deep learning method TabNet to extend previous research and improve the accuracy and interpretability of aesthetic quality assessment for Chinese calligraphy; however, enabling deep learning models to understand and integrate higher-level abstract concepts such as calligraphic aesthetic principles, charm, and spirit resonance (qiyun) remains a challenge. Xu et al. [
20] proposed a shear-guided filter to extract calligraphic art information; experiments showed that the filter extracts form and spirit information more accurately, and the new evaluation parameters performed better in the performance evaluation. Across the test images, RGF performs best and OTSU performs worst, with a maximum MER value of 0.0268; in terms of MER values and averages, FFCM is better than OTSU but inferior to MCGF. Si et al. [
21] identified the unique characteristics of each character with the HOG method, combined with Euclidean distance to measure the spatial relationship between target and background points and capture distinctive strokes and patterns, and then used the GoogLeNet Inception-v3 model to exploit CNN capabilities, highlighting the application potential of the approach. In-depth analysis revealed a recognition rate of 93.12%, and even in the presence of Gaussian white noise the method obtains 91.3%, 90.9%, and 89.4% accuracy at noise levels of 0.02, 0.04, and 0.06. In terms of style generation and calligraphy applications, Liang et al. [
22] proposed a calligraphy writing method based on style transfer algorithm to achieve Chinese font art style transfer while maintaining calligraphy imitation accuracy, and proposed a typical similarity evaluation method including balance, tilt, and writing strength similarity to analyze the writing effect. The experimental results show that the minimum similarity between the robot’s writing style and the target character style is greater than 0.75. The proposed method shows good calligraphy writing effect and can be applied to various calligraphy training processes and trajectory planning applications. Wang et al. [
5], as noted above, designed a dual-generator framework based on deep adversarial networks for calligraphy reproduction, achieving an average structural similarity index of 75.91%, a coverage rate of 70.25%, and an intersection over union of 80.68%. Lin et al. [
23] proposed a stroke generation module based on a generative adversarial learning model that learns to disassemble the writing style of strokes, so that through character disassembly it can generate stroke trajectories in the learned style and apply that writing style to any Chinese character in a given database. Liu et al. [
24] employed the Latent Diffusion Model (LDM) to generate Chinese calligraphy characters, using ready-made SimSun font character images as guiding conditions so that reference styles can be learned at the component level; the method demonstrates state-of-the-art performance, surpassing comparison methods both quantitatively and qualitatively. Zhao et al. [
25] proposed a self-supervised style learning (SSL) method that aims to increase the skeletal similarity between the output image and the target. An evaluation of the proposed method using a public Chinese handwriting database shows that the handwritten images generated by its method have higher quality, stroke correctness, and calligraphy similarity compared to other popular font imitation methods. Xin et al. [
26] proposed a novel detection network based on dynamic feature fusion upsampling and text region focusing, named DRA-Net. The core innovations of this method include: (1) a Dynamic Fusion Upsampling Module, which adaptively assigns weights to effectively fuse multi-scale features while preserving key information during feature propagation; (2) an Adaptive Text Region Focusing Module, which integrates an axial attention mechanism to enhance the model’s ability to locate text regions and suppress background interference; and (3) the integration of deformable convolution, which improves the network’s capability to model irregular text shapes and extreme aspect ratios. Additionally, a dataset named ACST was constructed. The experimental results show that the detection accuracy of DRA-Net on the ACST dataset is significantly higher than that of existing methods, and it performs robustly in scenarios with complex backgrounds and extreme text aspect ratios. Its F1-score reaches 72.9%, precision is 82.8%, and recall rate is 77.5%.
In conclusion, in image-based calligraphy style classification tasks, especially for unique image carriers such as rubbing images of stone inscriptions, existing methods are confronted with significant challenges.
First, rubbing images are essentially single-channel grayscale images whose feature representation dimension is far lower than that of color images. This single-channel nature means that rubbing images rely mainly on stroke morphology, ink shade, and paper texture for representation. During digitization, factors such as ambient light, rubbing quality, and paper aging introduce substantial noise, uneven ink distribution, and damage marks. This non-semantic noise is mixed with the genuine structural features of the calligraphy, greatly limiting the extraction of effective style features and the robustness of traditional convolutional neural networks (CNNs).
Second, the style discrimination of calligraphy art is highly dependent on subtle local features (e.g., the pressing and lifting of brush tips, the force at turning points, and the angles of stroke initiation and termination) as well as global spatial layouts across character scales. For instance, the differences in clerical script styles among different dynasties may only be reflected in the delicate handling of the wavy stroke (the extended right-falling stroke) or in the degree of outward expansion of the overall character structure. Traditional CNNs excel at extracting local textures, but they often struggle to balance these fine-grained local variations with long-range global dependencies. This deficiency makes it difficult for models to distinguish calligraphy works with similar styles but different ages, thereby compromising the final classification accuracy.
Based on the two key challenges mentioned above, we propose an efficient deep learning model integrated with a dual-path attention mechanism. The model is designed to overcome the limitations of single-channel features and guide the network to accurately focus on the key brushstrokes and spatial regions critical for style discrimination during the feature extraction process, thus achieving accurate and efficient calligraphy style classification.
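To make the dual-path design concrete, the following is a minimal PyTorch sketch of a BAM-style attention block, following the original BAM formulation in which parallel channel and spatial paths are summed and applied as a residual gate. The reduction ratio, dilation rate, example feature-map shape, and insertion point within EfficientNet-B2 are illustrative assumptions, not the exact configuration reported later in this paper.

```python
import torch
import torch.nn as nn


class BAMBlock(nn.Module):
    """BAM-style dual-path attention: a channel path and a spatial path are
    computed in parallel, summed, and applied as a residual gate."""

    def __init__(self, channels: int, reduction: int = 16, dilation: int = 4):
        super().__init__()
        # Channel path: global average pooling followed by a bottleneck MLP.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # Spatial path: 1x1 reduction, a dilated 3x3 conv, and a 1x1 map to one channel.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.Conv2d(channels // reduction, channels // reduction, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, 1, 1) + (B, 1, H, W) broadcasts to a full (B, C, H, W) attention map.
        att = torch.sigmoid(self.channel_att(x) + self.spatial_att(x))
        return x * (1.0 + att)  # residual gating, as in the original BAM


# Illustrative usage on a dummy late-stage feature map (shape is an assumption).
feat = torch.randn(2, 352, 16, 16)
out = BAMBlock(352)(feat)
```

Applied to a late-stage feature map, the channel path reweights stroke-related channels while the spatial path highlights the regions that carry the style-defining strokes.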
Table 1.
A Summary of Relevant Research in Calligraphy Image Analysis.
| Research Directions | Representative Literature | Dataset | Core Approach | Key Performance |
|---|---|---|---|---|
| Calligraphy text detection | [2] Mask TextSpotter v3 | Natural-scene curved text | Instance segmentation and semantic segmentation | Curved text detection F1: 86.3% |
| Calligraphy text detection | [4] Huang et al. | Universal text | CTPN/EAST + ResNet | Improved feature extraction to improve positioning accuracy |
| Calligraphy identification and classification | [8] Huang et al. | Four regular-script fonts of the Tang Dynasty | CNN | Classification accuracy: 89.5–96.2% |
| Calligraphy identification and classification | [9] Wang et al. | Single calligraphy characters | Improved EfficientNet + CBAM | Introduces attention mechanisms |
| Calligraphy identification and classification | [14] Chen et al. | Self-built square-character single-character dataset | Improved YOLOv7 | Orthographic feature detection accuracy: 94.19% |
| Style analysis and generation | [19] Yoshida et al. | Calligraphy aesthetic evaluation | TabNet | Improved accuracy and interpretability of aesthetic evaluation |
| Style analysis and generation | [5] Wang et al. | Calligraphy reproduction | Dual-generator deep adversarial network framework | Average structural similarity index: 75.91% |
3. Results
3.1. Evaluation Indicators
The loss value of the model on the validation set quantifies the discrepancy between the model’s predictions and the true labels for unseen validation data.
Validation Loss (Val Loss) serves as a core indicator reflecting the model’s generalization ability. Specifically, if Val Loss is significantly higher than Training Loss (Train Loss), it indicates the model is overfitting; if Val Loss is close to Train Loss, the model demonstrates strong generalization ability.
Accuracy: Defined as the ratio of correctly predicted samples to the total number of samples, it is the most intuitive indicator of classification performance.
To validate the effectiveness of the proposed method, we compare it against conventional CNN models and ResNet-50 as baseline models.
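For clarity, the sketch below shows one way these indicators (accuracy, macro F1, and one-vs-rest AUC) can be computed from predicted class probabilities with scikit-learn; the function name and averaging choices are illustrative assumptions, not the authors' exact evaluation code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def evaluate(y_true, y_prob):
    """Compute the classification indicators reported in this section.

    y_true: integer labels, shape (n_samples,)
    y_prob: predicted class probabilities, shape (n_samples, n_classes)
    """
    y_pred = np.argmax(y_prob, axis=1)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        # One-vs-rest AUC averaged over the four dynasty classes.
        "macro_auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
    }
```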
3.2. Experimental Results
The EfficientNet-B2 + BAM model attained a classification accuracy of 98.44% on the test set, significantly outperforming ResNet-50 (92.34%) and conventional CNN models (84.67%). More importantly, its training loss and validation loss were minimized to 0.1049 and 0.0625, respectively, the lowest among all compared models. This indicates that the model not only possesses strong fitting capability but also demonstrates superior generalization performance, with overfitting effectively mitigated (see
Table 4).
In terms of comprehensive evaluation metrics, the proposed model also exhibits notable advantages. Its F1-score (0.9675) and AUC (Area Under the Curve) value (0.9694) rank the highest among the three models, confirming that it achieves an optimal trade-off between precision and recall while demonstrating robust inter-category discrimination capability.
To verify the effectiveness of the BAM attention module and the necessity of its dual-path design, a series of systematic ablation experiments were conducted, with the results presented in
Table 5. The experiments employed EfficientNet-B2 as the baseline model (test set accuracy: 76.68%) and sequentially integrated different attention mechanisms for comparison—encompassing standalone channel attention, standalone spatial attention, SE attention, ECA attention, CBAM, and the proposed BAM module.
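The sketch below illustrates how such ablation variants can be assembled: the same EfficientNet-B2 backbone (here taken from torchvision) is combined with an optional attention block before pooling and classification. The wiring, the use of torchvision, and the helper names are assumptions for illustration; the authors' exact integration point is not specified here.

```python
from typing import Optional

import torch.nn as nn
from torchvision.models import efficientnet_b2


def build_variant(attention: Optional[nn.Module], num_classes: int = 4) -> nn.Module:
    """Assemble one ablation variant: EfficientNet-B2 feature maps, an optional
    attention block, then global pooling and a linear classifier."""
    backbone = efficientnet_b2(weights=None).features  # outputs (B, 1408, H/32, W/32)
    head = nn.Sequential(
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(1408, num_classes),
    )
    layers = [backbone]
    if attention is not None:
        layers.append(attention)  # e.g. the BAMBlock sketched earlier, SE, ECA, or CBAM
    layers.append(head)
    return nn.Sequential(*layers)


# e.g. variants = {"baseline": build_variant(None), "bam": build_variant(BAMBlock(1408))}
```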
The experimental results show that the proposed EfficientNet-B2 + BAM model achieved an excellent accuracy of 98.44% on the test set, significantly outperforming all other variants. Meanwhile, its extremely low validation loss (0.0625) and training loss (0.1049) indicate the model exhibited excellent generalization ability and optimization efficiency. In terms of macro evaluation metrics, the BAM model performed remarkably well: it achieved a Macro F1-score of 0.9675, a Macro Recall of 0.9480, a Macro AUC of 0.9694, and a Macro Precision of 0.8690, delivering balanced and leading overall performance.
Standalone spatial attention significantly increased accuracy from 76.68% to 87.62% (a 10.94 percentage point gain), whereas standalone channel attention yielded only a marginal improvement (78.24%, a 1.56 percentage point increase). This explicitly demonstrated that in calligraphy style classification, geometric information—such as character stroke structure and spatial layout—is more discriminative of style differences than channel feature responses. The spatial attention model’s F1-score (0.8614), Recall (0.8527), and AUC (0.9258) were all significantly higher than those of the channel attention model, further corroborating the importance of geometric features.
Although spatial attention contributed substantially, the complete BAM model (accuracy: 98.44%) substantially outperformed any single-path model. BAM’s F1-score (0.9675) was 10.61 percentage points higher than that of standalone spatial attention, and its AUC (0.9694) was 4.36 percentage points higher. This robustly validated the synergistic effect between channel and spatial attention: through adaptive recalibration of feature channel weights, the channel path assisted the spatial path in focusing more precisely on style-relevant regions.
Compared with the well-established CBAM, the BAM model outperformed it by 5.7 percentage points in accuracy (98.44% vs. 92.75%) and held advantages in F1-score (0.9675 vs. 0.9138), Recall (0.9480 vs. 0.9054), and AUC (0.9694 vs. 0.9689), with only a slightly lower Precision (0.8690 vs. 0.9237). This result underscored the BAM module’s unique advantages in calligraphy image feature extraction, and its dual-path integration method is better suited to capturing the complex features of such cultural heritage data. Furthermore, BAM’s validation loss (0.0625) was lower than CBAM’s (0.0802), indicating stronger generalization ability.
Performance showed a gradual improvement from SE attention (accuracy: 89.43%) to ECA attention (90.67%), then to CBAM (92.75%), and finally to BAM (98.44%), mirroring the iterative optimization of attention mechanism design. BAM achieved the highest accuracy while maintaining low training and validation losses, showcasing a favorable balance between model complexity and performance.
The loss and accuracy curve during the experiment is shown in
Figure 7.
As shown in
Figure 7a, in the early stage of training (Epoch 0–5), the model quickly captures the high-frequency features of the training data with efficient parameter updates. The validation loss rises initially before declining: this is because the model tends to overfit the training data at first, leading to an increase in validation loss. However, as training progresses, the model gradually develops generalized representations, and the validation loss drops sharply.
In the middle stage of training (Epoch 5–10), both the training loss and validation loss decrease concurrently, indicating that the model enters a stable learning phase. During this period, it not only refines its fitting to the training data but also enhances its generalization ability.
In the late stage of training (Epoch 10–35), both losses stabilize, with the validation loss being slightly lower than the training loss. At this point, the model has fully grasped the core features of the data. A lower validation loss suggests that the model’s generalization error on the validation set is smaller than the fitting error on the training set.
As shown in
Figure 7b, during the initial training phase (Epoch 0–15), the training accuracy rises rapidly. In this period, the model quickly captures the discriminative features of the training data, resulting in a marked enhancement in its classification capability.
Subsequently (Epoch 15–35), the validation accuracy aligns closely with the training accuracy—even reaching parity with or slightly exceeding the latter. This phenomenon fully reflects the model’s strong generalization ability.
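The curves in Figure 7 are per-epoch training and validation statistics; a generic PyTorch epoch loop of the kind that produces them is sketched below. This is a plain illustration under assumed names, not the authors' training script, and optimizer, scheduler, and augmentation details are omitted.

```python
import torch


def run_epoch(model, loader, criterion, optimizer=None, device="cuda"):
    """One pass over a data loader. With an optimizer it trains; without one it
    evaluates. Returns the mean loss and accuracy plotted in Figure 7."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, count = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)
            loss = criterion(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            count += labels.size(0)
    return total_loss / count, correct / count
```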
Figure 8 shows the multi-class ROC curves of the proposed calligraphy dynasty classification model on the test set. Key observations from the figure are as follows:
The individual ROC curves for all four dynasties assume a shape close to the ideal, with their AUC values all approaching 1.00. This demonstrates that the model has highly accurate recognition capability for calligraphy works of each dynasty. While the performance across all categories is exceptional, minor differences are still observable. For example, the ROC curves of Tang Dynasty Lishu (green) and Yuan Dynasty Lishu (purple) are closer to the upper-left corner than those of Han Dynasty Lishu (blue) and Qing Dynasty Lishu (orange), with their AUC values also being marginally higher.
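Curves of this kind are obtained by binarizing the four-class labels and computing a one-vs-rest ROC per dynasty; a short scikit-learn sketch is given below, with the class ordering assumed for illustration.

```python
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize


def per_class_roc(y_true, y_prob, class_names=("Han", "Tang", "Yuan", "Qing")):
    """One-vs-rest ROC curve and AUC for each dynasty, as in Figure 8."""
    y_bin = label_binarize(y_true, classes=list(range(len(class_names))))
    curves = {}
    for i, name in enumerate(class_names):
        fpr, tpr, _ = roc_curve(y_bin[:, i], y_prob[:, i])
        curves[name] = (fpr, tpr, auc(fpr, tpr))
    return curves
```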
Figure 9 shows the normalized confusion matrix of the proposed model on the test set. Overall, the values on the main diagonal are significantly higher than those on the off-diagonals—indicating the model has a robust ability to discriminate between the calligraphy styles of the four dynasties. Nevertheless, the matrix also reveals notable error patterns, as detailed below:
First, the most prominent confusion occurs between Han and Qing. Fifty-five Han samples were incorrectly classified as Qing, while four Qing samples were misclassified as Han. This mutual confusion strongly suggests a high degree of visual stylistic similarity between the two.
Second, a certain degree of confusion is also observed between Tang and Yuan, with 27 Tang samples misclassified as Yuan. Tang is characterized by rigorous regularity, while Yuan often integrates the brushwork of regular scripts. This confusion may stem from two factors: inherent structural commonalities between the two styles, or unrepresentative stylistic features in some samples of the dataset.
These findings not only validate the model’s effectiveness but also, more importantly, shed light on the model’s decision-making mechanism. They further offer directions for future optimization—for instance, developing more refined feature extraction strategies specifically for these easily confused dynasty style pairs.
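For reference, a row-normalized confusion matrix of the kind shown in Figure 9 can be reproduced from the test-set predictions with a few lines of scikit-learn, as sketched below; the class count and label ordering are assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def normalized_confusion(y_true, y_pred, n_classes=4):
    """Row-normalized confusion matrix: entry (i, j) is the fraction of
    true class i predicted as class j, as visualized in Figure 9."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes))).astype(float)
    return cm / cm.sum(axis=1, keepdims=True)
```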
This section presents the classification results for calligraphy samples (specifically Lishu) from four dynasties: Han, Tang, Yuan, and Qing. As shown in
Figure 10, the model demonstrates outstanding classification performance for samples from these four dynasties and is capable of accurately identifying unseen samples from the test set.
3.3. Discussion
As shown in
Table 4, a noteworthy finding requiring in-depth analysis is that ResNet-50 achieved the highest recall rate, whereas the conventional CNN model yielded the highest precision. This reflects the inherent characteristics of different model architectures: ResNet-50’s deep structure is more adept at “capturing” positive samples, yet this capability comes at the cost of potentially introducing more misclassifications; in contrast, the conventional CNN model is more “cautious,” with higher confidence levels for its positive predictions.
In comparison, the proposed model achieves a high-level balance across all metrics while maintaining exceptionally high accuracy and F1-score. This is attributable to the BAM attention mechanism’s precise focusing on key features, enabling overall performance that comprehensively outperforms the baseline models.
To evaluate the performance stability and statistical reliability of the attention-enhanced calligraphy image classification model, it was statistically validated with 10-fold cross-validation.
Table 6 shows the average performance indicators and standard deviations of the model over 10 cross-validations.
The statistical results show that the model's performance across the 10 cross-validation runs is stable at a high level of around 98%, with a standard deviation below 0.015. This indicates that the model demonstrates strong generalization ability and robustness, showing no significant fluctuations across different data partitions. The low standard deviation indicates that the model's performance is minimally affected by data partitioning, demonstrating good data adaptability without overfitting any specific subset. Both precision and recall remain high, further confirming the model's ability to effectively reduce false positives while maintaining an extremely low false negative rate.
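A sketch of the 10-fold protocol behind Table 6 is shown below: each fold trains a fresh model and the per-fold metrics are aggregated into a mean and standard deviation. The helpers build_model and train_eval are placeholders for the training and evaluation code, not functions from the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def cross_validate(build_model, train_eval, X, y, folds=10, seed=42):
    """Stratified k-fold protocol: train a fresh model per fold, then report
    the mean and standard deviation of each metric, as in Table 6."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    fold_scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_model()
        fold_scores.append(train_eval(model, train_idx, val_idx))  # dict of metrics
    return {
        key: (np.mean([s[key] for s in fold_scores]),
              np.std([s[key] for s in fold_scores]))
        for key in fold_scores[0]
    }
```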
3.4. Error Analysis
The error analysis shows that the small number of misclassifications mainly occurs on samples with ambiguous style boundaries or similar character structures. For example:
Han dynasty clerical script (han li) and Qing dynasty clerical script (qing li) share similar stylistic features, so some samples are confused between the two.
Some Tang dynasty clerical script characters (tang li) exhibit structural similarities to Han clerical script (han li), resulting in occasional misclassification.
These confusions have a certain calligraphic rationality and are closely related to the brushwork, structure, and evolution of the different calligraphy styles.
Although the misclassifications identified in the error analysis can be partially attributed to the historical continuity and stylistic proximity between certain dynasties, several methodological improvements could further reduce these errors and enhance the robustness of the proposed model. A key direction is the incorporation of character-level geometric normalization, as variations in stroke proportion, rubbing distortion, and carving depth may obscure essential structural cues; techniques such as skeleton extraction, affine-invariant preprocessing, or stroke-width standardization could help the model better capture dynasty-specific morphology. In addition, expanding the dataset with transitional or stylistically ambiguous samples—either through targeted data collection or controlled synthetic augmentation—would allow the classifier to learn finer boundaries where historical styles overlap. Strengthening the feature extractor with multi-scale or fine-grained modules may also mitigate confusions such as Han–Tang or Han–Qing, where differences lie in subtle stroke angularity and serif formation. Moreover, contrastive learning paradigms that explicitly enforce larger inter-dynasty feature margins could improve separability in the embedding space, particularly for historically adjacent styles. Finally, integrating domain-specific constraints from calligraphy studies, such as stroke entry/exit patterns or radical-level structural rules, may provide discriminative cues that conventional CNN architectures do not explicitly model. Together, these improvements offer promising pathways for addressing the observed misclassifications and further enhancing the system’s interpretive reliability.
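As one concrete instance of the contrastive idea mentioned above, the sketch below implements a supervised contrastive loss (in the style of Khosla et al., 2020) over dynasty labels. It is a possible auxiliary objective for enlarging inter-dynasty feature margins, not a component of the proposed model; the temperature and embedding source are assumptions.

```python
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull embeddings of the same dynasty together and push different
    dynasties apart, enlarging inter-class margins in the feature space."""
    z = F.normalize(embeddings, dim=1)                        # (B, D) unit vectors
    sim = z @ z.t() / temperature                             # pairwise similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    logits = sim.masked_fill(self_mask, float("-inf"))        # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-probability of the positive (same-dynasty) pairs per anchor.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return loss.mean()
```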