5.2. Qualitative Results
CAMs are used here as an interpretation tool for understanding how the models recognize the depicted saints. Figure 10, Figure 11 and Figure 13 show the CAMs from the best-performing model in Experiments 1, 2 and 3, respectively, for all classes involved.
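To make the reading of the heatmaps concrete, the following minimal sketch shows how a Grad-CAM map is typically formed from the feature maps of the last convolutional layer and the gradients of the class score with respect to them. It is written in plain Python for illustration only: the function name grad_cam and the toy inputs are assumptions and do not reproduce the implementation used in the experiments.

```python
def grad_cam(feature_maps, gradients):
    """Toy Grad-CAM on nested lists (hypothetical helper, not the paper's code).

    feature_maps, gradients: lists of K channels, each an H x W nested list.
    gradients[k][i][j] is d(class score) / d(feature_maps[k][i][j]).
    """
    # Channel weight: global average of the gradient over all spatial positions.
    weights = []
    for grad in gradients:
        cells = [g for row in grad for g in row]
        weights.append(sum(cells) / len(cells))

    height, width = len(feature_maps[0]), len(feature_maps[0][0])

    # Weighted sum over channels, followed by ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(w * fmap[i][j] for w, fmap in zip(weights, feature_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Normalize to [0, 1] for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two 2x2 channels: the first supports the class, the second opposes it.
toy_maps = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
toy_grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(toy_maps, toy_grads)
```

Because of the final ReLU, a class whose channel-weighted sum is non-positive everywhere yields an all-zero (blank) map, even when the prediction itself is correct.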
Figure 10 includes six CAM images, one indicative example for each class, from the best-performing model of Experiment 1, VGG19. The CAMs show that VGG19 learns to classify Saints Nicholas, Raphael and Irene from their faces (Figure 10a) and Saint Athanasios the Great (Figure 10b) from the crosses on his clothes. Saint John the Baptist and Saint Demetrios were the two classes that the model confused; as can be seen in Figure 10c,d, both images show strong activation at the bottom left, where a second human head is depicted, as initially supposed. Saint Paisios is recognized by his black hat, while Apostle Peter and Apostle Paul are recognized from the church miniature that they hold between them.
Figure 11 includes 12 CAM images, one indicative example for each class, referring to the best performing model of Experiment 2, MobileNet.
From the CAMs in Figure 11a, we can see that MobileNet paid attention to all three figures to detect the Saints Nicholas, Raphael and Irene class. Saint Athanasios the Great (Figure 11b) was detected mainly from the crosses on his clothes, as in Experiment 1. Regarding the classes of Saint Demetrios and Saint George, the model failed to correctly classify the images of Saint Demetrios, attributing them to the Saint George class. Both saints appear on horseback with comparable poses and weapons, leading the model to focus on these shared attributes rather than on the subtle iconographic cues that differentiate them. As can be seen from the indicative CAMs in Figure 11c,d, the model pays attention to the horse, pose and weapon in both cases, which explains the quantitative results of Experiment 2 for these two classes. Saint John the Baptist (Figure 11e) is classified from his pose, as the model pays attention to the entire human figure. Saint Nicholas (Figure 11f) is recognized by his face and clothes. Note that, as seen in the confusion matrices of Figure 7, images of Saint Nicholas are misclassified as Saint Athanasios the Great. The CAM of Figure 11f reveals that the model learns features from the clothes of Saint Nicholas: both saints wear vestments with nearly identical cross patterns, which explains the overlapping Grad-CAM activations and the misclassification between these two classes.
The Grad-CAM visualizations of the imbalanced Experiment 2 further illustrate how class imbalance shaped the models’ internal representations. Minority classes with limited training samples (Saint Panteleimon, Apostle Peter and Apostle Paul, and Prophet Ilias) exhibit more diffuse activation maps, indicating that the model did not develop sufficiently discriminative features for these classes. These patterns show that the classification task is fundamentally fine-grained and that visual similarity and class imbalance jointly shape the observed errors.
In the case of Saint Paisios (Figure 11g), all testing images were correctly classified, yet there is no clear heatmap activation in either of the images. This suggests that the model still finds enough abstract cues to make the correct decision. MobileNet has a lightweight architecture and may therefore rely on non-localized features rather than on distinct regions within the image. Its depthwise separable convolutions significantly reduce the number of parameters and change how spatial features are captured. This can result in lower-resolution feature maps at later layers, which makes the CAM output faint or blank, especially for underrepresented classes.
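The parameter reduction can be illustrated with a short calculation. The sketch below compares a standard convolution with its depthwise separable counterpart; the kernel size and channel counts are assumed values chosen for illustration (they are not taken from the evaluated MobileNet), and bias terms are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights of a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights of a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Assumed layer shape: 3x3 kernel, 256 input channels, 256 output channels.
standard = standard_conv_params(3, 256, 256)
separable = depthwise_separable_params(3, 256, 256)
reduction = standard / separable
print(standard, separable, round(reduction, 2))
```

For this assumed layer, the separable variant needs 67,840 weights instead of 589,824, roughly 8.7 times fewer.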
To further investigate whether this behavior is specific to the lightweight MobileNet, the CAMs of the same class from more complex architectures, namely VGG19, ResNet201 and EfficientNet, are illustrated in Figure 12. The results show attention regions on the hat and robe of the saint, confirming the lower activation visibility of the lightweight MobileNet.
As for Saint Panteleimon (Figure 11h), activation is observed on the saint’s face and clothes, while for Apostle Peter and Apostle Paul (Figure 11i), attention is paid to the church miniature between them, as in Experiment 1. In Figure 11j, we can observe that the main characteristics that the model uses to classify Jesus Christ are his head pose and beard. Therefore, the misclassification of Saint John the Baptist as Jesus Christ may be attributed to the similar head pose of the two figures, who do not look straight ahead, as most of the other saints do, but tilt their heads slightly to the left. For the class of Mother of God and Jesus Christ (Figure 11k), the model captures both faces, as is also the case for the class of Prophet Ilias (Figure 11l). Yet, Prophet Ilias also tilts his head to the left, which explains why testing images from his class were misclassified as Saint John the Baptist in most cases (Figure 7).
Figure 13 includes eight CAM images, one indicative example for each class, referring to the best performing model of Experiment 3, DenseNet201.
In Experiment 3, the CAMs reveal the expected image regions where the model paid attention, similar to the previous two experiments. The CAMs for Saint George, Saint Demetrios and Saint Paisios, however, do not show clear activation regions. In some cases, correct predictions may occur via subtle activation patterns that are not concentrated strongly enough in one specific region for the CAMs to highlight, especially for well-separated classes, such as Saint Paisios and Saint George. For underrepresented classes, such as Saint Demetrios, the network may not learn strong, localized features. Indeed, in Experiment 3, DenseNet201 failed to correctly classify the images of Saint Demetrios, attributing them to the Saint George class. Moreover, the class of Saint John the Baptist was mainly misclassified as Jesus Christ; this can be explained from the CAMs of Figure 13d,g, where the model paid attention to the same hand gesture in both cases.
Overall, the CAMs presented for all three experimental setups show that pretrained models of various architectures, from dense to simple, fine-tuned on our datasets of noisy images of hand-painted icons under balanced, imbalanced and medium-imbalanced conditions, are capable of learning to discriminate between the classes by using the same visual cues as a human observer.
5.3. Discussion
In this work, a novel Christian Orthodox icon dataset has been presented and tested with 13 different deep architectures for the recognition of the depicted saints. Various experimental setups have been tested, overall indicating the ability of specific deep models to correctly classify imbalanced data, as well as noisy data, e.g., images that are poorly preserved, affected by illumination, or of different orientations. By providing experiments across balanced, imbalanced and medium-imbalanced datasets, we aimed to uncover how the composition of the data affects the models’ behavior.
The results offered a deeper understanding of performance trade-offs; the balanced dataset resulted in higher performance across all classes, the imbalanced dataset resulted in biased accuracy in favor of majority classes, and the medium-imbalanced dataset revealed the threshold at which the imbalanced classes begin to affect the models’ performance. Moreover, from the provided CAMs, it became clear that balanced setups offer richer learning, while in imbalanced setups, feature representation for minority classes tends to be underdeveloped.
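The bias toward majority classes is easiest to see when per-class recall is reported next to overall accuracy. The sketch below does this for a hypothetical two-class confusion matrix; the counts are invented for illustration and are not results of the experiments.

```python
def per_class_recall(confusion):
    """Recall per class; confusion[i][j] counts class-i images predicted as j."""
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


def overall_accuracy(confusion):
    """Fraction of all images that lie on the diagonal of the matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts: a majority class (100 images) and a minority class (10).
toy_confusion = [
    [90, 10],  # majority class: 90 correct, 10 wrong
    [8, 2],    # minority class: only 2 of 10 correct
]
recalls = per_class_recall(toy_confusion)
accuracy = overall_accuracy(toy_confusion)
balanced_accuracy = sum(recalls) / len(recalls)
```

For this toy matrix, overall accuracy is about 0.84, while the minority-class recall is only 0.2 and the mean of the per-class recalls is 0.55, which is why a single accuracy figure can hide poor minority-class performance.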
The results verify that deep learning and computer vision can help identify and categorize a wide range of different icons. Most of the models were properly trained and achieved a high percentage of correct predictions on the testing images, despite the complexity of the multi-class classification problem. The task of Christian Orthodox saint identification is very complex, as many saints have a variety of different depictions (e.g., Saint George is not always on horseback, and Jesus Christ may have a halo or wear a crown of thorns), while some saints share characteristics that make their icons very similar (e.g., Saint Demetrios and Saint George) and difficult to classify correctly, not only for AI but also for experienced humans.
The performance results of the models, as well as their CAM visualizations, leave room for future improvement and can be positively influenced by many configurations. First, collecting a larger amount of data from different saints in order to create a fully balanced dataset would help the models in their training and final performance. Therefore, future work includes the enrichment of our dataset with more images, as well as with more classes, making it the first and biggest publicly available dataset of Christian Orthodox saint icons. Considering that Christian Orthodox iconography relies heavily on symbolic attributes such as crosses and characteristic vestments, future research would also benefit from the use of explicit object-level detectors in the classification pipeline [34]. The fusion of whole-image classifiers with dedicated object detectors of key iconographic elements could potentially enhance the models’ ability to distinguish between visually similar classes [35].
While data augmentation was intentionally avoided in this work, future research could investigate data augmentation strategies, considering the highly imbalanced nature of the dataset. Specifically, fine-grained recognition with feature-level data augmentation [36] could be beneficial, since many images share similar visual cues, and discriminative characteristics play a key role in distinguishing fine-grained classes. While image-level data augmentation is commonly used in deep learning classification tasks, it is less effective in fine-grained problems because it randomly edits regions of the image and can thus destroy the discriminative characteristics found in subtle regions. Feature-level augmentation strategies could therefore be employed in future research to balance and enrich the dataset without risking the loss of discriminative details [37,38,39].
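As a rough illustration of what feature-level augmentation means, the sketch below synthesizes additional minority-class samples by interpolating between pairs of feature vectors of the same class. This is a generic interpolation scheme assumed here for illustration, not the method of the cited works; the function name, the toy vectors and the idea of operating on embeddings extracted by a frozen backbone are all assumptions.

```python
import random


def interpolate_features(features, n_new, seed=0):
    """Create n_new synthetic vectors on segments between same-class pairs.

    features: list of equal-length feature vectors from ONE class
    (hypothetically, embeddings extracted by a frozen backbone).
    """
    if len(features) < 2:
        raise ValueError("need at least two feature vectors to interpolate")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    synthetic = []
    for _ in range(n_new):
        first, second = rng.sample(features, 2)  # two distinct samples
        t = rng.random()                         # position along the segment
        synthetic.append([a + t * (b - a) for a, b in zip(first, second)])
    return synthetic


# Toy two-dimensional embeddings of a minority class (assumed values).
minority = [[0.0, 0.0], [1.0, 2.0]]
augmented = minority + interpolate_features(minority, n_new=5)
```

Since the new vectors are formed in feature space, no image region is cropped, erased or distorted, which is the property that motivates feature-level over image-level augmentation in fine-grained settings.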
Our long-term goal is to create a robust model able to handle the complex multi-class classification problem of saint recognition, so as to recognize saints despite iconographic variations across regions, as well as on poorly preserved old historical icons. Such a model would support the development of a digital tool for automatic saint categorization, useful for church cataloging, as an educational tool for theology students, or for personal use by the religious.
Note that the identification of Christian Orthodox saints is a fine-grained task that can challenge even trained art historians or theologians, especially when icons are partially damaged, stylistically diverse or share overlapping iconographic features. Theologians, especially non-expert ones, typically rely on broader cues and may struggle to distinguish details that require expert knowledge of vestments, symbolic attributes, etc. Thus, compared to the identification capacity of art historians or non-specialist theologians, the evaluated models achieve notable performance, considering the complexity of the task and the imbalanced dataset. The models correctly classified the majority of the test images and exhibited consistent patterns aligned with interpretable human reasoning, as shown by the Grad-CAM visualizations. In this context, the practical impact and potential of the proposed system as an assistive tool for cataloging, education or preliminary analysis are further highlighted.
From a deployment perspective, the evaluated model architectures differ significantly in computational cost and, thus, in suitability for real-world applications. Lightweight models, such as the MobileNet family, can offer fast inference and could be easily integrated into mobile applications [40], while heavier architectures could be employed for server-side processing [41]. Moreover, practical deployment should include a preprocessing pipeline to deal with varying lighting conditions, which are common in the field and affect image quality. By addressing such aspects, the proposed benchmark could be transformed into a practical and operational tool.
The use of AI on sacred content, however, may raise ethical and theological concerns [42,43]. Data acquisition should be respectful, ensuring that the religious significance of icons is protected, for example by relying on requested permissions and contextual awareness. Algorithmic representations and assumptions made by programmers need to be verified by theologians, making clear that the proposed method aims to be an assistive tool, complementary to the expertise of theologians and art historians. Deployment in real-world settings must prevent misuse, such as inappropriate commercial exploitation or trivialization of sacred images. The use and distribution of sacred imagery need to be carried out with full respect, in a way that does not offend the divine and the faithful, offering room for both ethical scientific exploration and practical applications. The involvement of religious and cultural heritage communities in the expansion of the dataset and in the design and evaluation of the proposed system could help ensure that technological innovations align with the values and traditions associated with sacred art.