1. Introduction
As the digitization of global cultural heritage accelerates, museums and digital libraries have accumulated massive amounts of art image data. Efficiently organizing, retrieving, and analyzing such enormous digital visual archives has become a core challenge at the intersection of digital humanities and computer vision [1, 2]. As a foundational technology for automating art cataloging, art style classification also provides critical feature representations for art recommendation systems, computational aesthetic analysis, and artwork forgery detection [3]. In recent years, deep learning has largely replaced traditional methods based on hand-crafted features (e.g., color histograms, SIFT [4]), and deep convolutional neural networks (CNNs) have made significant progress in multiple art image analysis tasks. Other studies have demonstrated that cascaded multi-scale attention frameworks can significantly improve feature localization in challenging environments [5]. Similarly, the use of cross-modal collaboration to fuse complementary information has shown great promise in complex detection tasks [6]. However, unlike general object recognition in natural images (e.g., distinguishing cats from dogs), art style is a highly abstract concept that integrates the artist’s subjective intent, history, and complex structural frameworks. Current deep learning models still face a severe representational bottleneck when capturing the global aesthetic structural principles relied on by human experts.
The core reason for this bottleneck lies in the inherent “texture bias” of standard CNN architectures [7]. Extensive research indicates that CNN models pre-trained on large-scale natural image datasets (e.g., ImageNet) depend heavily on local, high-frequency texture patterns for their decision-making, rather than on global object shapes or structural contours [8]. In art applications, this bias is further amplified: due to the local receptive field of convolution operations, the network is extremely adept at extracting local features such as “brushstrokes”, “color gradients”, and “canvas textures”, but often ignores the global spatial layout and contours that define the core characteristics of many artworks. This representational imbalance causes the network to easily confuse art styles that share similar color palettes and brushstrokes but possess entirely different structural logics. To achieve robust and cognitively aligned art image understanding, there is an urgent need to shift from a purely texture-driven learning methodology to a structured learning paradigm that integrates explicit geometric priors [9].
This representational bias inherent in neural networks finds a close theoretical counterpart in art history. The art historian Heinrich Wölfflin proposed five pairs of fundamental concepts for artistic expression in his classic work, the most central of which is the dichotomy of “Linear vs. Painterly”. As illustrated in Figure 1, Wölfflin argued that artistic expression fluctuates between the “linear” mode and the “painterly” mode. The former (e.g., High Renaissance, Ukiyo-e) relies on clear boundaries, distinct outlines, and solid structural logic, while the latter (e.g., Impressionism, Abstract Expressionism) tends to convey atmosphere through colors, lighting, and blurred boundaries. From a computational perspective, current CNNs are essentially visual perception models highly biased toward the “painterly” mode, with limited perceptual capability for “linear” structures. Existing art classification methods mostly treat image style as a single feature for global pooling or self-attention computation [10], failing to explicitly introduce the “linear” structural prior, which defines the spatial logic of the artwork, into the network architecture.
To overcome the texture bias of CNNs and bridge the representational gap between feature extraction and aesthetic structure, we propose the Edge-Guided Spatial Attention Network (ESA-Net) as the technical implementation of Linear-Aware Attention. This network aims to explicitly reintroduce the structural priors of artworks into general deep learning models. Unlike existing methods that treat style as an indivisible characteristic, ESA-Net decouples visual representation into two complementary concepts based on Wölfflin’s theory: “semantic texture” (RGB stream) to capture local features, and “structural contours” (Edge stream) to reflect global logic. We utilize an edge detection operator to extract high-fidelity edge maps, which serve as color-independent “linear” constraints.
To efficiently fuse these two modalities, we designed a core component: the Edge-Guided Convolutional Block Attention Module (EG-CBAM). This module breaks the limitation of traditional attention modules that rely solely on internal feature statistics by treating the introduced structural edge map as an explicit gating signal. Through this module, the network is forced to recalibrate the spatial attention, focusing on salient regions with clear geometric contours while suppressing background texture noise generated by intense brushstrokes. Furthermore, addressing the common issues of “spatial aliasing” and edge fragmentation when aligning high-resolution edge maps with deep semantic features, we propose the Consecutive Average Pooling (CAP) strategy. Through multi-stage smooth downsampling, CAP preserves the topological coherence of the structural guidance signal without increasing the network’s parameter overhead.
We comprehensively evaluated ESA-Net on the large-scale WikiArt dataset, which is characterized by a high degree of class imbalance and a long-tail distribution. The experimental results demonstrate that ESA-Net achieves a top 1 accuracy of 69.40%, establishing a new state-of-the-art (SOTA) performance. In addition, qualitative analysis via Grad-CAM visualizations confirms that the model effectively broadens its decision-making basis, with its attention distribution closely aligning with the structural layouts favored by human art experts.
Distinct from conventional attention paradigms—such as Squeeze-and-Excitation (SE) networks or the vanilla Convolutional Block Attention Module (CBAM)—which rely on endogenous feature statistics to implicitly compute importance weights, the proposed ESA-Net introduces an exogenous structural constraint. By leveraging non-parametric geometric operators to inject external spatial priors, our framework shifts from a purely data-driven heuristic to a cognition-based structural attention regime. This approach ensures that the network’s focus is not merely an emergent property of internal activations but is explicitly governed by the formalist logic of artistic composition, thereby bridging the gap between deep learning engineering and art-theoretical principles.
The main contributions of this paper are summarized as follows:
Dual-Stream Feature Fusion Architecture: We establish a novel network paradigm that structurally separates color texture from physical contours. By introducing explicit edge information as a structural prior supplement to the CNN, it effectively mitigates the inherent “texture bias” of deep neural networks in art image processing.
Edge-Guided Attention Module (EG-CBAM): We propose a novel spatial attention module that utilizes pure edge priors as spatial signals to refine the model’s receptive focus. Combined with our proposed Consecutive Average Pooling (CAP) strategy, this module ensures that the model can resist the interference of high-frequency brushstroke noise and precisely locate the structural skeleton of artworks.
Superior Classification Performance and Theoretical Interpretability: ESA-Net achieves SOTA results superior to existing baseline models on the highly challenging WikiArt dataset. More importantly, visual explanations prove that the model’s decision-making logic is highly consistent with classical art historical theory (Wölfflin’s dichotomy), providing a theoretically grounded methodology for computational connoisseurship.
3. Methodology
As stated in Wölfflin’s classical theory of “Linear and Painterly”, art appreciation necessitates analysis from both linear and textural perspectives. Previous research indicates that while CNNs are adept at extracting semantic textures (aligning with the “painterly” style), they suffer from an inherent limitation known as “texture bias”. This bias implies that networks often over-rely on local color patterns and fail to process global structural contours (the “linear” logic) with the same success.
To bridge this art-historical theory with computational design, we formulate a conceptual framework that maps Wölfflin’s principles to our network architecture. To rectify the representational imbalance of conventional CNNs, we propose the Edge-Guided Spatial Attention Network (ESA-Net). By decoupling artistic representation into two formalized components, semantic texture (RGB stream) and structural contours (Edge stream), this dual-stream paradigm explicitly incorporates edge structural priors into the learning process, thereby achieving a theoretically grounded approach to art style classification.
To facilitate a precise mathematical formalization of the architecture, the key symbols and their respective dimensions are summarized in
Table 1. These notations are utilized consistently throughout the subsequent descriptions and align with the schematic representations in
Figure 2.
3.1. Architectural Overview
As illustrated in Figure 2a and detailed in Table 1, ESA-Net adopts a dual-stream structure to formalize and reconstruct two essential properties of artistic styles. The concept of decoupled feature encoders is increasingly used in modern neural architectures to isolate and capture independent trends within complex data [34]. In ESA-Net, we apply this principle to artistic images by decoupling color texture from structural contours. Regarding semantic texture, we aim to preserve texture details to the greatest extent by providing the backbone network with complete RGB image information. Regarding structural contours, we provide the neural network with precise edge information as a reference, thereby compelling the model to learn structural contour features. Specifically, given a complete artwork as input for the semantic texture stream (RGB stream), we extract its edges via an edge detection algorithm to serve as input for the edge contour stream (Edge stream).
A CNN is employed as the backbone network to extract deep features from the RGB stream. Specifically, the backbone network
is utilized to process the image to derive the deep semantic representation
, formulated as follows:
To achieve effective feature fusion, we propose the Edge-Guided Convolutional Block Attention Module (EG-CBAM). Unlike the standard CBAM, which relies solely on internal feature statistics, our module explicitly injects structural priors into the attention generation process. This module consists of three core stages:
Channel Attention: Identical to the conventional CBAM, this stage focuses on what features are meaningful by modeling the inter-dependencies among RGB image features, thereby emphasizing specific stylistic patterns and generating the refined features ;
Edge-Guided Spatial Attention: The traditional CBAM generates the spatial attention map solely by aggregating internal feature statistics to determine the regions that require more focus. In contrast, our module synthesizes these internal statistics with external edge features; by injecting contour features here, the network is more effectively guided to focus on contour regions. The experimental results, particularly the Grad-CAM visualizations, support this hypothesis;
Feature Integration: Utilizing a standard image classifier, this stage maps the deep features to the predicted style category y.
To provide a holistic view of the information flow within our proposed framework, the complete forward pass of ESA-Net—integrating edge extraction, dual-stream feature recalibration, and final classification—is formally synthesized in Algorithm 1.
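Algorithm 1 Inference pipeline of ESA-Net
Require: Input artwork
Ensure: Predicted style category y
// Step 1: Feature Decoupling
1: Apply the edge detection operator to the input artwork ▹ Extract structural edge prior
2: Pass the input artwork through the backbone network ▹ Extract deep semantic features
// Step 2: Edge-Guided Feature Recalibration
3: Channel Attention:
4: Compute the channel modulation vector from the globally average-pooled features
5: Multiply the features by the modulation vector ▹ Refine channel importance
6: Edge-Guided Spatial Attention:
7: Apply CAP to the edge map ▹ Downsample edge via CAP
8: Concatenate the channel-wise average-pooled map, the max-pooled map, and the downsampled edge map
9: Apply a convolution to the concatenated descriptor to obtain the spatial attention map
10: Multiply the channel-refined features by the spatial attention map ▹ Inject structural constraints
// Step 3: Classification
11: Apply Global Average Pooling to the refined features
12: Apply the Fully Connected layer to obtain y
13: return y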
3.2. Channel Attention
The channel attention module is designed to prioritize stylistic features such as specific color palettes while suppressing noise. Specifically, it evaluates the relative importance of different feature channels, thereby selectively enhancing task-relevant features and suppressing irrelevant ones. Since each channel in a neural network typically represents specific semantic attributes, this module can dynamically optimize feature representations by modeling cross-channel inter-dependencies.
As detailed in
Figure 2b, to enable the network to evaluate individual channels from a global perspective, the module first utilizes Adaptive Average Pooling to compress the spatial dimensions of the input feature
. This step eliminates local spatial details, aggregating the global information of each channel into a
descriptor. Subsequently, a Bottleneck structure is employed to learn the non-linear interactions among channels, outputting a modulation vector
that contains the importance weights for each channel:
where
and
denote the Sigmoid and ReLU activation functions, respectively. Two
convolutions,
and
, constitute the bottleneck structure:
performs dimensionality reduction with a reduction ratio
to lower computational complexity, while
is responsible for restoring the dimensionality to the original number of channels. The Sigmoid function
ensures that the generated weight values are normalized to the range of 0 to 1.
Finally, using the learned modulation vector
as weights, the original input feature
is recalibrated through element-wise multiplication (broadcasted along the spatial dimensions), yielding the final output feature
:
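The squeeze, bottleneck, and recalibration steps described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `channel_attention`, the weight names `w1` and `w2`, the toy tensor sizes, and the reduction ratio of 4 are all assumptions, and each 1×1 convolution is written as a bias-free matrix product because it acts on a 1×1 spatial descriptor.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1: (C // r, C) reduction; w2: (C, C // r) expansion."""
    # Adaptive average pooling to 1x1: one global descriptor per channel.
    squeezed = feat.mean(axis=(1, 2))              # (C,)
    # Bottleneck: reduce, apply ReLU, then restore the channel dimension.
    hidden = np.maximum(w1 @ squeezed, 0.0)        # (C // r,)
    weights = sigmoid(w2 @ hidden)                 # (C,), each value in (0, 1)
    # Recalibrate: broadcast the per-channel weights over the spatial axes.
    return feat * weights[:, None, None], weights

# Toy sizes (assumed for illustration only).
rng = np.random.default_rng(0)
C, H, W, r = 16, 7, 7, 4
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

refined, weights = channel_attention(feat, w1, w2)
print(refined.shape, weights.shape)
```

Because the weights depend only on globally pooled statistics, this stage changes how strongly each channel contributes but not where in the image the network looks; the spatial question is left to the edge-guided stage.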
3.3. Edge-Guided Spatial Attention
In computer vision architectures, integrating explicit structural information, such as object boundaries, into deep semantic features can significantly enhance the precise localization of target regions. This module aims to leverage the structural priors as a spatial gate to guide the model’s focus. As illustrated in the left part of
Figure 2d, to effectively align the high-resolution external edge map
with the corresponding internal semantic features
in the spatial domain, we introduce a mechanism termed Consecutive Average Pooling (CAP). This operation is formally defined as follows:
Rather than utilizing a single pooling operation with a large kernel size, which often leads to abrupt downsampling and loss of structural connectivity, CAP employs five sequential average pooling layers, each with
kernels. The objective of this consecutive downsampling strategy is to effectively suppress spatial aliasing. Spatial aliasing is a common degradation artifact in single-stage pooling, where high-frequency structural details, such as continuous thin edges, become disjointed or heavily distorted. By reducing the resolution step-by-step, CAP ensures a smoother transition, thereby preserving the integrity of the edge guidance. We discussed the benefits of this specific design in detail in
Section 4.4.
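A minimal NumPy sketch of the consecutive downsampling idea is given below. It assumes a 224×224 edge map, 2×2 kernels, and a stride of 2, none of which is recoverable from the text above; the function names are likewise illustrative.

```python
import numpy as np

def avg_pool_2x2(x):
    """Non-overlapping 2x2 average pooling with stride 2 on a 2D map."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]                  # drop an odd trailing row/column
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def consecutive_average_pooling(edge, stages=5):
    """Apply the pooling step `stages` times, halving the resolution each time."""
    for _ in range(stages):
        edge = avg_pool_2x2(edge)
    return edge

# Toy edge map: a single thin vertical contour (assumed input size 224x224).
edge = np.zeros((224, 224))
edge[:, 100] = 1.0

pooled = consecutive_average_pooling(edge)
print(pooled.shape)  # (7, 7)
```

Under these assumptions the five stages reduce the map by a factor of 32 per side, so the thin contour survives as a continuous column of small but non-zero responses in the 7×7 guidance map.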
The core target of this module is to identify and highlight salient spatial regions by merging inherent feature statistics (deep representations derived from the backbone network) with external structural features (the processed edge map
). To extract the spatial statistics from the semantic feature map
, we apply both average pooling and maximum pooling operations independently across the channel dimension. This process collapses the multi-channel tensor into two distinct single-channel descriptors: the average-pooled map captures the overall distribution of spatial activations, while the max-pooled map highlights the most discriminative spatial responses. Specifically, we aggregate
through these two pooling operations, and then concatenate the resulting maps with the structurally preserved edge feature
along the channel axis:
This concatenation constructs a unified 3-channel spatial descriptor
that simultaneously encapsulates background context, salient object responses, and explicit boundary priors. To generate the final spatial weights from this composite representation, a convolution is then applied to capture extensive spatial context and local inter-pixel relationships, yielding the attention map.
denotes a convolution operation with a kernel size of
:
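The construction of the 3-channel descriptor and the resulting attention map can be illustrated with a small NumPy sketch. The 7×7 kernel, the zero padding, and the final Sigmoid follow the vanilla CBAM convention and are assumptions here, as are the function names and the random toy inputs; the explicit loop is written for readability, not speed.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Cross-correlate x (C, H, W) with one kernel (C, k, k), zero-padded; returns (H, W)."""
    _, h, w = x.shape
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return out

def edge_guided_spatial_attention(feat, edge_small, kernel):
    """feat: (C, H, W) channel-refined features; edge_small: (H, W) downsampled edge map."""
    avg_map = feat.mean(axis=0)                            # overall activation distribution
    max_map = feat.max(axis=0)                             # most discriminative responses
    descriptor = np.stack([avg_map, max_map, edge_small])  # (3, H, W)
    # Sigmoid keeps the gate in (0, 1); assumed, as in vanilla CBAM.
    attn = 1.0 / (1.0 + np.exp(-conv2d_same(descriptor, kernel)))
    return feat * attn[None, :, :], attn

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 7, 7))
edge_small = rng.random((7, 7))
kernel = rng.standard_normal((3, 7, 7)) * 0.1              # assumed 7x7 kernel

gated, attn = edge_guided_spatial_attention(feat, edge_small, kernel)
print(gated.shape, attn.shape)
```

The only change relative to a purely internal spatial attention is the third input channel: the learned kernel can weight the edge map against the two pooled statistics at every location.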
The architectural rationale behind this design is to establish a structural information bottleneck that counteracts the inherent texture bias of CNNs. As argued by Geirhos et al. [
7], standard neural architectures predominantly rely on local texture statistics, which often prove insufficient or even misleading in the context of art style classification, where brushstroke patterns (painterly features) may overlap across disparate genres.
By injecting exogenous edge-based geometric constraints into the spatial attention mechanism, EG-CBAM forces the model to achieve structural alignment between the extracted semantic features and the underlying physical contours. While the channel attention module remains responsible for modeling the global importance of stylistic attributes (what features to focus on), the edge-guided spatial attention acts as a gatekeeper (where to focus), ensuring that the most salient activations are anchored to the artwork’s fundamental structural logic rather than high-frequency textural noise. This synergy facilitates a conditional feature selection mechanism: the network is encouraged to prioritize features that possess both stylistic significance and structural coherence, effectively operationalizing Wölfflin’s theory through a decoupled yet synergistic dual-stream optimization.
3.4. Classification
The final refined feature is computed by spatially re-weighting with . The classification result y is then derived:
By utilizing the learned spatial weight matrix as a gating signal, a weighted computation is applied to the primary feature maps . This process integrates the edge feature information into the feature vectors, ultimately generating a complete feature representation. In the subsequent stages, the complete feature representation is compressed into a denser information space, and the classification result is yielded through a fully connected layer.
Specifically, by performing element-wise multiplication (⊗) between
and
, followed by feature aggregation and projection, the classification result
y is derived:
Here, the Global Average Pooling (GlobalAvgPool) operation compresses the complete feature maps into a compact, one-dimensional feature vector. Subsequently, the Fully Connected (FC) layer serves as a linear classifier, mapping the global feature vector to the final categorical probability distribution y.
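As a rough illustration of this final stage, the sketch below applies global average pooling and a linear layer to a toy feature map. The softmax normalization, the weight names, and the feature sizes are assumptions made for the example; only the 27-way output matches the number of WikiArt styles used in this work.

```python
import numpy as np

def classify(refined, fc_weight, fc_bias):
    """refined: (C, H, W); fc_weight: (num_classes, C); fc_bias: (num_classes,)."""
    pooled = refined.mean(axis=(1, 2))         # Global Average Pooling -> (C,)
    logits = fc_weight @ pooled + fc_bias      # Fully Connected layer -> (num_classes,)
    exp = np.exp(logits - logits.max())        # numerically stable softmax (assumed)
    return exp / exp.sum()

rng = np.random.default_rng(0)
num_classes, C = 27, 16
refined = rng.standard_normal((C, 7, 7))
fc_weight = rng.standard_normal((num_classes, C)) * 0.1
fc_bias = np.zeros(num_classes)

probs = classify(refined, fc_weight, fc_bias)
predicted_style = int(np.argmax(probs))
print(probs.shape, predicted_style)
```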
In summary, the dual-stream architecture of ESA-Net is designed to process two complementary modalities of visual information. The RGB stream is responsible for extracting dominant textures and color distribution features. In contrast, the edge stream supplies a robust structural prior for feature identification and selection, forcing the model to attend to edge information. The proposed EG-CBAM block fuses this edge information with the CNN features and thereby supplements the network’s comprehension of structural contours, reflecting the importance of edge information in art style classification tasks.
4. Experiments
4.1. Dataset and Preprocessing
We evaluate our model on the WikiArt dataset [12], a large-scale dataset that spans a long historical period and a large number of art categories.
4.1.1. Dataset Statistics
The dataset contains 81,446 unique high-resolution images categorized into 27 distinct styles. As shown in Figure 3, the dataset exhibits a severe long-tail distribution. The Pareto curve explicitly illustrates this imbalance, revealing that the top 12 majority classes account for 80% of the cumulative data share. This distribution matters because of its direct impact on feature representation: it quantifies the risk of the model developing a bias toward the head classes. Because the feature space is overwhelmingly dominated by these few head classes, standard CNNs tend to optimize for their prevalent patterns (such as dominant textures), thereby neglecting the nuanced structural logic of the tail categories. Furthermore, the compositional breakdown in Figure 4 highlights that the top 10 styles account for approximately 76.7% of the entire dataset, with the remaining 17 minority styles comprising only 23.3% (labeled as “Others”). While mainstream styles such as Impressionism and Realism contribute over 20,000 samples each, minority genres like Action Painting and Analytical Cubism contain only a few hundred images or fewer. Such an intrinsically imbalanced data distribution poses a significant challenge for model performance.
4.1.2. Data Partitioning
To ensure experimental consistency and reproducibility, the dataset is partitioned into training and test sets at a 9:1 ratio using a fixed random seed. The final training set comprises 73,302 RGB-edge image pairs, while the test set contains 8,144 pairs. Each pair consists of an RGB image and its corresponding edge map.
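The reported split sizes are consistent with a simple seeded shuffle, as the following sketch suggests. It assumes an unstratified split, truncation when computing the test-set size, and the seed of 42 mentioned in Section 4.2; the helper name `split_indices` is hypothetical and the actual partitioning code may differ.

```python
import random

def split_indices(n_items, test_fraction=0.1, seed=42):
    """Shuffle indices with a fixed seed and cut off the first 10% as the test set."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # local RNG: does not touch global state
    n_test = int(n_items * test_fraction)      # truncation is an assumption
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(81_446)
print(len(train_idx), len(test_idx))  # 73302 8144
```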
4.1.3. Preprocessing and Edge Extraction
In line with existing studies, all images are resized to a standard resolution of pixels using bicubic interpolation. To generate the structural priors, we employ the Sobel operator for edge extraction. The selection of the Sobel operator over other edge detection methodologies is grounded in both technical robustness and theoretical alignment with art history.
Technically, the Sobel operator, as a first-order derivative-based gradient filter, offers superior stability compared to second-order operators like the Laplacian. While the Laplacian is highly sensitive to fine details, it tends to disproportionately amplify the high-frequency textural noise and local brushstroke artifacts inherent in high-resolution digitized artworks, leading to fragmented and chaotic edge maps. In contrast, the Sobel operator effectively suppresses such noise while preserving the continuity of salient structural boundaries.
Furthermore, although deep-learning-based detectors (e.g., HED [35] or PiDiNet [36]) excel in general computer vision tasks, they are typically trained on human-annotated natural-image datasets such as BSDS500. Consequently, these models are biased toward identifying “semantic object boundaries” in natural scenes. According to Wölfflin’s theory, artistic “linearity” does not strictly equate to the physical segmentation of objects; it encapsulates the artist’s subjective structural logic and formalist contours. By utilizing the Sobel operator, a non-parametric and mathematically transparent gradient tool, we ensure that the extracted structural constraints remain “semantically neutral.” This approach avoids the introduction of modern semantic biases, allowing the EG-CBAM module to focus purely on the geometric skeletons of the artworks as originally intended by the artists.
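For reference, a Sobel gradient-magnitude map can be computed with a few lines of NumPy, as sketched below. The grayscale conversion weights, the replicate padding at the borders, and the normalization to [0, 1] are assumptions for illustration; the preprocessing actually used may differ in these details.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel using replicate padding."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w))
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out

def sobel_edges(rgb):
    """rgb: (H, W, 3) array in [0, 1]; returns a gradient-magnitude map in [0, 1]."""
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # luminance weights (assumed)
    gx = filter3x3(gray, SOBEL_X)                  # horizontal intensity change
    gy = filter3x3(gray, SOBEL_Y)                  # vertical intensity change
    mag = np.hypot(gx, gy)
    peak = mag.max()
    return mag / peak if peak > 0 else mag

# Toy image: dark left half, bright right half, i.e. one vertical contour.
img = np.zeros((8, 8, 3))
img[:, 4:, :] = 1.0
edges = sobel_edges(img)
print(edges.shape)
```

The operator has no learned parameters, so the resulting map depends only on local intensity gradients and not on any notion of object category.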
4.2. Implementation Details
The proposed method is optimized using SGD with a momentum of and a weight decay of . The training process spans a total of 50 epochs with a batch size of 32. The initial learning rate is set to and adjusted with a cosine annealing scheduler. To ensure reproducibility, the random seed is fixed at 42 for all experiments. Computational efficiency is enhanced through the application of Automatic Mixed Precision (AMP). All experiments are conducted on a single NVIDIA RTX 5090 GPU, utilizing eight worker threads for accelerated data loading. The classification performance is quantitatively evaluated using top 1 and top 5 accuracy, alongside Precision, Recall, and F1-score for a comprehensive assessment. For qualitative interpretation of the model’s decision-making process, Grad-CAM is employed to visualize the regions critical to the final predictions.
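The shape of a cosine annealing schedule over the 50 training epochs can be sketched as follows. The initial learning rate of 0.01 and the minimum of 0 are placeholders chosen only for illustration, since the actual values are not reproduced above; the function name is likewise hypothetical.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Learning rate at a given epoch under a single cosine decay cycle."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_term

TOTAL_EPOCHS = 50          # from the training setup described above
BASE_LR = 0.01             # placeholder value, not the paper's setting

schedule = [cosine_annealing_lr(e, TOTAL_EPOCHS, BASE_LR) for e in range(TOTAL_EPOCHS)]
for epoch in (0, 12, 25, 37, 49):
    print(epoch, round(schedule[epoch], 6))
```

The rate starts at its initial value, decays slowly at first, passes through half of the initial value at the midpoint, and approaches the minimum toward the final epoch.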
4.3. Experimental Results and Comparison
Table 2 presents a comprehensive quantitative comparison of our proposed ESA-Net against several established baseline methods on the WikiArt dataset. All metrics are reported in percentages (%), with the best results highlighted in bold. The results demonstrate that ESA-Net achieves state-of-the-art performance on the majority of evaluation metrics.
Specifically, ESA-Net achieves a top 1 accuracy of 69.40%, a margin of 4.96 percentage points over the best-performing baseline by Wu et al. [40]. Although the method by Wu et al. obtains a marginally higher top 5 accuracy (96.58% vs. 95.97%), the substantial improvement in top 1 accuracy indicates that our model has a superior capability to pinpoint the exact artistic style.
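The distinction between the two accuracy measures can be made concrete with a short NumPy sketch; the function name and the three-class toy scores are illustrative assumptions. A prediction counts as a top k hit when the true label is among the k highest-scoring classes, so a model can rank the correct style second and still score on top 5 while missing on top 1.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """scores: (N, num_classes) class scores; labels: (N,) true class indices."""
    top_k = np.argsort(scores, axis=1)[:, -k:]         # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

scores = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.4, 0.5]])
labels = np.array([0, 2, 1])

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```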
Furthermore, the evaluations based on Precision, Recall, and F1-Score highlight the robustness of ESA-Net against the inherent long-tail distribution of the WikiArt dataset (detailed in Figure 3). As evidenced in Table 2, early CNN-based methods (e.g., Karayev et al. [11] and Tan et al. [15]) struggle to handle class imbalance, exhibiting a pronounced gap in which recall significantly exceeds precision. In contrast, ESA-Net significantly mitigates this issue, achieving a balanced performance with a Precision of 70.39% and a Recall of 68.01%. This balanced performance is consistent with our architectural design: by explicitly incorporating edge structure features to guide model learning, ESA-Net reduces the common tendency of networks to over-rely on dominant texture patterns prevalent in head classes. Instead, the model learns more fundamental, structural style representations, thereby maintaining high classification efficacy even for minority classes with limited samples.
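To make the role of these metrics under class imbalance concrete, the sketch below computes class-averaged Precision, Recall, and F1-Score from a confusion matrix. Macro averaging is an assumption here, since the averaging scheme is not stated above, and the two-class confusion matrix is a toy example rather than experimental data.

```python
import numpy as np

def macro_metrics(conf):
    """conf[i, j] = number of samples of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    predicted = conf.sum(axis=0).astype(float)     # column sums: predictions per class
    actual = conf.sum(axis=1).astype(float)        # row sums: samples per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    # Every class contributes equally, regardless of how many samples it has.
    return precision.mean(), recall.mean(), f1.mean()

# Toy long-tail case: a head class with 100 samples and a tail class with 10.
conf = np.array([[90, 10],
                 [5, 5]])
macro_p, macro_r, macro_f1 = macro_metrics(conf)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

Under macro averaging, the mean of per-class F1 values generally differs from the harmonic mean of the averaged Precision and Recall, so the three reported numbers need not satisfy the single-class F1 identity.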
4.4. Ablation Studies
We conducted a series of ablation studies on the WikiArt dataset, which not only demonstrate the superiority of ESA-Net but also validate our previous hypotheses regarding texture and contour. Specifically, we systematically investigated: (1) Selection of backbone network; (2) Generalizability of EG-CBAM; (3) Ablation of components; and (4) The effectiveness of the consecutive average pooling layers (CAP) strategy.
4.4.1. Selection of Backbone Network
The feature extraction capability of the backbone network largely determines the upper performance limit of artistic style classification tasks. To identify the most suitable foundational architecture for our proposed model, we comparatively evaluated three representative convolutional neural networks under the condition of integrating the same core modules: ResNet50 [41], DenseNet121 [42], and EfficientNet-B0 [43].
The detailed comparison results are presented in
Table 3. The empirical data demonstrate that EfficientNet-B0 achieved the optimal performance across four core evaluation metrics: top 1 Accuracy (69.40%), Precision (70.39%), Recall (68.01%), and F1-Score (68.80%). Although DenseNet121 holds a marginal advantage in top 5 Accuracy (96.14% vs. 95.97%), when comprehensively considering the model’s precise recognition capability for single style categories (evidenced by the significantly higher top 1 and F1-Score), we ultimately selected EfficientNet-B0 as the default backbone network for ESA-Net.
We attribute this superiority to the architectural alignment between EfficientNet-B0 and the intrinsic characteristics of artistic images. The Mobile Inverted Bottleneck Convolution (MBConv) utilizes depthwise separable convolutions to decouple spatial and channel feature extraction. Compared to ResNet50, this mechanism is significantly more efficient in capturing the discriminative, subtle brushstrokes and multi-scale textures prevalent in paintings.
In contrast, although DenseNet121 achieves a high top 5 accuracy via dense feature reuse, it lacks explicit attention guidance for targeted feature selection. ResNet50 similarly exhibits limited representational capacity for highly abstract stylistic features. Therefore, EfficientNet-B0 provides the optimal capability for feature selectivity and detail capture.
4.4.2. Generalizability of EG-CBAM
To validate the architectural adaptability and model-agnostic nature of the proposed EG-CBAM, we incorporated it into two distinct convolutional neural network paradigms with varying parameter scales: ResNet18 and our standard backbone, EfficientNet-B0. As demonstrated in Table 4, integrating EG-CBAM consistently enhances classification metrics across both backbones. Specifically, for ResNet18, the F1-Score increases from 55.82% to 57.02%. Similarly, when applied to EfficientNet-B0, the Precision and F1-Score improve by 1.67 and 1.17 percentage points, respectively. These empirical results indicate that EG-CBAM can effectively recalibrate feature representations and improve the discriminative capacity of diverse baseline models without requiring customized structural modifications.
Furthermore, ESA-Net exhibits distinct advantages in deployment efficiency and architectural generality when compared to recent state-of-the-art approaches, such as the work by Luo et al. [
39]. Luo et al. employ a knowledge distillation framework to elevate art classification accuracy. While effective, knowledge distillation inherently necessitates a cumbersome two-stage training pipeline, a strict reliance on a pre-trained high-capacity teacher model, and extensive computational overhead during the training phase. Consequently, its practical application is often constrained by these external structural dependencies. In contrast, the superiority of EG-CBAM lies in its self-contained feature refinement mechanism. Rather than relying on external knowledge transfer, EG-CBAM adaptively accentuates salient features along both spatial and channel dimensions intrinsically within the network. This internal feature enhancement paradigm allows EG-CBAM to be seamlessly embedded into existing feed-forward architectures, achieving competitive performance gains with negligible additional computational cost and eliminating the need for complex auxiliary training strategies.
4.4.3. Ablation of Components
To identify the source of the performance gain, specifically whether the improvement stems from the inherent feature recalibration capability of the attention module or from the introduction of external edge priors, we conducted a comprehensive ablation study on the network components. The original EfficientNet-B0, without any auxiliary modules, was established as the baseline (Base Model). We subsequently evaluated the classification performance of the baseline integrated with the vanilla CBAM [44] (Base + CBAM) and with our proposed Edge-Guided CBAM (Base + EG-CBAM).
The experimental results presented in
Table 5 reveal a counter-intuitive phenomenon: the direct incorporation of the vanilla CBAM module into the baseline model fails to yield the anticipated performance gains; rather, it leads to a degradation in generalization capability. Specifically, the addition of the vanilla CBAM decreases the top 1 accuracy from 69.25% to 68.65%, and the F1-Score from 67.63% to 67.26%. Conversely, improvements across key metrics are only observed when the external edge guidance is introduced. Notably, the Base + EG-CBAM model achieves a Precision of 70.39% and an elevated F1-Score of 68.80%.
These performance discrepancies support our “Texture Bias” hypothesis. Unlike natural images, artworks contain intense brushstrokes and complex textures. Lacking spatial constraints, the vanilla CBAM assigns high attention weights to this semantically irrelevant high-frequency noise, amplifying stylistic interference and degrading classification performance.
Conversely, EG-CBAM introduces explicit Structural Normalization via edge priors. By utilizing object contours as spatial constraints, it aligns the attention map with actual physical boundaries, suppressing the network’s tendency to overfit local textures. Ultimately, EG-CBAM successfully decouples “semantic texture” and “structural contours”, demonstrating that edge-guided attention is crucial for enhancing feature discriminability and generalization in texture-heavy artistic domains.
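To make the gating idea concrete, the following is a minimal sketch of an edge-conditioned spatial gate; it is not the EG-CBAM implementation itself. It assumes that the downsampled edge map is mixed with CBAM’s channel-average and channel-max descriptors by a per-pixel weighted sum, which stands in for the convolution used in CBAM. The function name and the weights w_avg, w_max, w_edge, and bias are hypothetical values chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def edge_guided_spatial_gate(features, edge, w_avg=1.0, w_max=1.0, w_edge=2.0, bias=-2.0):
    """Hypothetical edge-conditioned spatial gate.

    features: C x H x W nested lists; edge: H x W nested list with values in [0, 1].
    Returns (gated_features, attention_map).
    """
    channels = len(features)
    h, w = len(edge), len(edge[0])
    attention = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            values = [features[k][r][c] for k in range(channels)]
            avg_desc = sum(values) / channels  # channel-average descriptor, as in CBAM
            max_desc = max(values)             # channel-max descriptor, as in CBAM
            # Assumption: the edge prior enters as one extra term in the attention logit.
            logit = w_avg * avg_desc + w_max * max_desc + w_edge * edge[r][c] + bias
            attention[r][c] = sigmoid(logit)
    gated = [
        [[features[k][r][c] * attention[r][c] for c in range(w)] for r in range(h)]
        for k in range(channels)
    ]
    return gated, attention


# Two positions with identical features; only the second lies on a contour.
features = [[[0.5, 0.5]], [[0.5, 0.5]]]
edge = [[0.0, 1.0]]
gated, attention = edge_guided_spatial_gate(features, edge)
print([round(a, 3) for a in attention[0]])
```

With identical features at both positions, the position on a contour receives the larger attention weight, which is the behavior the edge prior is meant to encourage.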
4.4.4. The Effectiveness of the Consecutive Average Pooling Layers (CAP) Strategy
As illustrated in
Figure 5, the transformation of high-resolution edge maps into deep-seated edge features requires a carefully designed downsampling pipeline to ensure spatial and structural alignment. In this section, we provide a rigorous quantitative evaluation of our proposed Consecutive Average Pooling (CAP) strategy against three distinct regimes: the baseline (No Edge), single-step large-stride pooling, and learnable convolutional downsampling.
The empirical results in
Table 6 yield several critical insights. Most notably, the “Single Pooling Layer” approach, which uses a single large-stride operation to match the spatial dimensions as depicted in
Figure 5, achieved a top-1 accuracy of only 68.37%. This result is 0.88 percentage points lower than that of the baseline model without any edge information (69.25%). The degradation supports our “structural noise” hypothesis: aggressive, single-stage downsampling violates the Nyquist-Shannon sampling criterion, because the edge map is not sufficiently low-pass filtered before its resolution is reduced, and this leads to severe spatial aliasing. The aliasing breaks the thin, continuous contours of the edge map into disjointed, non-semantic artifacts, which misguide the attention mechanism and introduce destructive noise into the feature fusion process.
In contrast, the proposed CAP strategy, which employs a multi-stage smoothing approach (see
Figure 5), reverses this trend. It achieves a top-1 accuracy of 69.40% and an F1-score of 68.80%, a gain of 1.03 percentage points in accuracy and 1.28 percentage points in F1-score over the single-layer pooling alternative. Conceptually, the CAP strategy acts as a hierarchical low-pass filter. By decomposing the global downsampling task into five sequential
operations, it ensures that the topological connectivity of artistic outlines is preserved across varying scales.
Furthermore, although the “Convolutional Layer” approach introduces learnable parameters, it achieved a top-1 accuracy of only 67.96%, below even the baseline. This suggests that the extreme spatial sparsity of Sobel-extracted edges makes them ill-suited for parameterized kernels, which tend to overfit local noise. The superiority of CAP indicates that a parameter-free, smoothness-preserving downsampling scheme is the most robust of the tested methods for injecting structural priors into the EG-CBAM module, ensuring that the “Linear” constraints remain coherent and meaningful for the spatial attention gate.
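The aliasing argument can be illustrated with a toy edge map. The sketch below is an illustration under stated assumptions, not the configuration evaluated in Table 6: it assumes non-overlapping 2 × 2 average pooling with stride 2 for each CAP stage, and it models the single-step alternative as pure strided subsampling with no smoothing. All function names are hypothetical, and three stages are used instead of five to keep the example small.

```python
def avg_pool_2x2(grid):
    """Non-overlapping 2x2 average pooling with stride 2 on a 2D list."""
    h, w = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]


def consecutive_avg_pool(grid, stages):
    """Apply the 2x2 average pooling `stages` times in sequence."""
    for _ in range(stages):
        grid = avg_pool_2x2(grid)
    return grid


def strided_subsample(grid, stride):
    """Keep every `stride`-th pixel with no smoothing (assumed single-step variant)."""
    return [row[::stride] for row in grid[::stride]]


def vertical_line_edge_map(size, col):
    """A size x size map containing a one-pixel-wide vertical contour."""
    return [[1.0 if c == col else 0.0 for c in range(size)] for _ in range(size)]


edge = vertical_line_edge_map(16, 5)
cap = consecutive_avg_pool(edge, 3)   # 16 -> 8 -> 4 -> 2
dec = strided_subsample(edge, 8)      # 16 -> 2 in one step
print("CAP:       ", cap)
print("subsampled:", dec)
```

In this toy case the thin contour survives the staged averaging as a weak but spatially consistent response in the left column, whereas the unsmoothed single-step subsampling misses it entirely.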
4.5. Visual Interpretability via Grad-CAM
To verify that ESA-Net leverages the structural prior rather than relying solely on texture information, we employ Grad-CAM [
27] to visualize and analyze the model’s attention maps.
Figure 6 presents a comparative visualization between the baseline EfficientNet-B0 and our proposed model enhanced by the EG-CBAM module.
Because vanilla CNNs are texture-biased, the baseline’s attention tends to focus on regions with high luminance or on the primary semantic subjects. For instance, in the “High Renaissance” sample, the activation of the baseline model is localized on the figures’ faces. In contrast, the attention of ESA-Net is distributed more broadly across the torso and attends to the textures of the drapery. Likewise, for the “Ukiyo-e” sample, our method suppresses high-brightness background noise and focuses more accurately than the baseline on subjects with well-defined edges. These results indicate that the EG-CBAM module suppresses background noise effectively and confines the model’s attention to the structures defined by edge information.
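For reference, the Grad-CAM computation behind such heatmaps can be sketched in a few lines. The sketch assumes that the activations of the target convolutional layer and the gradients of the class score with respect to them have already been extracted (in practice through framework hooks); the function name and the toy tensors are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from C x H x W activations and class-score gradients."""
    channels = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients over all spatial positions.
    weights = [sum(sum(row) for row in gradients[k]) / (h * w) for k in range(channels)]
    # Weighted sum of activation maps, followed by ReLU.
    cam = [
        [max(0.0, sum(weights[k] * activations[k][r][c] for k in range(channels)))
         for c in range(w)]
        for r in range(h)
    ]
    # Normalize to [0, 1] for display when the map is not all zeros.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: channel 0 supports the class, channel 1 opposes it.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

Only the location activated by the class-supporting channel remains in the heatmap; the location driven by the opposing channel is removed by the ReLU.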
4.6. Cost Analysis
To quantify the computational overhead introduced by the structural prior and the attention mechanism, we evaluate the parameter volume and Floating Point Operations (FLOPs) of ESA-Net.
Table 7 provides a comparative analysis between the baseline backbones and their corresponding edge-guided versions (Base + EG-CBAM).
The data show that the Edge-Guided Spatial Attention module incurs a minimal computational penalty. For the EfficientNet-B0 backbone, adding the EG-CBAM module increases the parameter count by approximately 0.205 M (5.07%), while the FLOPs increase by a negligible 0.001 G. For the ResNet18 backbone, the FLOPs remain virtually unchanged at 16.747 G, with only a slight increase in parameters. This efficiency is achieved primarily because the edge extraction uses the non-parametric Sobel operator and the spatial alignment is handled by the parameter-free Consecutive Average Pooling (CAP) strategy. These results indicate that ESA-Net enhances classification accuracy without compromising the model’s suitability for resource-constrained deployment in digital heritage archives.
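As a quick consistency check on the EfficientNet-B0 figures quoted above, the absolute and relative increases together imply a baseline size. The short calculation below uses only the two numbers stated in the text; the resulting baseline and total are derived values, not entries copied from Table 7, and the function name is hypothetical.

```python
def implied_baseline(added_millions, relative_increase):
    """Baseline size (in millions) implied by an absolute and a relative increase."""
    return added_millions / relative_increase


added = 0.205      # M parameters added by EG-CBAM (from the text)
relative = 0.0507  # 5.07% relative increase (from the text)

baseline = implied_baseline(added, relative)
total = baseline + added
print(f"implied baseline: {baseline:.2f} M, with EG-CBAM: {total:.2f} M")
```

The stated figures therefore correspond to a baseline of roughly 4.04 M parameters and a total of roughly 4.25 M with the module attached.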
4.7. Supplementary Experiments
4.7.1. Stability Analysis and Reproducibility
To ensure the scientific rigor and reproducibility of the proposed ESA-Net, we conducted a stability analysis across multiple independent trials. While standard benchmarks in art style classification often report performance from a single execution, we evaluate the consistency of our model by performing three independent training runs using different random seeds (42, 123, and 456). All other experimental configurations, including hyperparameters, data partitioning, and the cosine annealing scheduler, remained strictly identical to those described in
Section 4.2.
The results of these trials, along with the calculated Mean and Standard Deviation (Mean ± Std), are summarized in
Table 8.
As evidenced by the data, the standard deviations of the core metric, top-1 accuracy, and of the F1-score are both low (the exact values are reported in Table 8). These results indicate that the performance improvements of ESA-Net are not the result of fortuitous weight initialization but stem from the structural guidance provided by the EG-CBAM module. The low variance across trials suggests that the proposed method is stable and reproducible for large-scale art analysis.
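The Mean ± Std summary over seeds can be computed as sketched below. The scores are placeholder values chosen for easy arithmetic, not the results in Table 8, and the function name is hypothetical. The sketch also shows that, with only three runs, the sample and population estimators of the standard deviation differ noticeably, so it is worth stating which one is reported.

```python
import statistics


def summarize(scores):
    """Return (mean, sample std, population std) for a list of per-seed scores."""
    mean = statistics.mean(scores)
    sample_std = statistics.stdev(scores)       # divides by n - 1
    population_std = statistics.pstdev(scores)  # divides by n
    return mean, sample_std, population_std


# Placeholder values for three seeds; these are NOT the paper's results.
placeholder_scores = [60.0, 61.0, 62.0]
mean, sample_std, population_std = summarize(placeholder_scores)
print(f"{mean:.2f} ± {sample_std:.2f} (sample) / ± {population_std:.2f} (population)")
```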
4.7.2. Comparison with Transformer-Based Architectures
To rigorously evaluate the architectural adaptability of the EG-CBAM module and determine whether explicit structural priors remain necessary alongside the implicit global modeling of Vision Transformers (ViTs), we substituted the CNN backbone with a Swin Transformer (Swin-T) [
16]. This comparison assesses if the self-attention mechanism in Transformers can bypass the need for edge-guided spatial attention.
As shown in
Table 9, ESA-Net (w/Swin-T) achieves a top-1 accuracy of 69.51%, the highest among the backbones we evaluated. This indicates that EG-CBAM is versatile, complementing the long-range dependency modeling of Transformers to further refine feature localization. However, the gain over the EfficientNet-B0 variant is a marginal 0.11 percentage points, suggesting that our explicit structural prior provides a discriminative signal robust enough to rival the implicit modeling of much larger Transformer models.
4.7.3. Training Dynamics and Efficiency Analysis
Beyond final accuracy, we analyzed the training stability and computational cost of the candidate backbones.
Figure 7 illustrates the convergence behavior over 50 epochs. While Swin-T converges relatively early (around epoch 30), its test accuracy exhibits higher volatility, likely due to the higher complexity of self-attention optimization on the long-tailed WikiArt dataset. In contrast, EfficientNet-B0 demonstrates a superior optimization profile, characterized by a smooth, monotonic decrease in loss and steady accuracy improvement.
Table 10 summarizes the efficiency metrics. Although ResNet18 is marginally faster per epoch, its F1-score is not competitive. Critically, the Swin-T variant entails a significant computational penalty: it has 6.6 times more parameters and requires 22.4% more training time per epoch than the EfficientNet-B0 version, yet its F1-score is no higher (68.65% vs. 68.80%). Consequently, EfficientNet-B0 is selected as the backbone for ESA-Net, providing a favorable balance of precision, stability, and resource economy for large-scale digital heritage applications.
4.7.4. Cross-Dataset Generalization
To evaluate the robustness and domain-invariance of the learned structural representations, we conducted a cross-dataset evaluation using the SemArt dataset. This dataset is particularly suitable for assessing generalization as its imagery is sourced exclusively from the Web Gallery of Art, thereby ensuring an absence of data overlap with WikiArt-based archives. Such a setup is critical to avoid the data contamination frequently encountered in other art benchmarks that share overlapping web sources.
We employed a linear probing protocol to assess the quality of the features extracted by the pre-trained ESA-Net. The backbone, previously optimized on the WikiArt dataset, was frozen to function as a static feature extractor, while a single linear classification layer was trained to categorize the ten artistic genres (Types) defined in SemArt. Without any fine-tuning of the backbone weights, the model achieved a top-1 accuracy of 69.97% on the SemArt test set. This performance under frozen-backbone transfer conditions indicates that the Edge-Guided Spatial Attention module captures structural regularities that are consistent across disparate digital archives. The result suggests that, by anchoring the attention mechanism to explicit geometric priors rather than dataset-specific textural artifacts, ESA-Net learns a stylistic representation that transfers to independent art domains.
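The linear probing protocol can be sketched as follows. This is a toy illustration in pure Python, not the training code used for the SemArt experiment: frozen_backbone is a hypothetical stand-in for the frozen feature extractor, the data are synthetic, and only the linear layer (weights and biases) is updated.

```python
import math


def frozen_backbone(x):
    """Hypothetical stand-in for the frozen feature extractor (no trainable state)."""
    return [x[0], x[1], x[0] * x[1]]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, biases, feat):
    logits = [sum(w * f for w, f in zip(row, feat)) + b for row, b in zip(weights, biases)]
    return softmax(logits)


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Train only a linear layer on fixed features with softmax cross-entropy (SGD)."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for feat, label in zip(features, labels):
            probs = predict(weights, biases, feat)
            for k in range(num_classes):
                grad = probs[k] - (1.0 if k == label else 0.0)  # d(loss)/d(logit_k)
                biases[k] -= lr * grad
                for j in range(dim):
                    weights[k][j] -= lr * grad * feat[j]
    return weights, biases


def accuracy(weights, biases, features, labels):
    correct = 0
    for feat, label in zip(features, labels):
        probs = predict(weights, biases, feat)
        correct += int(probs.index(max(probs)) == label)
    return correct / len(labels)


# Synthetic inputs for three classes; features come from the frozen extractor.
train_x = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (-1.0, -1.0), (-0.9, -1.0)]
train_y = [0, 0, 1, 1, 2, 2]
test_x = [(0.95, 0.05), (0.05, 0.95), (-0.95, -1.0)]
test_y = [0, 1, 2]

train_f = [frozen_backbone(x) for x in train_x]
test_f = [frozen_backbone(x) for x in test_x]
weights, biases = train_linear_probe(train_f, train_y, num_classes=3)
train_acc = accuracy(weights, biases, train_f, train_y)
test_acc = accuracy(weights, biases, test_f, test_y)
print(train_acc, test_acc)
```

Because the extractor is never updated, the probe’s accuracy reflects how linearly separable the frozen features already are for the new label set.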
5. Discussion
5.1. Error Analysis via Confusion Matrix
We further investigated the classification characteristics of ESA-Net through the normalized confusion matrix (
Figure 8). The model demonstrates high efficacy in styles characterized by clear structural logic, but faces challenges in styles primarily defined by color and texture.
Ukiyo-e (95%), Synthetic Cubism (91%), and Northern Renaissance (86%) all exhibit strong performance. Consistent with our architectural design, the model shows high classification accuracy in styles that rely heavily on lines and structural features.
Misclassifications, by contrast, are concentrated in categories dominated by color and texture, and fall into two patterns.
Lack of Formal Definition: Significant confusion (55%) exists between
Action Painting and
Abstract Expressionism. As visualized in the left panel of
Figure 9, both styles are characterized by spontaneous, gestural paint application rather than the depiction of physical forms. Consequently, the Sobel operator extracts chaotic, high-frequency noise derived from paint splatters and canvas textures rather than coherent object boundaries. Since our ESA-Net heavily relies on consistent contour structures (the “linear” prior) to guide the EG-CBAM module, the absence of stable geometric topologies in these abstract genres fundamentally neutralizes the advantage of structural guidance, causing the model to default back to learning unstructured textural noise.
Evolutionary Overlap: Moderate confusion was observed among
Impressionism,
Post-Impressionism, and
Fauvism. Rather than a simple computational error, this overlap macroscopically mirrors the authentic evolutionary logic of art history. As shown in the right panel of
Figure 9, there is a clear structural transition across these movements. Impressionism dissolves physical boundaries into fragmented, light-driven brushstrokes (resulting in a highly noisy and dense edge map). As art evolved chronologically into Post-Impressionism and eventually Fauvism, artists began to reintroduce distinct, explicit outlines (e.g., Cloisonnism), which is clearly reflected in the increasingly continuous and bold white contours in their respective edge maps. Because these styles share a continuous trajectory of structural reshaping, the discrete artificial labels of the dataset inherently conflict with their visual continuity, reducing class separability.
Unlike traditional general image classification tasks (such as distinguishing cats from dogs in ImageNet), where objective and absolute physical boundaries exist between categories, the taxonomy of artworks is inherently characterized by expert subjectivity and chronological ambiguity. Consequently, the model’s confusion matrix reflects the continuous evolutionary trajectory of art history rather than mere computational errors. For example, in the historical progression from Impressionism to Post-Impressionism and Fauvism, the mid-to-late works of masters like Cézanne or Van Gogh frequently integrate the atmospheric light-and-shadow textures of the former with the structural reshaping of the latter.
Thus, the classification confusion, or probability overlap, produced by ESA-Net among these highly correlated styles offers a valuable insight: the model avoids rigidly overfitting the discrete artificial labels of the dataset. Instead, it captures the historical continuity embedded within the visual features of these art movements. This further substantiates that by integrating Wölfflin’s theoretical priors, the representation space learned by ESA-Net not only maintains robust discriminative capacity but also macroscopically resonates with the authentic evolutionary logic of art history.
5.2. Quantitative Analysis of Texture Bias Reduction
To quantify the extent to which ESA-Net mitigates the inherent texture bias of CNNs, we perform a class-wise accuracy divergence analysis between genres aligned with Wölfflin’s “Linear” and “Painterly” paradigms. If the structural priors introduced by the EG-CBAM module are effectively utilized, the model should reach markedly higher accuracy in genres whose stylistic identity is rooted in geometric clarity, while showing the expected confusion in genres characterized by amorphous textures.
As evidenced by the quantitative data in the normalized confusion matrix (
Figure 8), our model achieves near-optimal discriminative power in “Linear” styles with well-defined contours:
Ukiyo-e (95%),
Synthetic Cubism (91%), and
Northern Renaissance (86%). These high per-class accuracies indicate that the structural guidance effectively anchors the network’s attention to the artwork’s skeletal layout rather than to local brushstroke noise.
In stark contrast, styles defined primarily by spontaneous color application and the absence of formal boundaries, where “Painterly” features dominate, exhibit a substantial degradation in classification accuracy. For instance, Action Painting yields an accuracy of only 45%, with a 55% probability of misclassification into Abstract Expressionism. This performance gap between structure-dominant and texture-dominant categories (more than 40 percentage points) is consistent with ESA-Net’s decision-making having shifted from a reliance on local texture statistics toward global structural logic. The divergence supports the view that the network has incorporated the “Linear” prior as a primary discriminative feature, thereby operationalizing a theoretically grounded reduction in texture bias.
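The size of this divergence follows directly from the per-class values quoted above. The short calculation below uses only those four figures; grouping them into “linear” and “painterly” sets follows the text, and the variable names are illustrative.

```python
# Per-class accuracies quoted in the text (diagonal of the normalized confusion matrix).
linear_styles = {"Ukiyo-e": 0.95, "Synthetic Cubism": 0.91, "Northern Renaissance": 0.86}
painterly_styles = {"Action Painting": 0.45}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Gap between group means, and the smallest gap between any linear and painterly class.
mean_gap = mean(linear_styles.values()) - mean(painterly_styles.values())
min_gap = min(linear_styles.values()) - max(painterly_styles.values())
print(f"mean gap: {100 * mean_gap:.1f} points, smallest pairwise gap: {100 * min_gap:.1f} points")
```

The smallest pairwise gap (Northern Renaissance versus Action Painting) is 41 percentage points, and the gap between group means is about 45.7 percentage points.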
5.3. Limitations and Future Directions
Despite the performance gains achieved by ESA-Net, the explicit integration of structural priors introduces specific theoretical and technical constraints that warrant further discussion.
Sensitivity to Exogenous Edge Noise. The efficacy of the EG-CBAM module is fundamentally contingent upon the fidelity of the extracted structural priors. In this study, the Sobel operator serves as a deterministic gradient filter; however, it exhibits susceptibility to high-frequency artifacts. For digitized artworks characterized by significant JPEG compression noise or physical degradation—such as surface craquelure, canvas aging, or pigment cracking—the edge detection process may inadvertently capture stochastic noise rather than meaningful stylistic contours. Such non-semantic signals can introduce “structural interference” into the spatial attention gate, potentially polluting the feature recalibration process and diminishing classification robustness in low-quality digital archives.
Generalization Bottlenecks in Amorphous Styles. Following Wölfflin’s “Linear vs. Painterly” dichotomy, the proposed architecture is inherently optimized for styles possessing identifiable geometric skeletons. However, its generalizability is constrained when encountering genres defined by “informalism” or a total absence of delineated forms. As evidenced by the performance deficit in categories like Color Field Painting and Abstract Expressionism, the structural branch faces a state of informational sparsity when boundaries are deliberately dissolved into tonal masses. In these instances, the “Linear-Aware” mechanism lacks stable anchors for spatial gating, causing the model to default to unstructured textural patterns, which limits the marginal utility of the dual-stream architecture.
Future Prospects. To mitigate these limitations, future research will focus on transitioning from static edge extraction to dynamic, content-aware methodologies. A promising trajectory involves the implementation of adaptive thresholding mechanisms or learnable gradient encoders that can distinguish between aesthetic contours and stochastic noise. Furthermore, exploring the integration of multi-scale structural descriptors may provide a more comprehensive representation for transitional styles that occupy the threshold between the linear and the painterly, thereby enhancing the model’s adaptability across the full spectrum of art history.
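The sensitivity to edge noise discussed in this subsection can be illustrated with a small Sobel example. The sketch below is a pure-Python illustration on toy images, not the preprocessing code of ESA-Net; the function name is hypothetical, and the isolated bright pixel is only a stand-in for a crack or compression artifact.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Gradient magnitude at interior pixels (no padding), returned as a 2D list."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


# A clean vertical contour: left half dark, right half bright.
step = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]

# A flat region with one isolated bright pixel (stand-in for a crack or artifact).
speck = [[0.0] * 5 for _ in range(5)]
speck[2][2] = 1.0

edge_response = max(max(row) for row in sobel_magnitude(step))
noise_response = max(max(row) for row in sobel_magnitude(speck))
print(edge_response, noise_response)
```

A single-pixel artifact already produces half the peak response of a genuine contour in this toy case, which is why a fixed gradient filter cannot separate the two without additional thresholding or learned filtering.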
6. Conclusions
In this paper, we introduce the Edge-Guided Spatial Attention Network (ESA-Net), a novel architecture designed to mitigate the “texture bias” in traditional convolutional neural networks (CNNs) for artistic style classification. By formalizing the problem through Heinrich Wölfflin’s “Linear vs. Painterly” theory, we introduce the Edge-Guided Convolutional Block Attention Module (EG-CBAM), which integrates edge priors with deep semantic features.
Experimental evaluations on the WikiArt dataset show that ESA-Net reaches a top-1 accuracy of 69.40% and an F1-score of 68.80%. Moreover, the qualitative analysis using Grad-CAM indicates that our model aligns its focus with the structural contours that define artistic genres. This alignment not only enhances classification robustness, particularly across long-tail distributions, but also lends empirical support to classical art historical theories.
Despite these advancements, analysis of the confusion matrix reveals persistent challenges in cases characterized by a “Lack of Formal Definition” and “Evolutionary Overlap,” where structural boundaries are deliberately obscured. Future research will explore the integration of multimodal features or more refined semantic texture extraction schemes to address these boundary cases.