Linear-Aware Attention: Enhancing Art Style Classification with Structural Edge Priors

Yu, Wanglong; Liu, Xuefeng

doi:10.3390/electronics15112314

by Wanglong Yu ¹ and Xuefeng Liu ²,*

¹ Shanghai Academy of Fine Arts, Shanghai University, Shanghai 200444, China
² School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2314; https://doi.org/10.3390/electronics15112314

Submission received: 17 April 2026 / Revised: 19 May 2026 / Accepted: 22 May 2026 / Published: 27 May 2026

(This article belongs to the Section Electronic Multimedia)

Abstract

While deep learning has achieved impressive success in art style classification, standard convolutional neural networks (CNNs) often exhibit a “texture bias”, prioritizing local brushstrokes and color patterns over the global structural logic essential for stylistic identification. Drawing inspiration from Heinrich Wölfflin’s “Linear and Painterly” theory, we propose the Edge-Guided Spatial Attention Network (ESA-Net) to bridge the gap between feature extraction and aesthetic structure. ESA-Net utilizes a dual-stream architecture that decouples artistic representation into semantic textures and structural contours. At its core, the proposed Edge-Guided Convolutional Block Attention Module (EG-CBAM) treats exogenous edge maps as spatial gates, recalibrating the model’s focus toward salient outlines while suppressing textural noise. Experimental results on the WikiArt dataset demonstrate that ESA-Net achieves a state-of-the-art top-1 accuracy of 69.40%. Qualitative visualizations via Grad-CAM further confirm that our model aligns its decision-making process with the structural layouts favored by human experts, providing a theoretically grounded approach to computational connoisseurship.

Keywords: art style classification; edge-guided attention; texture bias; structural priors; cultural heritage

1. Introduction

Driven by the accelerating digitization of global cultural heritage, museums and digital libraries have accumulated massive amounts of art image data. Efficiently organizing, retrieving, and analyzing these enormous digital visual archives has become a core challenge at the intersection of digital humanities and computer vision [1,2]. As a foundational technology for automating art cataloging, art style classification also provides critical feature representations for art recommendation systems, computational aesthetic analysis, and artwork forgery detection [3]. In recent years, the deep learning paradigm has largely replaced traditional methods based on hand-crafted features (e.g., color histograms, SIFT [4]), and deep convolutional neural networks (CNNs) have made significant progress in multiple art image analysis tasks. Other studies have demonstrated that cascaded multi-scale attention frameworks can significantly improve feature localization in challenging environments [5]. Similarly, the use of cross-modal collaboration to fuse complementary information has shown great promise in complex detection tasks [6]. However, unlike general object recognition in natural images (e.g., distinguishing cats from dogs), art style is a highly abstract concept that integrates the artist’s subjective intent, history, and complex structural frameworks. Current deep learning models still face a severe representational bottleneck when capturing the global aesthetic structural principles relied upon by human experts.

This bottleneck stems primarily from the inherent “texture bias” of standard CNN architectures [7]. Extensive research indicates that CNN models pre-trained on large-scale natural image datasets (e.g., ImageNet) depend heavily on local, high-frequency texture patterns for their decision-making, rather than on global object shapes or structural contours [8]. In art applications, this bias is further amplified: due to the local receptive field of convolution operations, the network is extremely adept at extracting local features such as “brushstrokes”, “color gradients”, and “canvas textures”, but often ignores the global spatial layout and contours that define the core characteristics of many artworks. This representational imbalance causes the network to easily confuse art styles that share similar color palettes and brushstrokes but possess entirely different structural logics. To achieve robust and cognitively aligned art image understanding, there is an urgent need to shift from a purely texture-driven learning methodology to a structured learning paradigm that integrates explicit geometric priors [9].

This representational bias inherent in neural networks finds profound theoretical support in art history. The art historian Heinrich Wölfflin proposed five pairs of fundamental concepts for artistic expression in his classic work, the most central of which is the dichotomy of “Linear vs. Painterly”. As illustrated in Figure 1, Wölfflin argued that artistic expression fluctuates between the “linear” mode and the “painterly” mode. The former (e.g., High Renaissance, Ukiyo-e) relies on clear boundaries, distinct outlines, and solid structural logic; while the latter (e.g., Impressionism, Abstract Expressionism) tends to convey atmosphere through colors, lighting, and blurred boundaries. From a computational perspective, current CNNs are essentially visual perception models highly biased toward the “painterly” mode, lacking perceptual capability for “linear” structures. Existing art classification methods mostly treat image style as a single feature for global pooling or self-attention computation [10], failing to explicitly introduce the “linear” structural prior, which defines the spatial logic of the artwork, into the network architecture.

To overcome the texture bias of CNNs and bridge this representational gap, we propose the Edge-Guided Spatial Attention Network (ESA-Net) as the technical implementation of Linear-Aware Attention. This network aims to explicitly reintroduce the structural priors of artworks into general deep learning models. Unlike existing methods that treat style as an indivisible characteristic, ESA-Net decouples visual representation into two complementary concepts based on Wölfflin’s theory: “semantic texture” (RGB stream) to capture local features, and “structural contours” (Edge stream) to reflect global logic. We utilize an edge detection operator to extract high-fidelity edge maps, which serve as color-independent “linear” constraints.

To efficiently fuse these two modalities, we designed a core component: the Edge-Guided Convolutional Block Attention Module (EG-CBAM). This module breaks the limitation of traditional attention modules that rely solely on internal feature statistics by treating the introduced structural edge map as an explicit gating signal. Through this module, the network is forced to recalibrate the spatial attention, focusing on salient regions with clear geometric contours while suppressing background texture noise generated by intense brushstrokes. Furthermore, addressing the common issues of “spatial aliasing” and edge fragmentation when aligning high-resolution edge maps with deep semantic features, we propose the Consecutive Average Pooling (CAP) strategy. Through multi-stage smooth downsampling, CAP preserves the topological coherence of the structural guidance signal without increasing the network’s parameter overhead.

We comprehensively evaluate ESA-Net on the large-scale WikiArt dataset, which is characterized by a high degree of class imbalance and a long-tail distribution. The experimental results demonstrate that ESA-Net achieves a top-1 accuracy of 69.40%, establishing a new state-of-the-art (SOTA) performance. In addition, qualitative analysis via Grad-CAM visualizations confirms that the model effectively broadens its decision-making basis, with its attention distribution closely aligning with the structural layouts favored by human art experts.

Distinct from conventional attention paradigms—such as Squeeze-and-Excitation (SE) networks or the vanilla Convolutional Block Attention Module (CBAM)—which rely on endogenous feature statistics to implicitly compute importance weights, the proposed ESA-Net introduces an exogenous structural constraint. By leveraging non-parametric geometric operators to inject external spatial priors, our framework shifts from a purely data-driven heuristic to a cognition-based structural attention regime. This approach ensures that the network’s focus is not merely an emergent property of internal activations but is explicitly governed by the formalist logic of artistic composition, thereby bridging the gap between deep learning engineering and art-theoretical principles.

The main contributions of this paper are summarized as follows:

Dual-Stream Feature Fusion Architecture: We establish a novel network paradigm that structurally separates color texture from physical contours. By introducing explicit edge information as a structural prior supplement to the CNN, it effectively mitigates the inherent “texture bias” of deep neural networks in art image processing.
Edge-Guided Attention Module (EG-CBAM): We propose a novel spatial attention module that utilizes pure edge priors as spatial signals to refine the model’s receptive focus. Combined with our proposed Consecutive Average Pooling (CAP) strategy, this module ensures that the model can resist the interference of high-frequency brushstroke noise and precisely locate the structural skeleton of artworks.
Superior Classification Performance and Theoretical Interpretability: ESA-Net achieves SOTA results on the highly challenging WikiArt dataset, outperforming existing baseline models. More importantly, visual explanations indicate that the model’s decision-making logic is highly consistent with classical art historical theory (Wölfflin’s dichotomy), providing a theoretically grounded methodology for computational connoisseurship.

2. Related Work

2.1. Art Style Classification

The computational analysis of art styles has a long history, progressing from basic feature matching to complex deep semantic understanding. Early research in this field relied primarily on heuristic feature extraction, utilizing low-level visual features such as color histograms, Gabor filters, and the Scale-Invariant Feature Transform (SIFT) to capture fine detail in brushstrokes and color [11]. While these methods established a computational foundation for art classification and analysis, they depended heavily on empirical rules and specialized knowledge in computer graphics and therefore lacked flexibility. The advent of large-scale annotated datasets (such as WikiArt) marked a turning point, shifting the field from an algorithm-driven to a data-driven paradigm [12].

This shift was further accelerated by the arrival of deep learning. Early work explored whether deep representations trained on general object recognition tasks could be effectively transferred to the domain of art [13]. With the advancement of deep neural networks, methods involving fine-tuned CNNs have achieved substantial performance gains, as these models can automatically learn texture features corresponding to stylistic elements [14,15]. More recently, the landscape of visual representation has been expanded by Vision Transformers (ViTs) and Swin Transformers [16], which utilize multi-head self-attention mechanisms to implicitly capture the global structural logic of images, providing a more holistic perspective than the local receptive fields of CNNs.

Parallel to these developments, image processing has witnessed significant progress through Diffusion-based models. In the field of super-resolution (SR), advanced frameworks such as TTST [17] and EDiffSR [18] have demonstrated the powerful capability of these architectures for structural reconstruction. While structural reconstruction in SR tasks focuses on restoring high-fidelity geometric details to ensure pixel-level consistency, it shares a fundamental objective with the structural feature extraction in our art classification task: the identification and preservation of the underlying skeletal layout. However, whereas SR models leverage these priors for explicit image restoration, our proposed ESA-Net utilizes them as high-level stylistic constraints to navigate the “Linear vs. Painterly” dichotomy.

Concurrently, State Space Models, notably Mamba-based architectures, have redefined how global context is captured; for instance, PCa-Mamba [19] leverages state-space modeling to efficiently capture long-range dependencies across complex imagery. Despite these successes, the inherent “texture bias” of standard backbones remains a critical bottleneck. ESA-Net distinguishes itself by explicitly injecting geometric constraints through structural edge priors to address this challenge, ensuring that the decision-making process is aligned with the formalist structures favored by art historians.

To address the distinctive characteristics of artworks, model architectures have moved beyond single-backbone networks. For instance, Mao et al. [20] proposed a framework for learning shared embeddings across different artistic tasks, while Garcia et al. [21] introduced context-aware embeddings to integrate metadata such as artist identity and creation period. Furthermore, Sun et al. [22] developed a dual-path convolutional network to analyze content and style features in parallel. However, most of these approaches treat style as a holistic characteristic, and the explicit integration of structural priors (particularly the “linear” structure that defines the spatial logic of a work) remains an under-explored research direction. Recent work on style transfer shows that preserving the content structure of the source domain is far more important than merely synthesizing textures and is crucial for generating visually credible artistic images [23]. ESA-Net addresses this deficiency in structural consistency by reintroducing edge-based structural constraints into the deep learning pipeline.

2.2. Visual Explanation and Interpretability

Understanding the decision-making process of deep neural networks is essential. In art-related tasks, this is particularly important as the nuances of structure and texture play a significant role in classification. Early visual explanation methods primarily focused on backpropagation-based saliency maps, such as Saliency Maps [24] and Guided Backpropagation [25]. These techniques highlight influential regions by calculating the gradient of the output score with respect to input pixels. However, the resulting visualizations are often fragmented and contain noise, lacking the coherent structural information necessary to validate art-theoretical hypotheses.

Another more robust research trajectory centers on Class Activation Mapping (CAM) [26], which pioneered the use of Global Average Pooling (GAP) to project class weights back onto convolutional feature maps. To overcome the architectural limitations of GAP, Grad-CAM [27] was proposed, utilizing the gradients of the target concept flowing into the final convolutional layer to generate localized heatmaps that highlight discriminative regions. Subsequent refinements, such as Grad-CAM++ [28] for pixel-level localization and Score-CAM [29] for gradient-free activation, have further enhanced the reliability of these heatmaps. Moreover, Layer-CAM [30] demonstrated that integrating multi-scale hierarchical features enables precise visualization of fine-grained textures.

The advancement of these interpretation tools provides a practical pathway for understanding the reasoning behind neural network decisions. In art style classification, these tools make it possible to validate the feasibility of our proposed approach: decoupling artistic representation into edge-based structural information and color-based textural information.

2.3. Heinrich Wölfflin’s Theory: Linear vs. Painterly

The theoretical foundation of the proposed method is deeply rooted in formalist art history, particularly the dichotomy of the “linear” (zeichnerisch) versus the “painterly” (malerisch) proposed by Heinrich Wölfflin in 1915 [31]. Wölfflin posited that the “linear” style delineates objects through clear, tactile contours and structural logic (e.g., Ukiyo-e or Renaissance works), utilizing explicit and continuous edges to guide the viewer’s gaze. Conversely, the “painterly” style (e.g., Impressionism) merges objects with their surroundings through loose brushstrokes and atmospheric effects. In its visual representation, it prioritizes light-and-shadow textures as well as mass, rather than rigid boundaries [32].

To a certain extent, this binary opposition mirrors the representational challenges faced by standard Convolutional Neural Networks (CNNs), namely their inherent “texture bias” [7,9]. Because convolution operations perform well at capturing local, high-frequency signals (such as color gradients), these models naturally gravitate toward extracting “painterly” features, while exhibiting a pronounced perceptual blind spot toward global “linear” contours [8].

Previous computational approaches to art analysis (such as early neural style transfer [10]) have largely treated style as a holistic, texture-driven metric, thereby neglecting the essential geometric structural foundation. Although some studies have attempted to bridge this gap by algorithmically quantifying Wölfflin’s concepts [33], how to explicitly integrate decoupled structural priors into network architectures remains under-explored.

3. Methodology

As stated in Wölfflin’s classical theory of “Linear and Painterly”, art appreciation necessitates analysis from both linear and textural perspectives. Previous research indicates that while CNNs are adept at extracting semantic textures (aligning with the “painterly” style), they suffer from an inherent limitation known as “texture bias”. This bias implies that networks often over-rely on local color patterns and fail to process global structural contours (the “linear” logic) with the same success.

To connect this art-historical theory with computational design, we formulate a conceptual framework that maps Wölfflin’s principles to our network architecture. To rectify the representational imbalance caused by traditional CNNs, we propose the Edge-Guided Spatial Attention Network (ESA-Net). By decoupling artistic representation into two formalized components, semantic texture (RGB stream) and structural contours (Edge stream), this dual-stream paradigm explicitly incorporates edge structural priors into the learning process, thereby achieving a theoretically grounded approach to art style classification.

To facilitate a precise mathematical formalization of the architecture, the key symbols and their respective dimensions are summarized in Table 1. These notations are utilized consistently throughout the subsequent descriptions and align with the schematic representations in Figure 2.

3.1. Architectural Overview

As illustrated in Figure 2a and detailed in Table 1, ESA-Net adopts a dual-stream structure to formalize and reconstruct two essential properties of artistic styles. The concept of decoupled feature encoders is increasingly used in modern neural architectures to isolate and capture independent trends within complex data [34]. In ESA-Net, we apply this principle to artistic images by decoupling color texture from structural contours. Regarding semantic texture, we aim to preserve texture details to the greatest extent by providing the backbone network with complete RGB image information. Regarding structural contours, we provide the neural network with precise edge information as a reference, thereby compelling the model to learn structural contour features. Specifically, given a complete artwork $I \in \mathbb{R}^{3 \times H \times W}$ as input for the semantic texture stream (RGB stream), we extract its edges $E \in \mathbb{R}^{H \times W}$ via an edge detection algorithm to serve as input for the edge contour stream (Edge stream).

A CNN is employed as the backbone network to extract deep features from the RGB stream. Specifically, the backbone network $M(\cdot)$ processes the image to derive the deep semantic representation $X \in \mathbb{R}^{C' \times H' \times W'}$, formulated as follows:

$$X = M(I). \tag{1}$$

To achieve effective feature fusion, we propose the Edge-Guided Convolutional Block Attention Module (EG-CBAM). Unlike the standard CBAM, which relies solely on internal feature statistics, our module explicitly injects structural priors into the attention generation process. This module consists of three core stages:

Channel Attention: Identical to the conventional CBAM, this stage focuses on what features are meaningful by modeling the inter-dependencies among RGB image features, thereby emphasizing specific stylistic patterns and generating the refined features $F$;

Edge-Guided Spatial Attention: The traditional CBAM generates the spatial attention map by aggregating internal feature statistics to determine the regions that require more focus. In contrast, our module generates the spatial attention map $W_{s}$ by synthesizing these internal statistics with external edge features $F_{E}$. By injecting contour features here, the network is more effectively guided to focus on contour regions. The experimental results, particularly the Grad-CAM visualizations, support this hypothesis;

Feature Integration: Utilizing a standard image classifier, this stage maps the refined deep features to the predicted style category $y$.

To provide a holistic view of the information flow within our proposed framework, the complete forward pass of ESA-Net—integrating edge extraction, dual-stream feature recalibration, and final classification—is formally synthesized in Algorithm 1.

Algorithm 1 Inference pipeline of ESA-Net

Require: Input artwork $I \in \mathbb{R}^{3 \times H \times W}$
Ensure: Predicted style category $y$
// Step 1: Feature Decoupling
1: $E = \mathrm{Sobel}(I)$ ▹ Extract structural edge prior
2: $X = \mathrm{Backbone}(I)$ ▹ Extract deep semantic features
// Step 2: Edge-Guided Feature Recalibration
3: Channel Attention:
4: $W_{c} = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(X)) + \mathrm{MLP}(\mathrm{MaxPool}(X)))$
5: $F = X \otimes W_{c}$ ▹ Refine channel importance
6: Edge-Guided Spatial Attention:
7: $F_{E} = \mathrm{CAP}(E)$ ▹ Downsample edge via CAP
8: $F_{concat} = [\mathrm{Avg}(F); \mathrm{Max}(F); F_{E}]$
9: $W_{s} = \sigma(\mathrm{Conv}_{7 \times 7}(F_{concat}))$
10: $F_{out} = F \otimes W_{s}$ ▹ Inject structural constraints
// Step 3: Classification
11: $v = \mathrm{GlobalAvgPool}(F_{out})$
12: $y = \mathrm{Softmax}(\mathrm{FC}(v))$
13: return $y$
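The tensor bookkeeping behind Algorithm 1 can be sketched as a shape trace. This is only an illustration under stated assumptions: the backbone is taken to have a total stride of 32 and $C' = 2048$ output channels (typical of ResNet-50-style networks, but not stated in the text), CAP is assumed to use stride-2 pooling, and the edge map is given an explicit channel axis. The function name `trace_shapes` and all of its parameters are hypothetical.

```python
def trace_shapes(h=480, w=480, channels=2048, backbone_stride=32,
                 cap_stages=5, num_classes=27):
    """Return the tensor shape produced at each step of Algorithm 1.

    Only shapes are tracked; no pixel values are computed.
    channels and backbone_stride are assumptions, not values from the paper.
    """
    shapes = {}
    shapes["I"] = (3, h, w)                      # input artwork
    shapes["E"] = (1, h, w)                      # step 1: Sobel edge map
    hp, wp = h // backbone_stride, w // backbone_stride
    shapes["X"] = (channels, hp, wp)             # step 2: backbone features
    shapes["W_c"] = (channels, 1, 1)             # step 4: channel weights
    shapes["F"] = (channels, hp, wp)             # step 5: X (*) W_c

    # Step 7: CAP halves the edge map once per stage (stride 2 assumed).
    he, we = h, w
    for _ in range(cap_stages):
        he, we = he // 2, we // 2
    shapes["F_E"] = (1, he, we)
    if (he, we) != (hp, wp):
        raise ValueError("CAP output does not match the backbone feature size")

    shapes["F_concat"] = (3, hp, wp)             # step 8: [Avg(F); Max(F); F_E]
    shapes["W_s"] = (1, hp, wp)                  # step 9: spatial weights
    shapes["F_out"] = (channels, hp, wp)         # step 10: F (*) W_s
    shapes["v"] = (channels,)                    # step 11: global average pool
    shapes["y"] = (num_classes,)                 # step 12: class probabilities
    return shapes


shapes = trace_shapes()
```

Under these assumptions, five halvings of a 480-pixel side (480, 240, 120, 60, 30, 15) land on the same 15 × 15 grid as a stride-32 backbone, which is what allows $F_{E}$ to be concatenated with the pooled maps of $F$.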

3.2. Channel Attention

The channel attention module is designed to prioritize stylistic features such as specific color palettes while suppressing noise. Specifically, channel attention evaluates the relative importance of different feature channels, thereby selectively enhancing task-relevant features and suppressing irrelevant ones. Since each channel in a neural network typically represents specific semantic attributes, this module can dynamically optimize feature representations by modeling cross-channel inter-dependencies.

As detailed in Figure 2b, to enable the network to evaluate individual channels from a global perspective, the module first utilizes Adaptive Average Pooling to compress the spatial dimensions of the input feature $X$. This step eliminates local spatial details, aggregating the global information of each channel into a $1 \times 1$ descriptor. Subsequently, a bottleneck structure is employed to learn the non-linear interactions among channels, outputting a modulation vector $W_{c}$ that contains the importance weights for each channel:

$$W_{c} = \sigma(\mathrm{Conv}_{2}(\delta(\mathrm{Conv}_{1}(\mathrm{AvgPool}(X))))) \tag{2}$$

where $\sigma$ and $\delta$ denote the Sigmoid and ReLU activation functions, respectively. Two $1 \times 1$ convolutions, $\mathrm{Conv}_{1}$ and $\mathrm{Conv}_{2}$, constitute the bottleneck structure: $\mathrm{Conv}_{1}$ performs dimensionality reduction with a reduction ratio $r = 16$ to lower computational complexity, while $\mathrm{Conv}_{2}$ is responsible for restoring the dimensionality to the original number of channels. The Sigmoid function $\sigma$ ensures that the generated weight values are normalized to the range of 0 to 1.

Finally, using the learned modulation vector $W_{c}$ as weights, the original input feature $X$ is recalibrated through element-wise multiplication (broadcasted along the spatial dimensions), yielding the final output feature $F$:

$$F = X \otimes W_{c}. \tag{3}$$
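Equations (2) and (3) can be illustrated with a small NumPy sketch. It follows Equation (2) as written, with an average-pooled branch only (Algorithm 1 additionally lists a max-pooled branch, which is omitted here). Because a $1 \times 1$ convolution applied to a $1 \times 1$ descriptor is a matrix product, the two convolutions are written as matrices. Biases are left out, the weights are random placeholders rather than trained values, and the names `channel_attention`, `w1`, and `w2` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """Channel recalibration following Eqs. (2)-(3).

    x  : (C, H, W) backbone features
    w1 : (C // r, C) weights standing in for Conv_1 (reduction)
    w2 : (C, C // r) weights standing in for Conv_2 (restoration)
    Returns the recalibrated features F and the channel weights W_c.
    """
    descriptor = x.mean(axis=(1, 2))           # global average pool -> (C,)
    hidden = np.maximum(w1 @ descriptor, 0.0)  # Conv_1 followed by ReLU
    w_c = sigmoid(w2 @ hidden)                 # Conv_2 followed by Sigmoid
    f = x * w_c[:, None, None]                 # broadcast over H and W
    return f, w_c


rng = np.random.default_rng(0)
C, r = 64, 16                                  # r = 16 as in the text; C is illustrative
x = rng.standard_normal((C, 15, 15))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
f, w_c = channel_attention(x, w1, w2)
```

With $r = 16$, the hidden layer here has only 4 units for 64 channels, which is the source of the reduced computational cost mentioned above.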

3.3. Edge-Guided Spatial Attention

In computer vision architectures, integrating explicit structural information, such as object boundaries, into deep semantic features can significantly enhance the precise localization of target regions. This module aims to leverage the structural priors as a spatial gate to guide the model’s focus. As illustrated in the left part of Figure 2d, to effectively align the high-resolution external edge map $E$ with the corresponding internal semantic features $F$ in the spatial domain, we introduce a mechanism termed Consecutive Average Pooling (CAP). This operation is formally defined as follows:

$$F_{E} = \mathrm{CAP}(E). \tag{4}$$

Rather than utilizing a single pooling operation with a large kernel size, which often leads to abrupt downsampling and loss of structural connectivity, CAP employs five sequential average pooling layers, each with $2 \times 2$ kernels. The objective of this consecutive downsampling strategy is to effectively suppress spatial aliasing. Spatial aliasing is a common degradation artifact in single-stage pooling, where high-frequency structural details, such as continuous thin edges, become disjointed or heavily distorted. By reducing the resolution step-by-step, CAP ensures a smoother transition, thereby preserving the integrity of the edge guidance. We discuss the benefits of this specific design in detail in Section 4.4.
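A minimal NumPy sketch of the CAP operation in Equation (4) is given below. It assumes each $2 \times 2$ average pooling layer uses stride 2 with no padding, which the text does not specify; the function names `avg_pool_2x2` and `cap` are hypothetical.

```python
import numpy as np


def avg_pool_2x2(m):
    """2x2 average pooling with stride 2 on a 2-D map (assumed stride)."""
    h, w = m.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even at every stage")
    # Group pixels into non-overlapping 2x2 blocks and average each block.
    return m.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def cap(edge_map, stages=5):
    """Consecutive Average Pooling: apply 2x2 average pooling `stages` times."""
    out = np.asarray(edge_map, dtype=np.float64)
    for _ in range(stages):
        out = avg_pool_2x2(out)
    return out


# Toy edge map: a one-pixel-wide vertical line on a 480 x 480 canvas.
edge = np.zeros((480, 480))
edge[:, 100] = 1.0
f_e = cap(edge)
```

For a 480 × 480 edge map the five stages produce a 15 × 15 guidance map, and the thin line in the toy input survives as an unbroken, low-amplitude column rather than as isolated fragments.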

The core objective of this module is to identify and highlight salient spatial regions by merging inherent feature statistics (deep representations derived from the backbone network) with external structural features (the processed edge map $F_{E}$). To extract the spatial statistics from the semantic feature map $F$, we apply both average pooling and maximum pooling operations independently across the channel dimension. This process collapses the multi-channel tensor into two distinct single-channel descriptors: the average-pooled map captures the overall distribution of spatial activations, while the max-pooled map highlights the most discriminative spatial responses. Specifically, we aggregate $F$ through these two pooling operations, and then concatenate the resulting maps with the structurally preserved edge feature $F_{E}$ along the channel axis:

$$F_{concat} = [\mathrm{Avg}(F); \mathrm{Max}(F); F_{E}] \in \mathbb{R}^{3 \times H' \times W'} \tag{5}$$

This concatenation constructs a unified 3-channel spatial descriptor $F_{concat}$ that simultaneously encapsulates background context, salient object responses, and explicit boundary priors. To generate the final spatial weights from this composite representation, a convolution with a kernel size of $7 \times 7$, denoted $\mathrm{Conv}_{7 \times 7}$, is then applied to capture extensive spatial context and local inter-pixel relationships, yielding the attention map $W_{s}$:

$$W_{s} = \sigma(\mathrm{Conv}_{7 \times 7}(F_{concat})) \tag{6}$$
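Equations (5) and (6), together with the subsequent gating of $F$, can be sketched in NumPy as follows. The sketch assumes zero padding so that the $7 \times 7$ convolution preserves the spatial size, a single output channel, and no bias term; none of these details is given in the text. The kernel is a random placeholder, and `conv2d_same` and `edge_guided_spatial_attention` are hypothetical names.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Zero-padded cross-correlation of x (C_in, H, W) with kernel (C_in, k, k).

    Produces a single-channel (H, W) map; k is assumed to be odd.
    """
    _, h, w = x.shape
    k = kernel.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            # Weighted sum over input channels for this kernel offset.
            out += np.tensordot(kernel[:, i, j], xp[:, i:i + h, j:j + w], axes=1)
    return out


def edge_guided_spatial_attention(f, f_e, kernel):
    """f: (C, H', W') features; f_e: (H', W') pooled edge map; kernel: (3, 7, 7)."""
    avg_map = f.mean(axis=0)                       # Avg(F) across channels
    max_map = f.max(axis=0)                        # Max(F) across channels
    f_concat = np.stack([avg_map, max_map, f_e])   # Eq. (5): (3, H', W')
    w_s = 1.0 / (1.0 + np.exp(-conv2d_same(f_concat, kernel)))  # Eq. (6)
    return f * w_s[None, :, :], w_s                # gate every channel of F


rng = np.random.default_rng(0)
f = rng.standard_normal((8, 15, 15))
f_e = rng.random((15, 15))
kernel = 0.05 * rng.standard_normal((3, 7, 7))
f_out, w_s = edge_guided_spatial_attention(f, f_e, kernel)
```

The edge prior enters only through the third input channel of the convolution, so the learned kernel weights for that channel determine how strongly contours raise or lower the spatial gate.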

The architectural rationale behind this design is to establish a structural information bottleneck that counteracts the inherent texture bias of CNNs. As argued by Geirhos et al. [7], standard neural architectures predominantly rely on local texture statistics, which often prove insufficient or even misleading in the context of art style classification, where brushstroke patterns (painterly features) may overlap across disparate genres.

By injecting exogenous edge-based geometric constraints into the spatial attention mechanism, EG-CBAM forces the model to achieve structural alignment between the extracted semantic features and the underlying physical contours. While the channel attention module remains responsible for modeling the global importance of stylistic attributes (what features to focus on), the edge-guided spatial attention acts as a gatekeeper (where to focus), ensuring that the most salient activations are anchored to the artwork’s fundamental structural logic rather than high-frequency textural noise. This synergy facilitates a conditional feature selection mechanism: the network is encouraged to prioritize features that possess both stylistic significance and structural coherence, effectively operationalizing Wölfflin’s theory through a decoupled yet synergistic dual-stream optimization.

3.4. Classification

The final refined feature is computed by spatially re-weighting $F$ with $W_{s}$. By utilizing the learned spatial weight matrix $W_{s}$ as a gating signal, a weighted computation is applied to the primary feature maps $F$. This process integrates the edge information into the feature maps, ultimately generating a complete feature representation. In the subsequent stages, this representation is compressed into a compact feature vector, and the classification result is yielded through a fully connected layer.

Specifically, by performing element-wise multiplication ($\otimes$) between $F$ and $W_{s}$, followed by feature aggregation and projection, the classification result $y$ is derived:

$$y = \mathrm{Softmax}(\mathrm{FC}(\mathrm{GlobalAvgPool}(F \otimes W_{s}))) \tag{7}$$

Here, the Global Average Pooling (GlobalAvgPool) operation compresses the complete feature maps into a compact, one-dimensional feature vector. Subsequently, the Fully Connected (FC) layer serves as a linear classifier, and the Softmax function, as in step 12 of Algorithm 1, maps its output to the final categorical probability distribution $y$.
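The classification head can be sketched in NumPy as global average pooling, a linear layer, and the Softmax listed in step 12 of Algorithm 1. The weights below are random placeholders, the channel count is illustrative, and the names `classify`, `fc_weight`, and `fc_bias` are hypothetical.

```python
import numpy as np


def classify(f_out, fc_weight, fc_bias):
    """f_out: (C, H', W') gated features; fc_weight: (K, C); fc_bias: (K,).

    Returns a length-K probability vector over style categories.
    """
    v = f_out.mean(axis=(1, 2))            # GlobalAvgPool -> (C,)
    logits = fc_weight @ v + fc_bias       # FC layer -> (K,)
    z = logits - logits.max()              # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()             # Softmax


rng = np.random.default_rng(0)
C, K = 64, 27                              # 27 WikiArt styles; C is illustrative
f_out = rng.standard_normal((C, 15, 15))
fc_weight = 0.1 * rng.standard_normal((K, C))
fc_bias = np.zeros(K)
probs = classify(f_out, fc_weight, fc_bias)
predicted_style = int(probs.argmax())
```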

In summary, the dual-stream architecture of ESA-Net is designed to process two complementary modalities of visual information. The RGB stream is responsible for extracting dominant textures and color distribution features. In contrast, the edge stream applies a robust structural prior to feature identification and selection, compelling the model to concentrate on edge information. Our proposed EG-CBAM block fuses this edge information with the backbone features and supplements the CNN’s comprehension of edge structure, reflecting the importance of integrating edge information in art style classification tasks.

4. Experiments

4.1. Dataset and Preprocessing

We evaluate our model on the WikiArt dataset [12], a large-scale dataset that spans a long historical period and a large number of art categories.

4.1.1. Dataset Statistics

The dataset contains 81,446 unique high-resolution images categorized into 27 distinct styles. As shown in Figure 3, the dataset exhibits a severe long-tail distribution. The Pareto curve explicitly illustrates this imbalance, revealing that the top 12 majority classes account for 80% of the cumulative data share. The significance of this Pareto distribution lies in its direct impact on feature representation: it visually quantifies the risk of the model developing a bias. Because the feature space is overwhelmingly dominated by these few head classes, standard CNNs tend to optimize for their prevalent patterns (such as dominant textures), thereby neglecting the nuanced structural logic of the tail categories. Furthermore, the compositional breakdown in Figure 4 highlights that the top 10 styles account for approximately 76.7% of the entire dataset, with the remaining 17 minority styles comprising only 23.3% (labeled as “Others”). While mainstream styles such as Impressionism and Realism contribute over 20,000 samples each, minority genres like Action Painting and Analytical Cubism contain at most a few hundred images. Such an intrinsically imbalanced data distribution poses a significant challenge for model performance.

4.1.2. Data Partitioning

To ensure experimental consistency and reproducibility, the dataset is partitioned into training and test sets at a 9:1 ratio using a fixed random seed. The final training set comprises 73,302 RGB-edge image pairs, while the test set contains 8144 pairs. Each pair consists of an RGB image and its corresponding edge map.
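A minimal sketch of such a seeded 9:1 partition is given below. It assumes a plain (non-stratified) shuffle, a seed of 42, and a test share rounded down; the source states only that a fixed seed and a 9:1 ratio are used, so these details and the helper name `split_indices` are assumptions.

```python
import random


def split_indices(n_items, seed=42):
    """Shuffle indices with a fixed seed and hold out one tenth as the test set."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = n_items // 10  # 9:1 ratio, rounding the test share down
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(81_446)
print(len(train_idx), len(test_idx))  # 73302 8144
```

Under these assumptions the set sizes match the 73,302 and 8144 pairs reported above, and rerunning with the same seed reproduces the same partition.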

4.1.3. Preprocessing and Edge Extraction

In line with existing studies, all images are resized to a standard resolution of $480 \times 480$ pixels using bicubic interpolation. To generate the structural priors, we employ the Sobel operator for edge extraction. The selection of the Sobel operator over other edge detection methodologies is grounded in both technical robustness and theoretical alignment with art history.

Technically, the Sobel operator, as a first-order derivative-based gradient filter, offers superior stability compared to second-order operators like the Laplacian. While the Laplacian is highly sensitive to fine details, it tends to disproportionately amplify high-frequency textural noise and local brushstroke artifacts inherent in high-resolution digitized artworks, leading to fragmented and chaotic edge maps. In contrast, the Sobel operator effectively suppresses such noise while preserving the continuity of salient structural boundaries.

Furthermore, although deep-learning-based detectors (e.g., HED [35] or PiDiNet [36]) excel in general computer vision tasks, they are typically pre-trained on semantic datasets such as BSDS500. Consequently, these models are biased toward identifying “semantic object boundaries” in natural scenes. According to Wölfflin’s theory, artistic “linearity” does not strictly equate to the physical segmentation of objects; it encapsulates the artist’s subjective structural logic and formalist contours. By utilizing the Sobel operator—a non-parametric, mathematically transparent gradient tool—we ensure that the extracted structural constraints remain “semantically neutral.” This approach avoids the introduction of modern semantic biases, allowing the EG-CBAM module to focus purely on the geometric skeletons of the artworks as originally intended by the artists.
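To make the "non-parametric" character of the Sobel operator concrete, the following sketch computes a Sobel gradient magnitude on a small grayscale grid in plain Python. The use of the L2 magnitude, the absence of border padding, and the lack of any thresholding or normalization are assumptions; the source does not specify these details, and a practical pipeline would normally rely on an image-processing library.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Gradient magnitude of a 2-D grayscale image (list of rows), interior pixels only."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            gx = gy = 0.0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * pixel
                    gy += SOBEL_Y[dr][dc] * pixel
            row.append(math.hypot(gx, gy))
        out.append(row)
    return out


# Toy image: dark left columns, bright right columns (a vertical step edge).
step = [[0, 0, 1, 1, 1] for _ in range(5)]
for row in sobel_magnitude(step):
    print(row)  # [4.0, 4.0, 0.0]
```

The response is non-zero only where the window straddles the step, and the kernels contain no learned weights, which is the sense in which the extracted prior is "semantically neutral".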

4.2. Implementation Details

The proposed method is optimized using SGD with a momentum of $0.9$ and a weight decay of $1 \times 10^{-4}$. The training process spans a total of 50 epochs with a batch size of 32. The initial learning rate is set to $0.01$ and adjusted with a cosine annealing scheduler. To ensure reproducibility, the random seed is fixed at 42 for all experiments. Computational efficiency is enhanced through the application of Automatic Mixed Precision (AMP). All experiments are conducted on a single NVIDIA RTX 5090 GPU, utilizing eight worker threads for accelerated data loading. The classification performance is quantitatively evaluated using top 1 and top 5 accuracy, alongside Precision, Recall, and F1-score for a comprehensive assessment. For qualitative interpretation of the model’s decision-making process, Grad-CAM is employed to visualize the regions critical to the final predictions.
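For reference, the learning-rate trajectory implied by this schedule can be sketched in closed form. The snippet assumes cosine annealing without restarts, a minimum learning rate of 0, per-epoch updates, and a period equal to the 50 training epochs; none of these details is stated in the source, and the function name is illustrative.

```python
import math


def cosine_annealing_lr(epoch, base_lr=0.01, total_epochs=50, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing without restarts."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


for epoch in (0, 10, 25, 40, 50):
    print(epoch, round(cosine_annealing_lr(epoch), 6))
```

Under these assumptions the rate starts at 0.01, passes 0.005 at the midpoint (epoch 25), and decays to 0 at epoch 50.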

4.3. Experimental Results and Comparison

Table 2 presents a comprehensive quantitative comparison of our proposed ESA-Net against several established baseline methods on the WikiArt dataset. All metrics are reported in percentages (%), with the best results highlighted in bold. The results demonstrate that ESA-Net achieves a new state-of-the-art performance across the majority of evaluation metrics.

Specifically, ESA-Net achieves a top 1 accuracy of 69.40%, establishing a significant margin of 4.96% over the best-performing baseline by Wu et al. [40]. Although the method by Wu et al. obtains a marginally higher top 5 accuracy (96.58% vs. 95.97%), the substantial improvement in our top 1 accuracy indicates that our model possesses superior discriminative capability to pinpoint the exact artistic style.

Furthermore, the evaluations based on Precision, Recall, and F1-Score highlight the robustness of ESA-Net against the inherent long-tail distribution of the WikiArt dataset (detailed in Figure 3). As evidenced in Table 2, early CNN-based methods (e.g., Karayev et al. [11] and Tan et al. [15]) struggle to effectively handle class imbalance, exhibiting an imbalanced phenomenon where recall significantly exceeds precision. In contrast, ESA-Net significantly mitigates this issue, achieving a highly balanced performance with a Precision of 70.39% and a Recall of 68.01%. This balanced and superior performance empirically validates our architectural design: by explicitly incorporating edge structure features to guide model learning, ESA-Net effectively overcomes the common tendency of networks to over-rely on dominant texture patterns prevalent in head classes. Instead, the model successfully learns more fundamental, structural style representations, thereby maintaining high classification efficacy even for minority classes with limited samples.
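Because the discussion above hinges on the balance between Precision and Recall under class imbalance, a short sketch of how such metrics are typically derived from a confusion matrix may be useful. It assumes macro averaging (each class weighted equally); the source does not state the averaging mode, and the toy matrix is purely illustrative.

```python
def macro_metrics(confusion):
    """Macro-averaged precision, recall, and F1 from a square confusion matrix.

    confusion[i][j] counts samples of true class i that were predicted as class j.
    """
    n = len(confusion)
    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        p = true_positive / predicted_k if predicted_k else 0.0
        r = true_positive / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1_scores.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


# Toy 3-class confusion matrix (illustrative counts only).
toy = [
    [8, 2, 0],
    [1, 5, 4],
    [0, 1, 9],
]
precision, recall, f1 = macro_metrics(toy)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

With macro averaging, the F1-score is the mean of the per-class F1 values and generally differs from the harmonic mean of the averaged Precision and Recall. The reported F1-score of 68.80% is below the harmonic mean of 70.39% and 68.01% (about 69.18%), which is consistent with this kind of per-class averaging.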

4.4. Ablation Studies

We conducted a series of ablation studies on the WikiArt dataset, which not only demonstrate the superiority of ESA-Net but also validate our previous hypotheses regarding texture and contour. Specifically, we systematically investigated: (1) Selection of backbone network; (2) Generalizability of EG-CBAM; (3) Ablation of components; and (4) The effectiveness of the consecutive average pooling layers (CAP) strategy.

4.4.1. Selection of Backbone Network

The feature extraction capability of the backbone network largely determines the upper performance limit of artistic style classification tasks. To identify the most suitable foundational architecture for our proposed model, we comparatively evaluated three representative convolutional neural networks under the condition of integrating the same core modules: ResNet50 [41], DenseNet121 [42], and EfficientNet-B0 [43].

The detailed comparison results are presented in Table 3. The empirical data demonstrate that EfficientNet-B0 achieved the optimal performance across four core evaluation metrics: top 1 Accuracy (69.40%), Precision (70.39%), Recall (68.01%), and F1-Score (68.80%). Although DenseNet121 holds a marginal advantage in top 5 Accuracy (96.14% vs. 95.97%), when comprehensively considering the model’s precise recognition capability for single style categories (evidenced by the significantly higher top 1 and F1-Score), we ultimately selected EfficientNet-B0 as the default backbone network for ESA-Net.

We attribute this superiority to the architectural alignment between EfficientNet-B0 and the intrinsic characteristics of artistic images. The Mobile Inverted Bottleneck Convolution (MBConv) utilizes depthwise separable convolutions to decouple spatial and channel feature extraction. Compared to ResNet50, this mechanism is significantly more efficient in capturing the discriminative, subtle brushstrokes and multi-scale textures prevalent in paintings.

In contrast, although DenseNet121 achieves a high top 5 accuracy via dense feature reuse, it lacks explicit attention guidance for targeted feature selection. ResNet50 similarly exhibits limited representational capacity for highly abstract stylistic features. Therefore, EfficientNet-B0 provides the optimal capability for feature selectivity and detail capture.

4.4.2. Generalizability of EG-CBAM

To validate the architectural adaptability and model-agnostic nature of the proposed EG-CBAM, we incorporated it into two distinct convolutional neural network paradigms with varying parameter scales: ResNet18 and our standard backbone, EfficientNet-B0. As demonstrated in Table 4, integrating EG-CBAM consistently enhances classification metrics across both backbones. Specifically, for ResNet18, the F1-Score increases from 55.82% to 57.02%. Similarly, when applied to EfficientNet-B0, the precision and F1-Score experience notable improvements of 1.67% and 1.17%, respectively. These empirical results substantiate that EG-CBAM can effectively recalibrate feature representations and improve the discriminative capacity of diverse baseline models without requiring customized structural modifications.

Furthermore, ESA-Net exhibits distinct advantages in deployment efficiency and architectural generality when compared to recent state-of-the-art approaches, such as the work by Luo et al. [39]. Luo et al. employ a knowledge distillation framework to elevate art classification accuracy. While effective, knowledge distillation inherently necessitates a cumbersome two-stage training pipeline, a strict reliance on a pre-trained high-capacity teacher model, and extensive computational overhead during the training phase. Consequently, its practical application is often constrained by these external structural dependencies. In contrast, the superiority of EG-CBAM lies in its self-contained feature refinement mechanism. Rather than relying on external knowledge transfer, EG-CBAM adaptively accentuates salient features along both spatial and channel dimensions intrinsically within the network. This internal feature enhancement paradigm allows EG-CBAM to be seamlessly embedded into existing feed-forward architectures, achieving competitive performance gains with negligible additional computational cost and eliminating the need for complex auxiliary training strategies.

4.4.3. Ablation of Components

To identify the source of the performance gain, specifically to determine whether the improvement stems from the inherent feature recalibration capability of the attention module or from the introduction of external edge priors, we conducted a comprehensive ablation study on the network components. The original EfficientNet-B0, without any auxiliary modules, was established as the baseline (Base Model). We subsequently evaluated the classification performance of the baseline integrated with the vanilla CBAM [44] (Base + CBAM) and with our proposed Edge-Guided CBAM (Base + EG-CBAM).

The experimental results presented in Table 5 reveal a counter-intuitive phenomenon: the direct incorporation of the vanilla CBAM module into the baseline model fails to yield the anticipated performance gains; rather, it leads to a degradation in generalization capability. Specifically, the addition of the vanilla CBAM decreases the top 1 accuracy from 69.25% to 68.65%, and the F1-Score from 67.63% to 67.26%. Conversely, improvements across key metrics are only observed when the external edge guidance is introduced. Notably, the Base + EG-CBAM model achieves a Precision of 70.39% and an elevated F1-Score of 68.80%.

These performance discrepancies substantiate our “Texture Bias” hypothesis. Unlike natural images, artworks contain intense brushstrokes and complex textures. Lacking spatial constraints, the vanilla CBAM assigns high attentional weights to these semantically irrelevant high-frequency noises, amplifying stylistic interference and degrading classification performance.

Conversely, EG-CBAM introduces explicit Structural Normalization via edge priors. By utilizing object contours as spatial constraints, it aligns the attention map with actual physical boundaries, suppressing the network’s tendency to overfit local textures. Ultimately, EG-CBAM successfully decouples “semantic texture” and “structural contours”, demonstrating that edge-guided attention is crucial for enhancing feature discriminability and generalization in texture-heavy artistic domains.

4.4.4. The Effectiveness of the Consecutive Average Pooling Layers (CAP) Strategy

As illustrated in Figure 5, the transformation of high-resolution edge maps into deep-seated edge features requires a carefully designed downsampling pipeline to ensure spatial and structural alignment. In this section, we provide a rigorous quantitative evaluation of our proposed Consecutive Average Pooling (CAP) strategy against three distinct regimes: the baseline (No Edge), single-step large-stride pooling, and learnable convolutional downsampling.

The empirical results in Table 6 yield several critical insights. Most notably, the “Single Pooling Layer” approach—which utilizes a single large-stride operation to match the spatial dimensions as depicted in Figure 5—achieved a top 1 accuracy of only 68.37%. This result is paradoxically 0.88% lower than the baseline model without any edge information (69.25%). This performance degradation provides strong evidence for our “structural noise” hypothesis: aggressive, single-stage downsampling ignores the Nyquist-Shannon sampling theorem, leading to severe spatial aliasing. This aliasing effect disintegrates the thin, continuous contours of the edge map into disjointed, non-semantic artifacts, which “misguides” the attention mechanism and introduces destructive noise into the feature fusion process.

In contrast, the proposed CAP strategy, which employs a multi-stage smoothing approach (see Figure 5), significantly reverses this trend. It achieves a top 1 accuracy of 69.40% and an F1-score of 68.80%, marking a substantial +1.03% accuracy gain and a +1.28% F1-score improvement over the single-layer pooling alternative. Mathematically, the CAP strategy acts as a hierarchical low-pass filter. By decomposing the global downsampling task into five sequential $2 \times 2$ operations, it ensures that the topological connectivity of artistic outlines is preserved across varying scales.
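The effect of the five sequential $2 \times 2$ poolings can be illustrated on a toy edge map. The sketch below reduces a $480 \times 480$ grid to $15 \times 15$ by repeated $2 \times 2$ average pooling and contrasts it with a naive stride-32 subsampling. The subsampling function is a hypothetical stand-in for abrupt downsampling, chosen only to make aliasing visible; the source does not specify the kernel of its single-pooling baseline, and all function names here are illustrative.

```python
def avg_pool_2x2(grid):
    """One 2x2 average pooling step with stride 2 on a 2-D list with even side lengths."""
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]


def consecutive_average_pooling(edge_map, steps=5):
    """Apply `steps` successive 2x2 average poolings; each step halves the resolution."""
    for _ in range(steps):
        edge_map = avg_pool_2x2(edge_map)
    return edge_map


def strided_subsample(edge_map, stride=32):
    """Hypothetical abrupt downsampling: keep a single pixel per stride x stride block."""
    return [row[::stride] for row in edge_map[::stride]]


# Toy 480x480 edge map containing one vertical contour that is a single pixel wide.
size, contour_col = 480, 17
edge = [[1.0 if c == contour_col else 0.0 for c in range(size)] for _ in range(size)]

cap = consecutive_average_pooling(edge)
sub = strided_subsample(edge)
print(len(cap), len(cap[0]))  # 15 15
print(cap[0][0], sub[0][0])   # 0.03125 0.0
```

In this toy case the thin contour survives the consecutive pooling as a weak but continuous response down the first column of the $15 \times 15$ map, whereas the stride-32 subsampling misses it entirely.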

Furthermore, while the “Convolutional Layer” approach introduces learnable parameters, it achieved a suboptimal top 1 accuracy of 67.96%. This confirms that the extreme spatial sparsity of Sobel-extracted edges makes them ill-suited for parameterized kernels, which tend to overfit local noise. The superiority of CAP demonstrates that a parameter-free, smoothness-preserving downsampling logic is the most robust method for injecting structural priors into the EG-CBAM module, ensuring that the “Linear” constraints remain coherent and meaningful for the spatial attention gate.

4.5. Visual Interpretability via Grad-CAM

To demonstrate that ESA-Net indeed leverages the structural prior rather than relying solely on texture information, we employ Grad-CAM [27] to visualize and analyze the model’s attention maps. Figure 6 presents a comparative visualization between the baseline EfficientNet-B0 and our proposed model enhanced by the EG-CBAM module.

Owing to the texture bias of vanilla CNNs, the baseline’s attention tends to focus on regions with high luminance or on primary semantic subjects. For instance, in the “High Renaissance” sample, the activation of the baseline model is localized on the characters’ faces. In contrast, the attention of ESA-Net is distributed more diffusely across the torso, extending to the textures of the drapery. Likewise, for the “Ukiyo-e” sample, our method suppresses high-brightness background noise and focuses more accurately than the baseline on subjects with well-defined edges. These results indicate that the EG-CBAM module suppresses background noise effectively and confines the model’s attention to structures defined by edge information.

4.6. Cost Analysis

To quantify the computational overhead introduced by the structural prior and the attention mechanism, we evaluate the parameter volume and Floating Point Operations (FLOPs) of ESA-Net. Table 7 provides a comparative analysis between the baseline backbones and their corresponding edge-guided versions (Base + EG-CBAM).

The empirical data demonstrate that the Edge-Guided Spatial Attention module incurs a minimal computational penalty. For the EfficientNet-B0 backbone, the addition of the EG-CBAM module increases the parameter count by approximately 0.205 M (5.07%), while the FLOPs increase by a negligible margin of 0.001 G. In the case of the ResNet18 backbone, the FLOPs remain virtually unchanged at 16.747 G, with only a slight increment in parameters. This high level of efficiency is primarily achieved because the edge extraction process utilizes the non-parametric Sobel operator and the spatial alignment is handled by the parameter-free Consecutive Average Pooling (CAP) strategy. These results confirm that ESA-Net enhances classification accuracy without compromising the model’s suitability for resource-constrained deployment in digital heritage archives.

4.7. Supplementary Experiments

4.7.1. Stability Analysis and Reproducibility

To ensure the scientific rigor and reproducibility of the proposed ESA-Net, we conducted a stability analysis across multiple independent trials. While standard benchmarks in art style classification often report performance from a single execution, we evaluate the consistency of our model by performing three independent training runs using different random seeds (42, 123, and 456). All other experimental configurations, including hyperparameters, data partitioning, and the cosine annealing scheduler, remained strictly identical to those described in Section 4.2.

The results of these trials, along with the calculated Mean and Standard Deviation (Mean ± Std), are summarized in Table 8.

As evidenced by the data, the standard deviation for the core metric, top 1 accuracy, is remarkably low at $\pm 0.35\%$, and the standard deviation of the F1-score remains within $\pm 1.01\%$. These results demonstrate that the performance improvements in ESA-Net are not the result of fortuitous weight initialization but are rooted in the robust structural guidance provided by the EG-CBAM module. The high degree of stability and low variance across trials confirm that our proposed method is both reliable and highly reproducible for large-scale art analysis.
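The Mean ± Std summary over the three seeds can be reproduced with the Python standard library. The accuracies below are placeholder values chosen for illustration, not the entries of Table 8, and the source does not say whether the sample or the population standard deviation is reported, so the sketch computes both.

```python
import statistics


def summarize_runs(values):
    """Mean plus sample and population standard deviation of repeated runs."""
    mean = statistics.mean(values)
    sample_std = statistics.stdev(values)        # divides by n - 1
    population_std = statistics.pstdev(values)   # divides by n
    return mean, sample_std, population_std


# Placeholder top 1 accuracies for three seeds (hypothetical, not from Table 8).
placeholder_runs = [69.0, 69.5, 70.0]
mean, sample_std, population_std = summarize_runs(placeholder_runs)
print(f"{mean:.2f} ± {sample_std:.2f} (sample), ± {population_std:.2f} (population)")
```

With only three runs the two conventions differ noticeably, so stating which one is used makes the reported spread easier to compare across papers.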

4.7.2. Comparison with Transformer-Based Architectures

To rigorously evaluate the architectural adaptability of the EG-CBAM module and determine whether explicit structural priors remain necessary alongside the implicit global modeling of Vision Transformers (ViTs), we substituted the CNN backbone with a Swin Transformer (Swin-T) [16]. This comparison assesses if the self-attention mechanism in Transformers can bypass the need for edge-guided spatial attention.

As shown in Table 9, ESA-Net (w/Swin-T) achieves a top 1 accuracy of 69.51%, establishing the empirical upper bound for our framework. This indicates that EG-CBAM is highly versatile, successfully complementing the long-range dependency modeling of Transformers to further refine feature localization. However, the performance gain over the EfficientNet-B0 variant is a marginal 0.11%, suggesting that our explicit structural prior provides a discriminative signal robust enough to rival the implicit modeling of much larger Transformer models.

4.7.3. Training Dynamics and Efficiency Analysis

Beyond final accuracy, we analyzed the training stability and computational cost of the candidate backbones. Figure 7 illustrates the convergence behavior over 50 epochs. While Swin-T converges relatively early (around epoch 30), its test accuracy exhibits higher volatility, likely due to the higher complexity of self-attention optimization on the long-tailed WikiArt dataset. In contrast, EfficientNet-B0 demonstrates a superior optimization profile, characterized by a smooth, monotonic decrease in loss and steady accuracy improvement.

Table 10 summarizes the efficiency metrics. Although ResNet18 is marginally faster per epoch, its F1-score is non-competitive. Critically, the Swin-T variant entails a significant computational penalty: it possesses 6.6 times more parameters and requires 22.4% more training time per epoch than the EfficientNet-B0 version, yet offers no substantial improvement in F1-score (68.65% vs. 68.80%). Consequently, EfficientNet-B0 is selected as the optimal foundational backbone for ESA-Net, providing an elite balance of precision, stability, and resource economy for large-scale digital heritage applications.

4.7.4. Cross-Dataset Generalization

To evaluate the robustness and domain-invariance of the learned structural representations, we conducted a cross-dataset evaluation using the SemArt dataset. This dataset is particularly suitable for assessing generalization as its imagery is sourced exclusively from the Web Gallery of Art, thereby ensuring an absence of data overlap with WikiArt-based archives. Such a setup is critical to avoid the data contamination frequently encountered in other art benchmarks that share overlapping web sources.

We employed a linear probing protocol to assess the quality of the features extracted by the pre-trained ESA-Net. The backbone, previously optimized on the WikiArt dataset, was frozen to function as a static feature extractor, while a single linear classification layer was trained to categorize the ten artistic genres (Types) defined in SemArt. Without any fine-tuning of the internal weights, the model achieved a top 1 accuracy of 69.97% on the SemArt test set. This performance under frozen-backbone transfer, in which only the linear layer is fitted to SemArt, indicates that the Edge-Guided Spatial Attention module captures an intrinsic structural logic that is consistent across disparate digital archives. The result suggests that by anchoring the attention mechanism to explicit geometric priors rather than dataset-specific textural artifacts, ESA-Net develops a generalized stylistic representation that is highly transferable to independent art domains.
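The linear probing protocol can be sketched as follows: features from a frozen extractor are treated as fixed inputs, and only one linear layer is trained on top of them. The example uses hand-written two-dimensional features as a hypothetical stand-in for frozen backbone outputs, and it assumes softmax cross-entropy with full-batch gradient descent; the optimizer, loss, and schedule of the actual probe are not given in the source.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(weights, biases, x):
    """Class scores of a single linear layer applied to feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit one linear layer on fixed features with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(scores(weights, biases, x))
            for k in range(num_classes):
                err = probs[k] - (1.0 if k == y else 0.0)  # gradient of cross-entropy w.r.t. score k
                grad_b[k] += err
                for j in range(dim):
                    grad_w[k][j] += err * x[j]
        for k in range(num_classes):
            biases[k] -= lr * grad_b[k] / n
            for j in range(dim):
                weights[k][j] -= lr * grad_w[k][j] / n
    return weights, biases


def predict(weights, biases, x):
    """Index of the highest-scoring class."""
    s = scores(weights, biases, x)
    return max(range(len(s)), key=s.__getitem__)


# Hypothetical frozen-backbone outputs: two classes, two-dimensional features.
frozen_features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
labels = [0, 0, 0, 1, 1, 1]

w, b = train_linear_probe(frozen_features, labels, num_classes=2)
accuracy = sum(predict(w, b, x) == y for x, y in zip(frozen_features, labels)) / len(labels)
print(accuracy)
```

Since the feature vectors never change during training, the probe accuracy reflects how linearly separable the frozen representation already is for the new label set.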

5. Discussion

5.1. Error Analysis via Confusion Matrix

We further investigated the classification characteristics of ESA-Net through the normalized confusion matrix (Figure 8). The model demonstrates high efficacy in styles characterized by clear structural logic, but faces challenges in styles primarily defined by color and texture.

Ukiyo-e (95%), Synthetic Cubism (91%), and Northern Renaissance (86%) all exhibit strong performance. Consistent with our architectural design, the model shows high classification accuracy in styles that rely heavily on lines and structural features.

The model’s errors, by contrast, are concentrated in styles dominated by color and texture, and fall into two groups:

Lack of Formal Definition: Significant confusion (55%) exists between Action Painting and Abstract Expressionism. As visualized in the left panel of Figure 9, both styles are characterized by spontaneous, gestural paint application rather than the depiction of physical forms. Consequently, the Sobel operator extracts chaotic, high-frequency noise derived from paint splatters and canvas textures rather than coherent object boundaries. Since our ESA-Net heavily relies on consistent contour structures (the “linear” prior) to guide the EG-CBAM module, the absence of stable geometric topologies in these abstract genres fundamentally neutralizes the advantage of structural guidance, causing the model to default back to learning unstructured textural noise.
Evolutionary Overlap: Moderate confusion was observed among Impressionism, Post-Impressionism, and Fauvism. Rather than a simple computational error, this overlap macroscopically mirrors the authentic evolutionary logic of art history. As shown in the right panel of Figure 9, there is a clear structural transition across these movements. Impressionism dissolves physical boundaries into fragmented, light-driven brushstrokes (resulting in a highly noisy and dense edge map). As art evolved chronologically into Post-Impressionism and eventually Fauvism, artists began to reintroduce distinct, explicit outlines (e.g., Cloisonnism), which is clearly reflected in the increasingly continuous and bold white contours in their respective edge maps. Because these styles share a continuous trajectory of structural reshaping, the discrete artificial labels of the dataset inherently conflict with their visual continuity, reducing class separability.

Unlike traditional general image classification tasks (such as distinguishing cats from dogs in ImageNet), where objective and absolute physical boundaries exist between categories, the taxonomy of artworks is inherently characterized by expert subjectivity and chronological ambiguity. Consequently, the model’s confusion matrix reflects the continuous evolutionary trajectory of art history rather than mere computational errors. For example, in the historical progression from Impressionism to Post-Impressionism and Fauvism, the mid-to-late works of masters like Cézanne or Van Gogh frequently integrate the atmospheric light-and-shadow textures of the former with the structural reshaping of the latter.

Thus, the classification confusion, or probability overlap, produced by ESA-Net among these highly correlated styles offers a valuable insight: the model avoids rigidly overfitting the discrete artificial labels of the dataset. Instead, it captures the historical continuity embedded within the visual features of these art movements. This further substantiates that by integrating Wölfflin’s theoretical priors, the representation space learned by ESA-Net not only maintains robust discriminative capacity but also macroscopically resonates with the authentic evolutionary logic of art history.

5.2. Quantitative Analysis of Texture Bias Reduction

To rigorously quantify the extent to which ESA-Net mitigates the inherent texture bias of CNNs, we perform a class-wise accuracy divergence analysis between genres aligned with Wölfflin’s “Linear” and “Painterly” paradigms. If the structural priors introduced by the EG-CBAM module are effectively utilized, the model should exhibit a significant performance ceiling in genres where stylistic identity is rooted in geometric clarity, while demonstrating expected confusion in genres characterized by amorphous textures.

As evidenced by the quantitative data in the normalized confusion matrix (Figure 8), our model achieves near-optimal discriminative power in “Linear” styles with well-defined contours: Ukiyo-e (95%), Synthetic Cubism (91%), and Northern Renaissance (86%). These high per-class accuracies indicate that the structural guidance effectively anchors the network’s attention to the artwork’s skeletal layout rather than local brushstroke noise.

In stark contrast, styles defined primarily by spontaneous color application and the absence of formal boundaries—where “Painterly” features dominate—exhibit a substantial quantitative degradation in classification accuracy. For instance, Action Painting yields an accuracy of only 45%, with a 55% probability of misclassification into Abstract Expressionism. This massive performance gap (exceeding 40%) between structure-dominant and texture-dominant categories provides concrete empirical evidence that ESA-Net’s decision-making logic has shifted from a reliance on local texture statistics to global structural logic. This divergence quantitatively validates that the network has successfully incorporated the “Linear” prior as a primary discriminative feature, thereby effectively operationalizing a theoretically-grounded reduction in texture bias.
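The class-wise divergence described above amounts to comparing diagonal entries of the normalized confusion matrix across two groups of styles. The sketch below assumes the matrix is row-normalized, so that each diagonal entry is the fraction of a true class classified correctly; the source says only "normalized". The helper names are illustrative, and the per-class values are those quoted in the text.

```python
def row_normalize(confusion):
    """Divide each row by its sum: entry [i][j] becomes the share of true class i predicted as j."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def accuracy_gap(group_a, group_b):
    """Difference between the mean per-class accuracies of two groups of styles."""
    mean_a = sum(group_a.values()) / len(group_a)
    mean_b = sum(group_b.values()) / len(group_b)
    return mean_a - mean_b


# Diagonal values quoted in the text, expressed as fractions.
linear = {"Ukiyo-e": 0.95, "Synthetic Cubism": 0.91, "Northern Renaissance": 0.86}
painterly = {"Action Painting": 0.45}

print(round(accuracy_gap(linear, painterly), 4))                  # 0.4567
print(round(min(linear.values()) - max(painterly.values()), 2))   # 0.41
```

Both the mean gap and the smallest pairwise gap exceed 40 percentage points for the styles listed, which matches the magnitude cited in the paragraph above.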

5.3. Limitations and Future Directions

Despite the performance gains achieved by ESA-Net, the explicit integration of structural priors introduces specific theoretical and technical constraints that warrant further discussion.

Sensitivity to Exogenous Edge Noise. The efficacy of the EG-CBAM module is fundamentally contingent upon the fidelity of the extracted structural priors. In this study, the Sobel operator serves as a deterministic gradient filter; however, it exhibits susceptibility to high-frequency artifacts. For digitized artworks characterized by significant JPEG compression noise or physical degradation—such as surface craquelure, canvas aging, or pigment cracking—the edge detection process may inadvertently capture stochastic noise rather than meaningful stylistic contours. Such non-semantic signals can introduce “structural interference” into the spatial attention gate, potentially polluting the feature recalibration process and diminishing classification robustness in low-quality digital archives.

Generalization Bottlenecks in Amorphous Styles. Following Wölfflin’s “Linear vs. Painterly” dichotomy, the proposed architecture is inherently optimized for styles possessing identifiable geometric skeletons. However, its generalizability is constrained when encountering genres defined by “informalism” or a total absence of delineated forms. As evidenced by the performance deficit in categories like Color Field Painting and Abstract Expressionism, the structural branch faces a state of informational sparsity when boundaries are deliberately dissolved into tonal masses. In these instances, the “Linear-Aware” mechanism lacks stable anchors for spatial gating, causing the model to default to unstructured textural patterns, which limits the marginal utility of the dual-stream architecture.

Future Prospects. To mitigate these limitations, future research will focus on transitioning from static edge extraction to dynamic, content-aware methodologies. A promising trajectory involves the implementation of adaptive thresholding mechanisms or learnable gradient encoders that can distinguish between aesthetic contours and stochastic noise. Furthermore, exploring the integration of multi-scale structural descriptors may provide a more comprehensive representation for transitional styles that occupy the threshold between the linear and the painterly, thereby enhancing the model’s adaptability across the full spectrum of art history.

6. Conclusions

In this paper, we introduce the Edge-Guided Spatial Attention Network (ESA-Net), a novel architecture designed to mitigate the “texture bias” in traditional convolutional neural networks (CNNs) for artistic style classification. By formalizing the problem through Heinrich Wölfflin’s “Linear vs. Painterly” theory, we introduce the Edge-Guided Convolutional Block Attention Module (EG-CBAM), which integrates edge priors with deep semantic features.

Experimental evaluations on the WikiArt dataset demonstrate that ESA-Net achieves strong performance, reaching a top 1 accuracy of 69.40% and an F1-score of 68.80%. Moreover, the qualitative analysis using Grad-CAM indicates that our model aligns its focus with the structural contours that define artistic genres. This alignment not only enhances classification robustness, particularly across long-tail distributions, but also provides empirical support for classical art historical theories.

Despite these advancements, analysis of the confusion matrix reveals persistent challenges in cases characterized by a “Lack of Formal Definition” and “Evolutionary Overlap,” where structural boundaries are deliberately obscured. Future research will explore the integration of multimodal features or more refined semantic texture extraction schemes to address these boundary cases.

Author Contributions

Conceptualization, W.Y. and X.L.; methodology, W.Y. and X.L.; software, W.Y.; validation, W.Y.; formal analysis, W.Y.; investigation, W.Y. and X.L.; resources, W.Y.; data curation, W.Y.; writing—original draft preparation, W.Y.; writing—review and editing, W.Y. and X.L.; visualization, W.Y.; supervision, X.L.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are publicly available. The WikiArt dataset used in our experiments can be accessed and downloaded from https://github.com/cs-chan/ArtGAN/blob/master/WikiArt Dataset/README.md (accessed on 21 May 2026). For more details regarding the creation and structure of this dataset, please refer to [12].

Acknowledgments

The authors acknowledge the use of AI strictly for language editing and phrasing refinement to enhance the readability of this manuscript. All experimental designs, data analyses, interpretations, and scientific conclusions are entirely the original work of the human authors. The authors have carefully reviewed all AI-assisted text modifications and take absolute responsibility for the accuracy and integrity of the final article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chroni, A. Digital Humanities for Preserving Cultural Heritage. In Proceedings of the International Conference on Transdisciplinary Multispectral Modeling and Cooperation for the Preservation of Cultural Heritage; Springer: Berlin/Heidelberg, Germany, 2025; pp. 403–415.
2. Fiorucci, M.; Khoroshiltseva, M.; Pontil, M.; Traviglia, A.; Del Bue, A.; James, S. Machine learning for cultural heritage: A survey. Pattern Recognit. Lett. 2020, 133, 102–108.
3. Cetinic, E.; She, J. Understanding and creating art with AI: Review and outlook. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2022, 18, 1–22.
4. Burger, W.; Burge, M.J. Scale-invariant feature transform (SIFT). In Digital Image Processing: An Algorithmic Introduction; Springer: Berlin/Heidelberg, Germany, 2022; pp. 709–763.
5. Khan, I.M.; Zahoor, F. Intelligent Fire Recognition for Surveillance Control Using Cascaded Multi-Scale Attention Framework. ICCK Trans. Sens. Commun. Control 2026, 3, 15–26.
6. Hassan, M.Z.; Gazis, A.; Khan, A.; Ghazanfar, Z. Learning Cross-Modal Collaboration via Pyramid Attention for RGB Thermal Sensing in Saliency Detection. ICCK Trans. Sens. Commun. Control 2026, 3, 1–14.
7. Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
8. Baker, N.; Lu, H.; Erlikhman, G.; Kellman, P.J. Deep convolutional networks do not classify based on global object shape. PLoS Comput. Biol. 2018, 14, e1006613.
9. Hermann, K.; Chen, T.; Kornblith, S. The origins and prevalence of texture bias in convolutional neural networks. Adv. Neural Inf. Process. Syst. 2020, 33, 19000–19015.
10. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423.
11. Karayev, S.; Trentacoste, M.; Han, H.; Agarwala, A.; Darrell, T.; Hertzmann, A.; Winnemoeller, H. Recognizing image style. arXiv 2013, arXiv:1311.3715.
12. Saleh, B.; Elgammal, A. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. arXiv 2015, arXiv:1505.00855.
13. Bar, Y.; Levy, N.; Wolf, L. Classification of artistic styles using binarized features derived from a deep neural network. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 71–84.
14. Cetinic, E.; Lipic, T.; Grgic, S. Fine-tuning convolutional neural networks for fine art classification. Expert Syst. Appl. 2018, 114, 107–118.
15. Tan, W.R.; Chan, C.S.; Aguirre, H.E.; Tanaka, K. Ceci n’est pas une pipe: A deep convolutional network for fine-art paintings classification. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP); IEEE: Piscataway, NJ, USA, 2016; pp. 3703–3707.
16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
17. Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Lin, C.W.; Zhang, L. TTST: A top-k token selective transformer for remote sensing image super-resolution. IEEE Trans. Image Process. 2024, 33, 738–752.
18. Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Jin, X.; Zhang, L. EDiffSR: An efficient diffusion probabilistic model for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5601514.
19. Zhao, K.; Hung, A.L.Y.; Pang, K.; Hajipour, P.; Wu, H.; Raman, S.; Sung, K. PCa-Mamba: Spatiotemporal state space models for prostate cancer detection in multi-parametric MRI. Med. Image Anal. 2026, 111, 104033.
20. Mao, H.; Cheung, M.; She, J. Deepart: Learning joint representations of visual arts. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1183–1191.
21. Garcia, N.; Renoust, B.; Nakashima, Y. Context-aware embeddings for automatic art analysis. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019; pp. 25–33.
22. Sun, T.; Wang, Y.; Yang, J.; Hu, X. Convolution neural networks with two pathways for image style recognition. IEEE Trans. Image Process. 2017, 26, 4102–4113.
23. Cheng, M.M.; Liu, X.C.; Wang, J.; Lu, S.P.; Lai, Y.K.; Rosin, P.L. Structure-preserving neural style transfer. IEEE Trans. Image Process. 2019, 29, 909–920.
24. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv 2013, arXiv:1312.6034.
25. Springenberg, J.T.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for simplicity: The all convolutional net. arXiv 2014, arXiv:1412.6806.
26. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
27. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
28. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2018; pp. 839–847.
29. Wang, H.; Wang, Z.; Du, M.; Yang, F.; Zhang, Z.; Ding, S.; Mardziel, P.; Hu, X. Score-CAM: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 24–25.
30. Jiang, P.T.; Zhang, C.B.; Hou, Q.; Cheng, M.M.; Wei, Y. Layercam: Exploring hierarchical class activation maps for localization. IEEE Trans. Image Process. 2021, 30, 5875–5888.
31. Wölfflin, H. Kunstgeschichtliche Grundbegriffe: Das Problem der Stilentwicklung in der Neueren Kunst; Bruckmann: Munich, Germany, 1921.
32. Minor, V.H. Art History’s History; Prentice Hall: Englewood Cliffs, NJ, USA, 1994.
33. Jha, D.; Chang, H.H.; Elhoseiny, M. Wölfflin’s Affective Generative Analysis for Visual Art. In Proceedings of the ICCC, Xiamen, China, 28–30 July 2021; pp. 429–433.
34. Su, T.; Wang, G.; Bai, Y.; Wan, R. M-SAITS: A Dual-Stage Time Series Imputation Network via Decoupled Large-Kernel Convolution and Diagonally-Masked Attention. ICCK Trans. Mach. Intell. 2026, 2, 106–115.
35. Xie, S.; Tu, Z. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1395–1403.
36. Su, Z.; Liu, W.; Yu, Z.; Hu, D.; Liao, Q.; Tian, Q.; Pietikäinen, M.; Liu, L. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 5117–5127.
37. Mohammadi, M.R.; Rustaee, F. Hierarchical classification of fine-art paintings using deep neural networks. Iran. J. Comput. Sci. 2021, 4, 59–66.
38. Lecoutre, A.; Negrevergne, B.; Yger, F. Recognizing art style automatically in painting with deep learning. In Proceedings of the Asian Conference on Machine Learning; PMLR: New York, NY, USA, 2017; pp. 327–342.
39. Luo, M.; Liu, L.; Lu, Y.; Suen, C.Y. Art style classification via self-supervised dual-teacher knowledge distillation. Appl. Soft Comput. 2025, 174, 112964.
40. Wu, Y.; Nakashima, Y.; Garcia, N. Not only generative art: Stable diffusion for content-style disentanglement in art analysis. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, Thessaloniki, Greece, 12–15 June 2023; pp. 199–208.
41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
42. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
43. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2019; pp. 6105–6114.
44. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.

Figure 1. Representative samples from WikiArt illustrating the “Linear vs. Painterly” dichotomy, encompassing contour-led styles (e.g., Ukiyo-e), texture-heavy styles (e.g., Impressionism), and various transitional works between these two poles.

Figure 2. Detailed architecture of the proposed ESA-Net. (a) Overall architecture illustrating the dual-stream feature extraction and classification pipeline. (b) Channel Attention Module, which calculates channel weights $W_{c}$ to produce refined features $F$. (c) Edge-Guided Spatial Attention Module, which concatenates pooled semantic features with structural edge features $F_{E}$ to generate the spatial attention map $W_{s}$. (d) Detailed illustrations of the Consecutive Average Pooling (CAP) strategy and (e) the Attention Visualization process using Grad-CAM.

Figure 3. Distribution of images across the 27 categories in the WikiArt dataset. The bar chart shows the number of images per style, while the orange line represents the cumulative share (Pareto curve), indicating that the top 12 classes account for 80% of the dataset.

Figure 4. Proportion of art styles in the WikiArt dataset. The central value indicates the total number of images (81,446). The top 10 dominant styles account for approximately 76.7% of the dataset, while the remaining 17 minority styles are aggregated as “Others” (23.3%), further visually demonstrating the severe class imbalance.

Figure 5. Consecutive Average Pooling (CAP) and convolutional layers strategy for edge feature extraction.

Figure 6. Grad-CAM comparison between the baseline EfficientNet-B0 and our ESA-Net. The baseline tends to focus on textured areas, whereas ESA-Net aligns better with structural contours, especially for Linear styles such as High Renaissance and Ukiyo-e.

Figure 7. Training dynamics of ESA-Net across different backbones. EfficientNet-B0 exhibits the most stable optimization trajectory.

Figure 8. Normalized confusion matrix on the WikiArt dataset. The model shows high efficacy in structurally distinct categories (e.g., Ukiyo-e, Cubism) but exhibits confusion in Texture-Dominant Styles (e.g., Action Painting vs. Abstract Expressionism).

Figure 9. Visual analysis of edge map representations for highly confused or historically overlapping art styles. (a) Lack of Formal Definition: Action Painting and Abstract Expressionism yield chaotic, non-semantic edge signals primarily derived from spontaneous paint splatters rather than object boundaries, limiting the effectiveness of structural priors. (b) Evolutionary Overlap: The visual trajectory from Impressionism (fragmented, high-frequency brushstrokes) to Post-Impressionism and Fauvism (increasingly distinct, continuous, and bold outlines) illustrates the chronological transition of structural logic, which explains the model’s probability overlap among these temporally adjacent movements.

Table 1. Summary of key notations used in Figure 2 and the architectural description.

Symbol	Dimension	Description
$I$	$3 \times H \times W$	Input RGB artwork image.
$E$	$1 \times H \times W$	Structural edge map extracted via Sobel operator.
$X$	$C' \times H' \times W'$	Raw semantic feature maps extracted by the backbone network.
$W_{c}$	$C' \times 1 \times 1$	Channel attention weights used to recalibrate feature importance.
$F$	$C' \times H' \times W'$	Refined features after channel-wise recalibration.
$F_{E}$	$1 \times H' \times W'$	Downsampled structural edge features (output of CAP strategy).
$F_{concat}$	$3 \times H' \times W'$	Concatenated tensor of spatial statistics and edge priors.
$W_{s}$	$1 \times H' \times W'$	Edge-guided spatial attention map (spatial gate).
$y$	Scalar/Vector	Final predicted artistic style category.
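
To make the shapes in Table 1 concrete, the following NumPy sketch traces one pass through an edge-guided spatial gate. It is an illustration under stated assumptions rather than the authors' implementation. Only the tensor shapes and the use of the Sobel operator come from Table 1. The 2 × 2 stride-2 form of CAP, the five pooling steps, the channel-wise mean and max as the two "spatial statistics", the 7 × 7 kernel (borrowed from the standard CBAM spatial branch), the random weights, the multiplicative gating of $F$ by $W_{s}$, the grayscale input, and all function names are assumptions introduced here.

```python
import numpy as np

# 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def sobel_edge_map(gray):
    """Return the gradient-magnitude edge map E (H x W) of a grayscale image."""
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += SOBEL_X[i, j] * window
            gy += SOBEL_Y[i, j] * window
    return np.hypot(gx, gy)


def consecutive_avg_pool(edge, steps):
    """Apply `steps` rounds of 2x2, stride-2 average pooling (assumed CAP form)."""
    out = edge
    for _ in range(steps):
        h, w = out.shape
        out = out[:h - h % 2, :w - w % 2]  # drop an odd trailing row/column
        out = out.reshape(out.shape[0] // 2, 2, out.shape[1] // 2, 2).mean(axis=(1, 3))
    return out


def edge_guided_spatial_attention(F, F_E, kernel, bias=0.0):
    """F: (C', H', W'), F_E: (1, H', W'), kernel: (3, k, k) with odd k.

    Returns the gated features (C', H', W') and the spatial gate W_s (1, H', W').
    """
    avg = F.mean(axis=0, keepdims=True)                 # channel-wise mean
    mx = F.max(axis=0, keepdims=True)                   # channel-wise max
    F_concat = np.concatenate([avg, mx, F_E], axis=0)   # (3, H', W')
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(F_concat, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, h, w = F_concat.shape
    logits = np.full((h, w), bias, dtype=float)
    for i in range(k):
        for j in range(k):
            logits += np.tensordot(kernel[:, i, j], padded[:, i:i + h, j:j + w], axes=1)
    W_s = 1.0 / (1.0 + np.exp(-logits))                 # sigmoid gate in (0, 1)
    return F * W_s[None], W_s[None]


rng = np.random.default_rng(0)
gray = rng.random((224, 224))                  # stand-in for a grayscale artwork
E = sobel_edge_map(gray)                       # (224, 224)
F = rng.standard_normal((8, 7, 7))             # hypothetical refined features, C' = 8
F_E = consecutive_avg_pool(E, steps=5)[None]   # 224 / 2**5 = 7 -> (1, 7, 7)
kernel = 0.1 * rng.standard_normal((3, 7, 7))  # untrained, illustrative weights
gated, W_s = edge_guided_spatial_attention(F, F_E, kernel)
print(F_E.shape, W_s.shape, gated.shape)
```

With a 224 × 224 input, five halvings give a 7 × 7 edge map, so $F_{E}$ lines up with an $H' = W' = 7$ feature grid; a backbone with a different output stride would need a different number of pooling steps.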

Table 2. Performance comparison of the proposed ESA-Net with baseline methods on the WikiArt dataset. All values are reported in percentage (%). Best results are highlighted in bold.

Method	Top 1	Top 5	Precision	Recall	F1-Score
Karayev et al. [11]	47.30	90.26	38.87	56.39	51.65
Tan et al. [15]	54.52	93.38	49.73	58.41	55.96
Cetinic et al. [14]	56.43	–	–	–	–
Mohammadi and Rustaee [37]	59.53	94.27	58.64	60.58	59.14
Lecoutre et al. [38]	62.80	95.15	63.28	61.63	59.92
Luo et al. [39]	63.37	95.74	66.76	62.02	61.45
Wu et al. [40]	64.44	96.58	68.10	62.96	62.81
Ours (ESA-Net)	69.40	95.97	70.39	68.01	68.80

Table 3. Comparison of different backbone architectures integrated with the proposed modules. All values are reported in percentage (%). Best results are highlighted in bold.

Backbone	Top 1	Top 5	Precision	Recall	F1-Score
ResNet50	66.42	95.64	66.52	63.90	64.89
DenseNet121	68.22	96.14	68.53	65.92	66.89
EfficientNet-B0	69.40	95.97	70.39	68.01	68.80

Table 4. Ablation results of EG-CBAM integration across different backbones. Best results are highlighted in bold.

Backbone	Condition	Top 1	Top 5	Precision	Recall	F1-Score
ResNet18	Original	59.42	93.81	57.77	55.21	55.82
	+EG-CBAM	60.20	93.95	58.13	56.69	57.02
EfficientNet-B0	Original	69.25	95.95	68.72	67.24	67.63
	+EG-CBAM	69.40	95.97	70.39	68.01	68.80

Table 5. Comparison of different attention mechanisms on EfficientNet-B0. Best results are highlighted in bold.

Mechanism	Top 1	Top 5	Precision	Recall	F1-Score
Base Model	69.25	95.95	68.72	67.24	67.63
Base + CBAM	68.65	96.14	68.48	66.57	67.26
Base + EG-CBAM	69.40	95.97	70.39	68.01	68.80

Table 6. Ablation of edge feature extraction strategies. Best results are highlighted in bold.

Strategy	Top 1	Top 5	Precision	Recall	F1-Score
No Edge Information	69.25	95.95	68.72	67.24	67.63
Single Pooling Layer	68.37	95.80	68.60	67.18	67.52
Convolutional Layer	67.96	96.11	67.43	66.19	66.42
Consecutive Average Pooling (CAP)	69.40	95.97	70.39	68.01	68.80

Table 7. Comparison of computational complexity and parameter volume between the baseline models and the proposed Edge-Guided models.

Model	Params	FLOPs
EfficientNet-B0 + EG-CBAM	4.247 M	3.797 G
EfficientNet-B0	4.042 M	3.796 G
ResNet18 + EG-CBAM	11.223 M	16.747 G
ResNet18	11.190 M	16.747 G
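
The overhead implied by Table 7 can be made explicit with a few lines of arithmetic. The snippet below only restates the table's values; the dictionary layout and the helper name `relative_overhead` are illustrative choices, not part of the original work.

```python
# (params in M, FLOPs in G), copied from Table 7.
costs = {
    "EfficientNet-B0": {"base": (4.042, 3.796), "eg_cbam": (4.247, 3.797)},
    "ResNet18": {"base": (11.190, 16.747), "eg_cbam": (11.223, 16.747)},
}


def relative_overhead(base, new):
    """Percentage increase of `new` over `base`."""
    return (new - base) / base * 100.0


overheads = {}
for name, entry in costs.items():
    base_params, base_flops = entry["base"]
    new_params, new_flops = entry["eg_cbam"]
    overheads[name] = (
        round(relative_overhead(base_params, new_params), 2),
        round(relative_overhead(base_flops, new_flops), 2),
    )
    print(f"{name}: params +{overheads[name][0]:.2f}%, FLOPs +{overheads[name][1]:.2f}%")
```

On these figures, EG-CBAM adds about 5.07% parameters to EfficientNet-B0 and about 0.29% to ResNet18, while the FLOPs change by at most 0.03%.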

Table 8. Stability analysis of ESA-Net over three independent runs with different random seeds.

Trial	Top 1 (%)	Top 5 (%)	Precision (%)	Recall (%)	F1-Score (%)
ESA-Net (seed = 42)	69.40	95.97	70.39	68.01	68.80
ESA-Net (seed = 123)	69.44	96.50	70.45	68.63	69.11
ESA-Net (seed = 456)	68.81	96.17	68.98	66.39	67.22
Mean ± Std	69.22 ± 0.35	96.21 ± 0.27	69.94 ± 0.83	67.68 ± 1.16	68.38 ± 1.01
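
The summary row of Table 8 can be checked from the three per-seed rows. The short script below uses Python's standard `statistics` module; it assumes the reported spread is the sample standard deviation (the n − 1 form, `statistics.stdev`), which is the form that reproduces the tabulated values. The variable names are illustrative.

```python
from statistics import mean, stdev

# Per-seed results from Table 8, in the order seed = 42, 123, 456.
runs = {
    "Top 1": [69.40, 69.44, 68.81],
    "Top 5": [95.97, 96.50, 96.17],
    "Precision": [70.39, 70.45, 68.98],
    "Recall": [68.01, 68.63, 66.39],
    "F1-Score": [68.80, 69.11, 67.22],
}

# Mean and sample standard deviation (n - 1 denominator), rounded to 2 decimals.
summary = {
    metric: (round(mean(values), 2), round(stdev(values), 2))
    for metric, values in runs.items()
}

for metric, (m, s) in summary.items():
    print(f"{metric}: {m:.2f} ± {s:.2f}")
```

With only three runs, the spread is a coarse estimate; the recall and F1-score deviations of roughly one percentage point are worth keeping in mind when reading the smaller gaps between configurations in the ablation tables.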

Table 9. Comprehensive performance comparison across different backbone architectures. Best results are highlighted in bold.

Backbone Type	Top 1 (%)	Top 5 (%)	Precision (%)	Recall (%)	F1-Score (%)
ESA-Net (w/ResNet18)	63.96	95.00	63.27	60.77	61.53
ESA-Net (w/Swin-T)	69.51	96.81	70.47	67.69	68.65
ESA-Net (w/EffNet-B0)	69.40	95.97	70.39	68.01	68.80

Table 10. Comparison of parameter volume and training throughput. The smallest parameter count and the fastest training speed are highlighted in bold.

Backbone Architecture	Parameters (M)	Training Speed (s/Epoch)
ESA-Net (w/ResNet18)	11.7	292
ESA-Net (w/Swin-T)	28.3	366
ESA-Net (w/EfficientNet-B0)	4.25	299

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yu, W.; Liu, X. Linear-Aware Attention: Enhancing Art Style Classification with Structural Edge Priors. Electronics 2026, 15, 2314. https://doi.org/10.3390/electronics15112314
