A Woodblock New Year Painting Style Classification Method Based on Structural-Aware Attention and Frequency-Domain Style Statistics

Wei, Hua; Diao, Zhihua; Diao, Junxiang; Wen, Liqin; Sun, Binbin; Chen, Xiaoxuan; Yin, Luping

doi:10.3390/electronics15102158

by Hua Wei ¹, Zhihua Diao ²,*, Junxiang Diao ¹,*, Liqin Wen ¹, Binbin Sun ¹, Xiaoxuan Chen ¹ and Luping Yin ¹

¹ School of Art and Design, Zhengzhou University of Light Industry, Zhengzhou 450002, China

² College of Electrical and Information Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China

* Authors to whom correspondence should be addressed.

Electronics 2026, 15(10), 2158; https://doi.org/10.3390/electronics15102158

Submission received: 27 March 2026 / Revised: 13 May 2026 / Accepted: 14 May 2026 / Published: 18 May 2026

(This article belongs to the Section Artificial Intelligence)

Abstract

To address the problems of subtle style differences, high inter-class similarity, and complex structural and texture features in woodblock New Year paintings, this paper proposes a style classification method for woodblock New Year paintings based on an improved ResNeXt-50. The method introduces SA-CBAM at the middle- and high-level feature stages. Through the synergistic effect of channel attention and edge-enhanced spatial attention, the model is guided to focus on key structural regions such as human contours. Furthermore, single-stage 2D-DWT is introduced to separate deep features into low-frequency global structural components and high-frequency local detail components, thereby enabling effective representation of overall composition information and fine-grained carving textures. The Gram matrix is introduced to conduct statistical modeling of the fusion features, so as to characterize the overall style distribution from the perspective of channel correlation. The model is trained and tested on a dataset of 4043 independent images across six categories, achieving an overall classification accuracy of 97.68%, which is significantly superior to mainstream models such as Vision Transformer. Ablation experiments further verify the complementary effects of each module in structural perception, frequency-domain feature representation, and style statistical modeling, demonstrating the effectiveness and application potential of the proposed method for digital preservation and fine-grained style recognition of woodblock New Year paintings.

Keywords: woodblock new year paintings; style classification; deep learning; improved ResNeXt-50

1. Introduction

As a treasure of Chinese folk art, woodblock New Year paintings embody the cultural wisdom and aesthetic tastes of the Chinese nation accumulated over thousands of years, and stand as a highly representative intangible cultural heritage [1,2,3]. In the digital era, the scientific inheritance and preservation of woodblock New Year paintings have become a crucial research topic, and the establishment of a comprehensive woodblock New Year painting database serves as the foundation for achieving this goal. Among related efforts, style classification is a core component of database construction, and its accuracy directly affects the application value of the database. Therefore, optimizing style classification methods holds key significance for in-depth research on and efficient utilization of woodblock New Year paintings [4,5,6].

In image classification tasks, the core challenge lies in accurately assigning input images to predefined classes. Traditional style classification methods for woodblock New Year paintings often rely on manually designed features, such as HSV histograms for color [7], gray-level co-occurrence matrices for texture, and edge detection operators for shape. These features are then combined with traditional machine learning algorithms [8] such as Support Vector Machines (SVMs) [9] and decision trees [10] to complete the classification. While such methods work to a certain extent on simple style differences, the style of woodblock New Year paintings is shaped by multiple factors including region, history, and folk customs, resulting in complex and delicate style features. Because traditional methods depend heavily on manual experience, they struggle to capture the deep semantic information behind the styles, and their classification performance on large-scale, multi-style woodblock New Year painting datasets is often unsatisfactory [11].

The emergence of deep learning-based image classification methods has brought major breakthroughs in solving complex image classification problems [12,13,14]. Convolutional neural networks (CNNs), with their hierarchical structure, automatically learn feature representations of images through the local receptive fields and weight-sharing mechanism of convolutional layers. They progressively build up from low-level pixel features to high-level semantic features, eliminating the need for manual intervention in feature extraction [15,16,17]. The end-to-end learning mode enables them to learn classification rules directly from raw image data. Training on large-scale datasets allows them to capture complex feature correlations, and they have demonstrated outstanding performance in numerous image classification tasks [18,19,20].

In view of this, and to address the particularities of woodblock New Year painting style classification, this study proposes an improved model based on ResNeXt, a residual network with aggregated transformations whose name alludes to the “next” dimension, cardinality. By optimizing the network structure, the model enhances its ability to learn the subtle style features of woodblock New Year paintings and to mine the information that distinguishes different styles. This improves the accuracy of woodblock New Year painting style classification and provides more robust technical support for the digital preservation and inheritance of woodblock New Year paintings.

2. Materials and Methods

2.1. Dataset Collection and Preprocessing

Traditional Chinese woodblock New Year paintings have a long history and a rich variety of categories. The image dataset constructed in this study mainly covers six major styles, divided by thematic characteristics into Fengqing (FQ), Fuxiang (FX), Jiqing (JQ), Menshen (MS), Xichu (XC), and Zahua (ZH), as shown in Figure 1. To ensure the universality of the dataset in diverse application scenarios, samples of each style exhibit systematic differences in dimensions such as color saturation, composition perspective, and printing techniques.

The raw data collected in this study consists of 4043 independent images. All images are manually screened to ensure accurate category annotation. The images cover six categories in total, including 684 images of FQ, 659 images of FX, 667 images of JQ, 652 images of MS, 667 images of XC, and 714 images of ZH. This dataset strictly follows the best practices in the field of computer vision and is organized in the standard ImageNet format. To guarantee the reliability and stability of model evaluation, a stratified sampling strategy is adopted to divide the raw data into a training set, a validation set and a test set at a ratio of 4:1:1. At the data processing stage, data augmentation is only performed on the training set to improve the generalization ability of the model. After the augmentation of the training set, the total data scale in a single round of experiments increases from 4043 to approximately 6734 images. Meanwhile, in each round of independent experiments, the validation set and the test set maintain the original division and do not participate in any data augmentation operations. This ensures that the three types of data are completely mutually exclusive and there is no data leakage.
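The stratified 4:1:1 division described above can be sketched as follows. This is a minimal illustration, not the authors' script: the sample IDs are hypothetical placeholders for image paths, and the rounding rule (floor for validation and test, remainder to training) is an assumption, since the text does not state how non-divisible class sizes are handled.

```python
import random

# Per-class image counts reported in the text (total 4043).
CLASS_COUNTS = {"FQ": 684, "FX": 659, "JQ": 667, "MS": 652, "XC": 667, "ZH": 714}


def stratified_split(class_counts, ratio=(4, 1, 1), seed=0):
    """Split each class separately so every subset keeps the class proportions."""
    rng = random.Random(seed)
    total_parts = sum(ratio)
    splits = {"train": [], "val": [], "test": []}
    for label, count in class_counts.items():
        # Hypothetical sample IDs; real code would list image file paths here.
        items = [f"{label}_{i:04d}" for i in range(count)]
        rng.shuffle(items)
        n_val = count * ratio[1] // total_parts    # assumed rounding: floor
        n_test = count * ratio[2] // total_parts
        splits["val"] += items[:n_val]
        splits["test"] += items[n_val:n_val + n_test]
        splits["train"] += items[n_val + n_test:]  # remainder goes to training
    return splits


splits = stratified_split(CLASS_COUNTS)
print({name: len(items) for name, items in splits.items()})
```

Because each image ID lands in exactly one list, the three subsets are mutually exclusive by construction; augmentation would then be applied to `splits["train"]` only.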

It is worth noting that all woodblock New Year picture images used in the dataset construction and illustration descriptions in this paper are from Collection of Chinese Woodblock New Year Pictures, edited by Feng Jicai and published by Zhonghua Book Company. As a systematically compiled authoritative academic publication, most of the New Year pictures included in this collection are historical public domain works. The citation and processing of these images in this paper are strictly limited to academic research and explanatory purposes to ensure full compliance with academic norms and copyright requirements.

2.2. Proposed Method

2.2.1. ResNeXt Network

ResNeXt [21], as a highly influential convolutional neural network model in the field of deep learning, integrates the residual learning framework of ResNet (Residual Neural Network) [22] with the multi-path feature extraction capability of Inception. The core innovation of ResNeXt lies in the introduction of the concept of “cardinality,” which refers to the number of parallel paths or groups within each module of the network. This adds a new dimension to the scalability of the network, and when combined with the traditional dimensions of network depth and width, it greatly improves the model’s performance. Unlike traditional convolutional neural networks, ResNeXt adopts grouped convolution operations inside its building blocks. This enables ResNeXt modules to have multiple branches due to the grouping method, and all branches share a consistent network structure. Compared with the design of ordinary Inception modules where the topological structures of different branches vary, this design strategy of ResNeXt effectively simplifies the network construction process and significantly reduces the number of hyperparameters. Its simplified Inception structure can be expressed by Equation (1).

F (x) = \sum_{i = 1}^{C} Γ_{i} (x)

(1)

Here, Γ_i(x) denotes the i-th transformation function, which maps x into another space, C represents the cardinality, that is, the number of parallel transformations (groups), and F(x) is the sum of the C transformations.

In its design, the ResNeXt module fully integrates the core idea of ResNet, which is specifically reflected in the introduction of a shortcut mechanism inside the module. This shortcut directly connects the input end of the module to the output ends of each branch, allowing the input features to be fused with the output features obtained through the computation of each branch in a summation manner. The specific computation process can be expressed by Equation (2). This feature fusion method effectively alleviates the gradient vanishing problem that may occur in deep networks, helps retain the original information in the input features, and achieves complementary enhancement with the features extracted by each branch. Thus, it improves the module’s ability to express features.

F (x) = x + \sum_{i = 1}^{C} Γ_{i} (x)

(2)

Meanwhile, grouped convolution operations enable the model to learn more complex feature representations without a significant increase in computational complexity. Experimental results show that ResNeXt significantly improves classification accuracy while keeping the model complexity and the number of parameters relatively stable. In the large-scale ImageNet image classification task, compared with the traditional ResNet model, the ResNeXt model achieves a noticeable reduction in both validation error and training error. This advantage is particularly prominent when dealing with large datasets.

The ResNeXt module has three equivalent structural forms, as shown in Figure 2. Among them, each branch of the Grouped Residual Block consists of a 1 × 1 convolution for channel dimensionality reduction, a 3 × 3 convolution for spatial feature extraction, and a 1 × 1 convolution for channel dimensionality increase, arranged in sequence. The output features of each branch are first summed, and then residual connection is performed with the original input features, as shown in Figure 2a. This structure allows different branches to focus on style features of woodblock New Year paintings at different spatial scales. The 3 × 3 convolution is highly effective in extracting spatial features and can capture style information across these different spatial ranges. The summation operation effectively integrates the features of each branch, and when combined with residual connection, it enables the model to better learn the differences between style features.

Figure 2b shows the Aggregated Residual Block. In this structure, each branch consists of a 1 × 1 convolution and a 3 × 3 convolution arranged in sequence. The outputs of the branches undergo feature fusion through concatenation, followed by channel integration via a 1 × 1 convolution, and are finally added to the original input features. The style of woodblock New Year paintings encompasses multi-dimensional features. The concatenation-based fusion method can preserve the independence and integrity of features from each branch, ensuring that style features such as color, texture, and pattern learned by different branches are fully retained. Subsequent channel integration via a 1 × 1 convolution then organically combines these multi-dimensional style features, thereby reflecting the style characteristics of woodblock New Year paintings more comprehensively and facilitating improvements in classification accuracy.
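To make the relation between the summation form of Equation (2) and the concatenation form concrete, the sketch below models each branch Γ_i as a small linear bottleneck with a ReLU. This is a simplification under stated assumptions: real ResNeXt branches use convolutions and normalization, and the weights, widths, and function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_in, d_mid = 32, 256, 4  # cardinality, input width, per-branch bottleneck width

# Each branch is sketched as reduce -> ReLU -> expand (hypothetical random weights).
W_down = rng.standard_normal((C, d_mid, d_in)) * 0.05
W_up = rng.standard_normal((C, d_in, d_mid)) * 0.05


def branch(i, x):
    return W_up[i] @ np.maximum(W_down[i] @ x, 0.0)


def aggregated_residual(x):
    # Equation (2): identity shortcut plus the sum of the C branch outputs.
    return x + sum(branch(i, x) for i in range(C))


x = rng.standard_normal(d_in)
y = aggregated_residual(x)

# "Concatenate, then project" form: stack all bottlenecks, apply one wide layer.
hidden = np.maximum(W_down.reshape(C * d_mid, d_in) @ x, 0.0)
y_concat = x + np.concatenate(list(W_up), axis=1) @ hidden
print(np.allclose(y, y_concat))
```

The two results agree because summing the per-branch outputs is the same linear operation as applying one wide projection to the concatenated branch activations.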

For the Simplified Group Convolution Block, the input features first pass through a 1 × 1 convolution, then a 3 × 3 grouped convolution with the number of groups set to 32, and finally another 1 × 1 convolution, before completing the residual connection with the original input features, as shown in Figure 2c. Grouped convolution extracts features effectively while reducing the amount of computation, because each group of filters only sees its own slice of the input channels. For woodblock New Year paintings, extracting style features requires a certain amount of computation to ensure accuracy, but excessive computation increases the model’s complexity. While still extracting sufficient style features, this structure reduces computational costs, making the model more efficient on the woodblock New Year painting style classification task. Meanwhile, the setting of 32 groups enables the model to explore the subtle style differences in woodblock New Year paintings in finer detail.
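The saving from splitting a 3 × 3 convolution into 32 groups can be checked with a short weight count. The helper name is hypothetical, and the width of 128 channels is taken from the first stage of the standard ResNeXt-50 (32×4d) design rather than from this paper.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (bias omitted)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


width = 128  # assumed 3x3 width: 32 groups x 4 channels per group
dense = conv_params(width, width, 3)
grouped = conv_params(width, width, 3, groups=32)
print(dense, grouped, dense // grouped)  # 147456 4608 32
```

At equal width, the grouped layer needs 32 times fewer weights, which is what lets ResNeXt widen the bottleneck without raising the overall parameter budget.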

2.2.2. SA-CBAM

To enhance the model’s ability to understand and strengthen features of images with distinct structural characteristics, such as woodblock New Year paintings, this study proposes a Structure-Aware Convolutional Block Attention Module (SA-CBAM) based on the traditional CBAM (Convolutional Block Attention Module) [23]. SA-CBAM reconstructs the spatial attention module of the traditional CBAM: the gradient edge map extracted by the Sobel operator [24] replaces the max-pooled features and is fused with the channel average-pooled features. This enables the model to explicitly perceive structural information in images, such as line directions and contour boundaries, which improves both its focus on key visual elements and its ability to decompose complex traditional patterns.

The Sobel operator mainly consists of two directional convolution kernels, which are used to detect edge gradients in the horizontal and vertical directions. The implementation formula is expressed by Equation (3).

K_{x} = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}, \quad K_{y} = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}

(3)

Here, K_x denotes the horizontal gradient kernel, which calculates the changes in the left–right direction of the image and detects vertical edges. K_y represents the vertical gradient kernel, which computes the changes in the up–down direction of the image and detects horizontal edges.

Since the Sobel operation along the channel dimension has no practical semantic meaning, this study first performs channel averaging on the feature maps to obtain a grayscale feature map G, which is expressed by Equation (4).

G = \frac{1}{C} \sum_{c = 1}^{C} X_{c}

(4)

Here, X ∈ R^{B×C×H×W} denotes the feature map, G ∈ R^{B×1×H×W} represents the grayscale map, C indicates the number of channels of the input feature map, and X_c stands for the 2D feature map of channel c.

Next, the grayscale feature map is convolved with the two Sobel convolution kernels in two dimensions, resulting in first-order gradients G_x and G_y in the two directions, respectively, expressed by Equation (5).

G_{x} = G * K_{x}, G_{y} = G * K_{y}

(5)

Subsequently, the Euclidean norm (L2 norm) is used to integrate the gradients from the two directions, resulting in an edge map E, which is expressed by Equation (6).

E = \sqrt{G_{x}^{2} + G_{y}^{2} + ε}

(6)

Here, E ∈ R^{B×1×H×W}, and ε is a small positive number used to prevent numerical instability.

Finally, the edge map extracted by Sobel and the averaged map after channel compression are concatenated to serve as the input for spatial attention, which is expressed by Equation (7).

A_{spatial} = σ(\mathrm{Conv}_{7 \times 7}(\mathrm{Concat}(\mathrm{AvgPool}(X), E)))

(7)

Here, σ(·) denotes the sigmoid activation function, Conv_7×7 refers to a 7 × 7 convolutional layer with padding, and A_spatial represents the spatial attention map.
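Equations (3)–(7) can be traced end to end with a small NumPy sketch. It is illustrative only: the batch dimension is dropped, the 7 × 7 weights are random stand-ins for learned parameters, `filter2d_same` and `edge_spatial_attention` are hypothetical names, and AvgPool(X) is read as the channel average G from Equation (4).

```python
import numpy as np

KX = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)  # Equation (3)
KY = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]], dtype=float)


def filter2d_same(img, kernel):
    """Zero-padded 'same' cross-correlation of a 2D map with a small kernel.

    Cross-correlation differs from true convolution by a kernel flip, which
    only changes the sign of the Sobel responses, not the edge magnitude.
    """
    kh, kw = kernel.shape
    H, W = img.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + H, j:j + W]
    return out


def edge_spatial_attention(X, conv_w, eps=1e-6):
    """X: (C, H, W) feature map; conv_w: (2, 7, 7) weights of the 7x7 layer."""
    G = X.mean(axis=0)                                   # Equation (4)
    Gx, Gy = filter2d_same(G, KX), filter2d_same(G, KY)  # Equation (5)
    E = np.sqrt(Gx ** 2 + Gy ** 2 + eps)                 # Equation (6)
    # Equation (7): two input maps (channel average, edge map) -> one attention map.
    logits = filter2d_same(G, conv_w[0]) + filter2d_same(E, conv_w[1])
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
X = rng.standard_normal((16, 14, 14))
A = edge_spatial_attention(X, rng.standard_normal((2, 7, 7)) * 0.1)
print(A.shape)
```

A two-channel 7 × 7 convolution with one output channel is the sum of two single-channel filters, which is why the sketch adds the two filtered maps before the sigmoid.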

The detailed pipeline of the SA-CBAM is shown in Figure 3. Structurally, SA-CBAM also consists of two cascaded sub-modules: the channel attention module is used to model the feature importance across channels, while the spatial attention module in this work is improved into the Edge-Enhanced Spatial Attention Module (EESAM). Based on the traditional pooled feature maps, EESAM introduces the edge map extracted by the Sobel operator as a structural prior to guide the attention mechanism to focus on discriminative regions such as contours and edges, effectively enhancing the model’s response capability to fine-grained structural features. The two attention modules are still combined in a cascaded manner to synergistically enhance feature representation, thereby improving the overall classification performance.

The channel attention module (CAM) primarily enhances feature representation capability by modeling inter-channel correlations, with the specific process shown in Figure 4. Specifically, the module first applies Global Max Pooling and Global Average Pooling operations to the input feature map, thereby obtaining two sets of channel description vectors with a size of 1 × 1 × C. Subsequently, these two sets of vectors are input into a two-layer Multi-Layer Perceptron (MLP) with shared weights. The first layer of this MLP contains C/r neurons (where r denotes the reduction ratio) and uses ReLU as the activation function; the second layer restores to C neurons to match the channel dimension. After obtaining two channel attention maps, the module performs an element-wise summation operation on them and normalizes via the sigmoid function to generate the final channel attention weights. Finally, the attention weights are element-wise-multiplied with the input feature map along the channel dimension to achieve adaptive recalibration of channel features, thus outputting the enhanced feature map.
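The channel attention steps just described map onto a few lines of NumPy. This is a sketch under assumptions: the batch dimension is omitted, the MLP weights are random rather than learned, biases are left out, and the reduction ratio r = 16 is a hypothetical value since the text does not give one.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(X, W1, W2):
    """X: (C, H, W); W1: (C // r, C) and W2: (C, C // r) are the shared MLP weights."""
    avg_desc = X.mean(axis=(1, 2))   # global average pooling -> (C,)
    max_desc = X.max(axis=(1, 2))    # global max pooling -> (C,)

    def mlp(v):
        return W2 @ np.maximum(W1 @ v, 0.0)  # C -> C/r (ReLU) -> C

    weights = sigmoid(mlp(avg_desc) + mlp(max_desc))  # (C,), each in (0, 1)
    return X * weights[:, None, None], weights


rng = np.random.default_rng(0)
C, r = 32, 16  # r = 16 is a hypothetical reduction ratio
X = rng.standard_normal((C, 7, 7))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
Y, w = channel_attention(X, W1, W2)
print(Y.shape, w.shape)
```

The same `mlp` closure processes both descriptors, which is what "shared weights" means here: the two pooling paths differ only in their input.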

The Edge-Enhanced Spatial Attention Module (EESAM) takes the feature map output by the channel attention module (CAM) as its input and aims to further explore the spatial distribution information and structural features of the image, with the specific process shown in Figure 5. First, the module performs two operations on the input feature map: one is to use average pooling to obtain an average feature map in the channel dimension, and the other is to extract the edge map of the input feature map using the Sobel operator to enhance the expression capability of structural information. These two generated feature maps are concatenated along the channel dimension to form a new feature map that integrates texture and edge structure information. Subsequently, this feature map is fed into a convolutional layer with a kernel size of 7 × 7. Through batch normalization and the sigmoid activation function, a spatial attention map is obtained. This attention map not only contains texture distribution but also integrates the edge structure information of the image. Finally, the attention map undergoes an element-wise multiplication operation with the input feature map, and through a residual gating mechanism, the output feature map of the SA-CBAM is generated.

2.2.3. 2D-DWT Frequency-Domain Feature Modeling

Naturally collected and digitally preserved woodblock New Year painting images often suffer from paper aging, pigment fading, scanning noise, texture wear and other problems. Meanwhile, New Year paintings of different styles show high similarity in composition layout, line density and decorative patterns. Relying only on spatial-domain features may easily lead to confusion in style discrimination. To improve the model’s ability to represent global structural information and local texture details, this study introduces 2D-DWT (2D discrete wavelet transform) [25] for frequency-domain modeling in the process of deep feature extraction. Image features are decomposed into different frequency subbands to better characterize the global structure and local high-frequency details in a targeted manner.

The 2D-DWT maps the attention-enhanced deep features into different frequency subbands through single-stage frequency decomposition along the horizontal and vertical directions, effectively separating global structural information from local texture details. Given an input two-dimensional feature map X ∈ R^{H×W} (where H and W denote the height and width of the feature map, respectively), its single decomposition process yields four subbands, expressed as DWT(X) = {LL, LH, HL, HH}. Specifically, the 2D-DWT performs convolution (*) and downsampling (↓₂) operations on X using a 1D low-pass filter g(·) and a high-pass filter h(·) along the row and column directions. The specific calculations for each subband are expressed in Equations (8)–(11).

L L = ↓_{2} (g * ↓_{2} (g * X))

(8)

L H = ↓_{2} (h * ↓_{2} (g * X))

(9)

H L = ↓_{2} (g * ↓_{2} (h * X))

(10)

H H = ↓_{2} (h * ↓_{2} (h * X))

(11)

Here, the low-frequency approximation subband LL mainly contains the overall structure and contour information of the image. The high-frequency subbands LH, HL, and HH describe detailed textures, edge structures, and line variations in the vertical, horizontal, and diagonal directions, respectively. In this study, the Haar wavelet is adopted as the basis function. Owing to its simple filter structure and high computational efficiency, it can suppress noise interference while preserving essential style and structure information.
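For the Haar basis, Equations (8)–(11) reduce to sums and differences of neighbouring samples. The sketch below assumes the orthonormal Haar filters g = [1, 1]/√2 and h = [1, −1]/√2, applies the inner filter along rows and the outer filter along columns, and requires even height and width; the sign and ordering conventions vary between libraries, so treat these choices as assumptions.

```python
import numpy as np


def haar_dwt2(X):
    """Single-level orthonormal Haar DWT of a 2D map with even height and width."""
    s = np.sqrt(2.0)
    # Inner step: filter and downsample along each row (pairs of neighbouring columns).
    lo = (X[:, 0::2] + X[:, 1::2]) / s   # low-pass g, then downsample by 2
    hi = (X[:, 0::2] - X[:, 1::2]) / s   # high-pass h, then downsample by 2
    # Outer step: filter and downsample along each column (pairs of neighbouring rows).
    LL = (lo[0::2] + lo[1::2]) / s       # Equation (8)
    LH = (lo[0::2] - lo[1::2]) / s       # Equation (9)
    HL = (hi[0::2] + hi[1::2]) / s       # Equation (10)
    HH = (hi[0::2] - hi[1::2]) / s       # Equation (11)
    return LL, LH, HL, HH


rng = np.random.default_rng(0)
X = rng.standard_normal((8, 8))
LL, LH, HL, HH = haar_dwt2(X)
energy = sum(float((b ** 2).sum()) for b in (LL, LH, HL, HH))
print(LL.shape, np.isclose(energy, (X ** 2).sum()))
```

Each subband has half the spatial resolution of the input, and with orthonormal filters the four subbands together carry the same total energy as the original map, so the decomposition rearranges information without discarding it.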

In order to intuitively illustrate the decomposition principle of style information of woodblock New Year paintings by the two-dimensional discrete wavelet transform, this study conducts visual analysis with original woodblock New Year painting images as examples, as shown in Figure 6. Through one 2D-DWT decomposition, the input woodblock New Year painting image can be divided into one low-frequency subband (LL) and three high-frequency subbands (LH, HL and HH). Here, the low-frequency subband mainly retains the overall brightness distribution and spatial structure information of the image, which can reflect the global characteristics of woodblock New Year paintings such as the composition layout, figure proportion and overall style framework. The high-frequency subbands describe the local variation characteristics of the image in different directions, and mainly retain detailed information closely related to painting styles including line drawing and carving traces, decorative patterns, edge contours and local patterns.

Since a single frequency subband only contains one aspect of image style information, this study combines all frequency subbands in the channel dimension to construct a multi-channel frequency-domain feature representation, so as to enhance expression ability for complex style patterns. It should be noted that, different from the traditional cascaded wavelet transform, the proposed model adopts single-stage 2D-DWT to model low-frequency global structures and high-frequency local details in the frequency domain. The structural scale-aware representation of this architecture does not rely on multi-level spatial pyramid decomposition. In contrast, the generated frequency subbands themselves correspond to different types of structural scale components. Low-frequency subbands capture the coarse-scale global composition framework, while high-frequency subbands acquire fine-scale local depiction details. With the help of this multi-scale frequency-domain representation, the model can not only effectively suppress redundant noise that is not related to style, but also fully strengthen the stable structure and texture features of woodblock New Year paintings. Finally, the reconstructed fused frequency-domain features are combined with high-level spatial-domain features extracted by the backbone network to achieve dual-domain collaboration. This provides more comprehensive and robust feature support for subsequent style statistical modeling and classification decision making.

2.2.4. Gram Matrix-Based Style Statistical Modeling

To further characterize the global co-occurrence relationship and correlation structure of woodblock New Year painting styles in the channel dimension, this study introduces the Gram matrix [26] to conduct style statistical modeling on features, so as to improve the model’s ability to model overall style patterns and classification robustness. The Gram matrix is a classic tool used to characterize the similarity and correlation between a set of vectors, and is widely adopted in style modeling and statistical feature analysis for deep learning. It measures the correlation strength between different feature dimensions through vector inner products, and systematically organizes these correlations in matrix form, thus forming a global statistical description that is relatively independent of spatial location.

In deep neural networks, multi-channel feature maps are utilized to characterize the response patterns of an input image across different semantic dimensions. Given a fused feature map F ∈ R^{C×H×W} (where C, H, and W denote the number of feature channels, spatial height, and width, respectively), it is first flattened along the spatial dimension to obtain a two-dimensional matrix \tilde{F} ∈ R^{C×N}, where N = H × W. Here, the vector \tilde{F}_{i} ∈ R^{N} represents the vectorized spatial representation of the i-th feature channel. To measure the correlation between different channel features at the overall distribution level, we compute the Gram matrix G ∈ R^{C×C} by modeling the inner product relationship between feature channels. The definitions of G and of its entry G_{ij}, which represents the global spatial correlation degree between the i-th and j-th channels, are shown in Equations (12) and (13).

G = \tilde{F} {\tilde{F}}^{T}

(12)

G_{ij} = \sum_{k = 1}^{N} \tilde{F}_{i, k} \tilde{F}_{j, k}

(13)

To further clarify the role of the Gram matrix in feature channel correlation modeling, Figure 7 provides a schematic illustration of its calculation process. By performing matrix multiplication between the flattened feature matrix and its transpose matrix, the model can statistically model the global cooperative response relationship of different feature channels without explicitly relying on spatial position information, thereby forming a global description of the overall style distribution characteristics.
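Equations (12) and (13) can be verified numerically on a small random feature map. The sizes below are arbitrary illustration values, and the permutation check is an added demonstration of the position-independence claim rather than part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 6, 7, 7
F = rng.standard_normal((C, H, W))

F_flat = F.reshape(C, H * W)   # flattened features, shape (C, N) with N = H * W
G = F_flat @ F_flat.T          # Equation (12), shape (C, C)

# Equation (13) for one entry, written as an explicit sum over spatial positions.
i, j = 1, 4
g_ij = sum(F_flat[i, k] * F_flat[j, k] for k in range(H * W))

# Permuting spatial positions leaves G unchanged: it ignores where responses occur.
perm = rng.permutation(H * W)
G_perm = F_flat[:, perm] @ F_flat[:, perm].T
print(np.isclose(G[i, j], g_ij), np.allclose(G, G_perm), np.allclose(G, G.T))
```

The matrix is symmetric and invariant to any reordering of the N spatial positions, which is the sense in which it describes style statistics independently of spatial location.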

Unlike first-order features that only focus on local response intensity, the Gram matrix can effectively characterize the overall consistent pattern layout, line organization, and decorative structure features in woodblock New Year painting styles through second-order statistical relationships between channels. This statistical modeling method weakens the influence of specific spatial positions to a certain extent and focuses more on the global co-occurrence relationship of style patterns. In this stage, the global parameter-free Gram matrix is introduced to capture the statistical representation of overall styles by modeling the correlation between channels. Considering the high-dimensional characteristics of the deep feature space with 2048 channels, this study adopts the representation form of parameter-free statistical operators, rather than introducing higher-order statistics or additional parameterized pooling structures. This design effectively avoids the overfitting risk caused by the sharp increase in parameter quantity, and achieves a balance between representation accuracy and generalization performance under the current dataset scale. It ensures the stability and robustness of the extracted style attributes.

2.2.5. Improved ResNeXt-50 Network

To enhance the representation ability of texture details, structural features, and overall style patterns in the style classification task of woodblock New Year paintings, this paper conducts customized improvements on the ResNeXt-50 network structure. The overall structure is shown in Figure 8.

Considering that woodblock New Year paintings have prominent structural and texture features in character modeling, line carving traces, and decorative patterns, this study conducts multi-level improvements on the basis of the ResNeXt-50 (32×4d) backbone network. First, the structure-aware attention module SA-CBAM is introduced into layer3 (Conv3) and layer4 (Conv4) at the middle- and high-level feature stages to jointly model feature importance along the channel and spatial dimensions. Channel attention highlights feature channels that contribute strongly to style discrimination, while edge-enhanced spatial attention strengthens the network’s perception of line edges and local structural regions by introducing the Sobel edge response. Through this layer-by-layer feature enhancement, the layer4 stage outputs a spatial-domain high-level feature map with a structural prior (spatial feature, denoted as F_sp), whose tensor dimension is B × 2048 × 7 × 7 (where B is the batch size).

To extract independent frequency details, the feature map F_sp is aligned to B × 2048 × 8 × 8 through zero-padding, and single-stage two-dimensional discrete wavelet decomposition is then applied. Haar wavelets generate four frequency subbands (LL, LH, HL, HH), each with a dimension of B × 2048 × 4 × 4, separating the global low-frequency framework from local high-frequency details. These four subbands are concatenated along the channel dimension to form a joint tensor of B × 8192 × 4 × 4, which a 1 × 1 convolution reduces in dimension and reconstructs into fused frequency-domain features (frequency feature, denoted as F_freq) with a dimension of B × 2048 × 4 × 4.

To achieve deep collaboration between the spatial and frequency domains and fully retain the spatial structure information enhanced by SA-CBAM, we upsample F_freq to a spatial resolution of 7 × 7 through bilinear interpolation and add it element-wise to F_sp. The spatial–frequency-domain collaboration process is expressed as Equation (14).

F_{joint} = F_{sp} + \mathrm{Upsample}(F_{freq})

(14)

The resulting spatial- and frequency-domain joint feature (joint feature, denoted as F_joint) possesses both fine spatial structures and rich frequency-domain details, and its dimension remains B × 2048 × 7 × 7. Finally, the global correlation modeling of fused features is completed through the style statistical method based on the Gram matrix. F_joint is flattened into a two-dimensional feature matrix in the spatial dimension and recorded as M. The dimension of matrix M is B × 2048 × 49, and 49 = 7 × 7 represents the total number of pixels in the spatial dimension of the feature map. The Gram matrix G with a dimension of B × 2048 × 2048 is obtained by directly calculating the inner product of matrix M and its own transpose. This method strengthens the second-order collaborative description ability of the overall style pattern at the statistical level. Global Average Pooling is performed on the Gram matrix G to compress it into a feature matrix with a dimension of B × 2048, and the processed feature matrix is fed into the fully connected layer. The Softmax function is adopted for activation to output the final classification probability.
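The tensor shapes quoted in this subsection can be traced with a NumPy shape walk-through. It is a sketch, not the authors' implementation: the channel count is reduced from 2048 to 16 to keep it light, all weights are random, padding is placed on the bottom and right edges, the interpolation uses an align-corners convention, and the Gram matrix is averaged over its last axis. None of these choices is specified in the text, and the helper names are hypothetical.

```python
import numpy as np


def haar_dwt2(X):
    """Single-level Haar DWT over the last two axes (even sizes)."""
    s = np.sqrt(2.0)
    lo = (X[..., :, 0::2] + X[..., :, 1::2]) / s
    hi = (X[..., :, 0::2] - X[..., :, 1::2]) / s
    return [(lo[..., 0::2, :] + lo[..., 1::2, :]) / s,   # LL
            (lo[..., 0::2, :] - lo[..., 1::2, :]) / s,   # LH
            (hi[..., 0::2, :] + hi[..., 1::2, :]) / s,   # HL
            (hi[..., 0::2, :] - hi[..., 1::2, :]) / s]   # HH


def bilinear_matrix(n_in, n_out):
    """1D linear-interpolation matrix (align-corners convention, an assumption)."""
    pos = np.linspace(0, n_in - 1, n_out)
    left = np.minimum(np.floor(pos).astype(int), n_in - 2)
    frac = pos - left
    M = np.zeros((n_out, n_in))
    M[np.arange(n_out), left] = 1 - frac
    M[np.arange(n_out), left + 1] = frac
    return M


rng = np.random.default_rng(0)
B, C, n_cls = 2, 16, 6          # C = 2048 in the paper; 16 keeps the demo light
F_sp = rng.standard_normal((B, C, 7, 7))

padded = np.pad(F_sp, ((0, 0), (0, 0), (0, 1), (0, 1)))   # (B, C, 8, 8)
bands = np.concatenate(haar_dwt2(padded), axis=1)          # (B, 4C, 4, 4)
W_1x1 = rng.standard_normal((C, 4 * C)) * 0.1              # 1x1 conv as a matrix
F_freq = np.einsum("oc,bchw->bohw", W_1x1, bands)          # (B, C, 4, 4)

U = bilinear_matrix(4, 7)
F_up = np.einsum("ih,bchw,jw->bcij", U, F_freq, U)         # (B, C, 7, 7)
F_joint = F_sp + F_up                                      # Equation (14)

M = F_joint.reshape(B, C, 49)
G = M @ M.transpose(0, 2, 1)                               # (B, C, C)
pooled = G.mean(axis=2)                                    # (B, C)

W_fc = rng.standard_normal((n_cls, C)) * 0.01
logits = pooled @ W_fc.T
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)                  # softmax over 6 classes
print(bands.shape, F_joint.shape, G.shape, probs.shape)
```

Because the Gram matrix is symmetric, averaging over either of its two channel axes gives the same B × C descriptor, so the choice of pooling axis does not affect the result.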

In summary, on the basis of maintaining the efficient feature extraction capability of the ResNeXt backbone network, the improved ResNeXt-50 model proposed in this paper integrates a structure-aware attention mechanism, frequency-domain structural and detail analysis, and style statistical modeling. It effectively enhances the comprehensive modeling ability of the network for the style features of woodblock New Year paintings, so as to provide more discriminative and stable feature representation for the final style classification.

3. Experiment and Result Analysis

3.1. Experimental Environment and Parameter Settings

The experimental platform for this study utilized a 64-bit Windows 11 operating system. The hardware configuration included an Intel(R) Core(TM) i5-11260H 2.6 GHz CPU, an NVIDIA GeForce RTX 3050 12 GB GPU, and 16 GB RAM. The software environment was configured with CUDA version 11.8, Python version 3.8, and PyTorch version 2.0.0.

To ensure the fairness of the experiments and the reproducibility of the results, consistent foundational settings were applied across all experiments in this study. All input images are uniformly scaled and center-cropped to 224 × 224 pixels through preprocessing before being fed into the network for training and inference. To guarantee the objectivity of the evaluation, standardized data augmentation strategies were applied exclusively to the training set to expand the diversity of the data distribution and enhance the models’ robustness to complex input features. In terms of hardware and training specifications, due to the 12 GB VRAM limitation of the experimental GPU and to prevent out-of-memory (OOM) errors, the batch size for all models was strictly unified to 8. Additionally, because the scale of the woodblock New Year painting dataset is relatively limited and insufficient to drive the convergence of large models trained from scratch, all models were uniformly initialized using ImageNet pre-trained weights. The total number of training epochs was set to 150 for all models, and a cosine annealing scheduler was uniformly adopted for smooth learning rate decay.

Regarding specific optimization strategies, appropriate distinctions were made based on the experimental objectives. The ablation study strictly adhered to the single-variable control principle; all ablation models adopted the ResNeXt-50 architecture and uniformly utilized the SGD optimizer with an initial learning rate of 0.002 and a weight decay of 1 × 10⁻⁴. In the comparison experiments, considering that different architectures exhibit varying sensitivities to optimization strategies, enforcing identical optimizers and learning rates may hinder certain models from reaching their optimal performance. Therefore, model-specific hyperparameter configurations were adopted, tailored to each model to ensure stable convergence and reliable performance, while keeping the overall training protocol consistent. The detailed hyperparameter settings for each model are summarized in Table 1.
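For reference, the usual closed form of a cosine annealing schedule is sketched below with the ablation settings quoted above (initial learning rate 0.002, 150 epochs). The minimum learning rate is not stated in the text, so `min_lr = 0.0` is an assumption, as is updating once per epoch; the function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, base_lr=0.002, total_epochs=150, min_lr=0.0):
    """Learning rate at a given epoch under a single cosine decay cycle."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term

# Print the schedule at the start, the midpoint, and the end of training.
for epoch in (0, 75, 150):
    print(epoch, cosine_annealing_lr(epoch))
```

The rate starts at the initial value, passes through half of it at the midpoint, and decays smoothly to the minimum at the final epoch, which avoids the abrupt drops of a step schedule.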

3.2. Evaluation Metrics

To comprehensively evaluate the performance of the proposed woodblock New Year painting style classification model, this study employs multiple evaluation metrics, including accuracy (Acc), precision (Pre), recall (Rec), and F1-score. Accuracy is used to measure the proportion of the model’s overall correct predictions, reflecting the model’s general classification capability across all samples. Precision represents the proportion of samples predicted as a specific category that actually belong to that category, and it is used to assess the accuracy of classification results. Recall is employed to measure the model’s ability to identify all samples that are actually of the positive category, reflecting the model’s detection rate. The F1-score denotes the harmonic mean of precision and recall, which is used to evaluate the balance of the classification model’s prediction results. The specific definitions of each metric are shown in Equations (15)–(18):

\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}

(15)

\mathrm{Pre} = \frac{TP}{TP + FP}

(16)

\mathrm{Rec} = \frac{TP}{TP + FN}

(17)

F1 = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}

(18)

Here, TP denotes the number of samples correctly predicted as positive class; FP represents the number of negative-class samples incorrectly classified as positive class; FN stands for the number of positive-class samples incorrectly classified as negative class; and TN indicates the number of samples correctly predicted as negative class.
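In a multi-class task, Equations (15)–(18) are typically evaluated per class in a one-vs-rest fashion and then averaged. The sketch below does this for a small hypothetical 3 × 3 confusion matrix (rows are true classes, columns are predictions). Macro averaging is an assumption here, since the averaging scheme is not stated in the text, and the function names are illustrative.

```python
def one_vs_rest_metrics(cm):
    """Per-class Acc, Pre, Rec, and F1 from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # class k predicted as another class
        fp = sum(cm[r][k] for r in range(n)) - tp   # other classes predicted as k
        tn = total - tp - fn - fp
        acc = (tp + tn) / total
        pre = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        per_class.append({"acc": acc, "pre": pre, "rec": rec, "f1": f1})
    return per_class

def macro_average(per_class, key):
    """Unweighted mean of one metric over all classes."""
    return sum(m[key] for m in per_class) / len(per_class)

# Hypothetical confusion matrix, 10 samples per class.
cm = [[8, 1, 1],
      [2, 7, 1],
      [0, 1, 9]]
per_class = one_vs_rest_metrics(cm)
sample_accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for key in ("acc", "pre", "rec", "f1"):
    print(key, macro_average(per_class, key))
print("fraction of correctly classified samples", sample_accuracy)
```

Note that the one-vs-rest accuracy of Equation (15) counts true negatives for every class, so its class average is generally higher than the fraction of correctly classified samples; the sketch prints both so the two quantities can be told apart.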

3.3. Experimental Results

The proposed method achieves an overall classification accuracy of 97.68% on the six-category woodblock New Year painting test set, which shows that the method has high discriminative ability and good generalization performance in the fine-grained woodblock New Year painting style recognition task. The performance indexes of the six categories of woodblock New Year paintings are shown in Figure 9. There are certain differences in various indexes for different style categories, which reflects the intrinsic style differences in woodblock New Year paintings in composition layout, line drawing and carving density, and decorative pattern complexity. Among them, the Fu Xiang category (FX) and Men Shen category (MS) show outstanding overall performance, with both accuracy and F1-score at a high level. This benefits from the significant characteristics of the above categories in the standardization of figure modeling, symbolic patterns, and decorative pattern structure. In contrast, the Feng Qing category (FQ) and Za Hua category (ZH) show relatively low classification accuracy, mainly because these categories cover a wide range of themes and have diverse expression forms. They lack unified paradigms in composition layout, figure posture, and decorative element combination, resulting in large intra-class differences.

As shown in Figure 10, the confusion matrix further reveals the specific recognition performance and main error sources of the six categories of woodblock New Year painting styles in the classification process. Overall, the confusion matrix shows obvious diagonal aggregation characteristics, indicating that the proposed method can accurately distinguish most samples and the model has good style discrimination ability as a whole. This demonstrates that the method significantly enhances the model’s ability to distinguish style differences in woodblock New Year paintings through the collaborative modeling of spatial structure, frequency-domain details and global correlation between feature channels.

Although the overall classification performance is relatively stable, a certain degree of confusion still exists among some style categories, mainly between categories with relatively blurred style boundaries or diverse expression forms. For example, a small number of samples in the Feng Qing category (FQ) are misclassified into the Ji Qing category (JQ) or Za Hua category (ZH). This is mainly because the Feng Qing category covers a wide range of themes and lacks unified paradigms in composition layout, figure quantity and decorative element organization, resulting in large intra-class differences. At the same time, its expression forms in festival scenes and folk narratives are visually similar to those of the Ji Qing category, which increases the difficulty of model discrimination.

In addition, a small number of misclassified samples also appear between the Xi Chu category (XC) and the Za Hua category (ZH). This phenomenon is mainly related to the similarity of the two categories in dynamic figure posture, complex scene composition and line drawing density. Especially in samples with exaggerated figure movements and rich background elements, style boundaries are prone to overlap. In contrast, the Fu Xiang category (FX) and Men Shen category (MS) show a more concentrated diagonal distribution in the confusion matrix with fewer misclassified samples. This benefits from their relatively stable and prominent style characteristics in figure modeling standardization, symmetric structure and symbolic patterns, enabling the model to form a clear discrimination basis.

Overall, the confusion matrix analysis results show that the proposed method achieves good recognition performance on most woodblock New Year painting style categories. The small number of misclassifications mainly comes from the natural similarity of different styles in theme expression and visual characteristics, which provides a reference direction for the subsequent introduction of finer style constraints or inter-category discrimination mechanisms.
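The kind of error analysis described above can be reproduced from predicted and true labels with a short script. The following is a minimal sketch with made-up labels; the class abbreviations follow the text, but their order and all counts are hypothetical.

```python
CLASSES = ["FX", "MS", "FQ", "JQ", "ZH", "XC"]  # order chosen arbitrarily

def confusion_matrix(y_true, y_pred, classes):
    """Count matrix with true classes as rows and predicted classes as columns."""
    index = {name: i for i, name in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm

def top_confusions(cm, classes, k=3):
    """Largest off-diagonal entries as (count, true_class, predicted_class)."""
    pairs = [
        (cm[i][j], classes[i], classes[j])
        for i in range(len(classes))
        for j in range(len(classes))
        if i != j and cm[i][j] > 0
    ]
    pairs.sort(key=lambda p: -p[0])  # stable sort, most frequent confusion first
    return pairs[:k]

# Hypothetical labels for ten test images.
y_true = ["FQ", "FQ", "FQ", "FQ", "XC", "XC", "FX", "MS", "JQ", "ZH"]
y_pred = ["FQ", "JQ", "JQ", "ZH", "XC", "ZH", "FX", "MS", "JQ", "ZH"]

cm = confusion_matrix(y_true, y_pred, CLASSES)
for count, true_cls, pred_cls in top_confusions(cm, CLASSES):
    print(f"{true_cls} -> {pred_cls}: {count}")
```

Sorting the off-diagonal entries makes the dominant confusion pairs explicit, which is useful when deciding where additional inter-category constraints would help most.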

3.4. Ablation Experiment

To verify the effectiveness of the proposed SA-CBAM structure-aware attention mechanism, 2D-DWT frequency-domain modeling, and Gram matrix style statistical branch in the six-category classification task of woodblock New Year paintings, systematic ablation experiments are conducted on the test set. To reduce the accidental influence caused by a single random division of the dataset, this study sets different random seeds to independently repartition the dataset and train the model in each round. The experiment is repeated five times in total. The average value of the evaluation results on the test set in each round is taken as the final performance index. The performance indicators and computational efficiency are shown in Table 2 and Table 3.
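The repeated-partition protocol can be sketched as follows. The split ratio is not given in this section, so `train_frac = 0.8` is purely an assumption, as is the use of the sample standard deviation; the accuracy values are placeholders rather than results from the paper, and the function names are hypothetical.

```python
import random
import statistics

def split_indices(n_samples, seed, train_frac=0.8):
    """Seeded random partition of sample indices into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # independent generator per seed
    cut = int(n_samples * train_frac)
    return indices[:cut], indices[cut:]

def summarize_runs(values):
    """Mean and sample standard deviation (n - 1 denominator) over runs."""
    return statistics.fmean(values), statistics.stdev(values)

# Five independent partitions of the 4043 images, one per seed.
splits = [split_indices(4043, seed) for seed in range(5)]

# Placeholder test accuracies (in percent) for the five runs.
run_accuracies = [90.0, 91.0, 92.0, 93.0, 94.0]
mean, std = summarize_runs(run_accuracies)
print(f"{mean:.2f} ± {std:.2f}%")
```

Reporting the spread alongside the mean shows how sensitive each configuration is to the particular train/test partition.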

First, when ResNeXt-50 is adopted as the baseline model, the overall accuracy is 88.72 ± 0.55%. This indicates that relying only on the deep spatial-domain semantic features of the backbone network is insufficient to fully characterize the differences in highly similar composition paradigms, line carving traces, and decorative patterns in woodblock New Year painting styles. On this basis, the accuracy improves to 96.74 ± 0.60% after the introduction of SA-CBAM. To further verify its structure perception ability, SA-CBAM is compared with the traditional CBAM module, which achieves an accuracy of 92.76 ± 0.75%. The results show that SA-CBAM more effectively highlights feature channels that contribute strongly to style discrimination and enhances the network’s response to line edges, carving contours, and local structural regions, thereby improving the discrimination stability of fine-grained style features. From the perspective of computational efficiency, SA-CBAM maintains a parameter count of 25.4 M and FLOPs of 4.27 G, which are equivalent to those of the baseline, and achieves a processing latency of 12.05 ms. This indicates that edge-enhanced spatial attention does not introduce a significant computational burden. In contrast, when only 2D-DWT is introduced, the accuracy increases to 90.97 ± 0.64%, a modest gain over the baseline. For comparison, alternative schemes for frequency-domain and texture modeling, including 2D-DCT (85.76 ± 0.82%) [27] and Differentiable LBP (88.32 ± 0.68%) [28], are also evaluated, and 2D-DWT outperforms both. Nevertheless, although simple frequency-domain decomposition can suppress noise and redundant information and highlight high-frequency details to a certain extent, its contribution to the final discrimination remains limited without visual criteria and saliency modeling for key structural regions.
In terms of computation, 2D-DWT requires a parameter count of 25.35 M and operates efficiently at a frame rate of 83.68 FPS. Similarly, the accuracy is 89.69 ± 0.57% when only the Gram matrix is used. To highlight the advantages of global correlation statistics, this variant is compared with standard feature fusion techniques, specifically Concat (86.39 ± 0.78%) and GAP (82.80 ± 0.64%). Correlation statistics between channels supplement information on overall texture distribution and decoration rules. On their own, however, they cannot characterize complex style boundaries, because spatial structure guidance and frequency-domain separation of global structures and local details are both absent. The introduction of the Gram matrix slightly increases the FLOPs to 4.46 G but still maintains a practical inference latency of 11.94 ms.

The combination experiments further show that the modules provide complementary information. When SA-CBAM is combined with single-stage 2D-DWT, the accuracy reaches 96.12 ± 0.45% with a parameter count of 25.4 M and a frame rate of 81.63 FPS. This indicates that, under the guidance of SA-CBAM, frequency-domain separation of low-frequency global structures and high-frequency local details can more effectively extract discriminative style features related to line drawing, carving traces, and pattern density, thereby improving the robustness of feature representation. When SA-CBAM is combined with the Gram matrix, the accuracy is 94.12 ± 0.48% with 4.47 G FLOPs and a frame rate of 81.70 FPS. This demonstrates that features enhanced by the attention mechanism are more discriminative in global correlation statistics. Nevertheless, because details and noise are not further separated through frequency-domain decomposition, its overall performance is still lower than that of the combination with frequency-domain modeling. The combination of 2D-DWT and the Gram matrix achieves an accuracy of 93.10 ± 0.52%. This shows that frequency-domain details and style statistical correlation provide complementary information, but that the ability to focus on key local style regions is still insufficient without structural saliency modeling.

When SA-CBAM, 2D-DWT, and the Gram matrix are introduced simultaneously, the model achieves the best performance, and its overall classification accuracy reaches 97.68 ± 0.44%. The experimental results reveal that the above three modules form effective synergies in structural information enhancement, high-frequency detail representation, and channel correlation modeling. In terms of computational efficiency, although the fusion model proposed in this study integrates multiple functional modules, its parameter count of 25.4 M and computational cost of 4.48 G FLOPs only increase slightly compared with the baseline model. Its inference latency is 12.44 ms, and the frame rate is stably maintained at 80.38 FPS. It fully balances the requirements of high-precision classification and real-time performance in practical application scenarios.
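Latency and frame rate are two views of the same measurement when images are processed one at a time. The timing harness below is a generic sketch and not the authors' benchmark code: the warm-up and repetition counts are arbitrary, single-image inference is assumed, and `dummy_inference` is a stand-in for a real forward pass.

```python
import time

def measure_latency_ms(fn, warmup=5, runs=50):
    """Mean wall-clock time of fn() in milliseconds after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / runs

def fps_from_latency(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms

def dummy_inference():
    # Placeholder workload standing in for model(image).
    return sum(i * i for i in range(10_000))

latency = measure_latency_ms(dummy_inference)
print(f"{latency:.3f} ms per call, {fps_from_latency(latency):.1f} FPS")
print(f"12.44 ms corresponds to {fps_from_latency(12.44):.2f} FPS")
```

With the reported latency of 12.44 ms, the reciprocal gives roughly 80.4 FPS, which agrees with the reported 80.38 FPS up to rounding of the latency. When timing a GPU model, the harness would also need to wait for the device to finish before reading the clock.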

3.5. Comparison with Other Classification Models

To further evaluate the performance of the proposed model in the six-category style classification task of woodblock New Year paintings, comparative experiments are carried out with a variety of mainstream classification models of different types. Consistent with the rigorous evaluation protocol established in the ablation study, all comparative models are subjected to five independent experimental runs using distinct random data partitions, with the final performance metrics derived from the averaged evaluation results. The evaluated models include traditional convolutional neural networks such as VGG19 [29] and MobileNetV2 [30], lightweight networks such as GhostNetV2 [31] and EfficientNetV2-B0 [32], and mainstream advanced architectures for complex feature modeling including ViT-Lite [33], ConvNeXt-T [34], and MobileViT-S [35]. The comparative results of performance indicators and computational efficiency are presented in Table 4 and Table 5, respectively.

In the comparative experiments, different mainstream networks show obvious performance differences in the style classification task of woodblock New Year paintings. Traditional CNN models such as VGG19 and MobileNetV2 exhibit relatively limited overall performance, with classification accuracies of 85.19 ± 0.52% and 82.91 ± 0.84%, respectively. In terms of network structure design, these models focus more on the layer-by-layer superposition of local receptive fields and lack sufficient modeling capability for complex line carving traces, decorative patterns, and cross-regional style correlations contained in woodblock New Year paintings. In addition, from the perspective of efficiency, VGG19 has a huge parameter count of 143.6 M and operates slowly at 31.1 FPS. Although MobileNetV2 runs extremely fast at 160.08 FPS, it sacrifices significant feature representation capability. Lightweight networks, including GhostNetV2 (87.28 ± 0.75%) and EfficientNetV2-B0 (84.78 ± 0.63%), achieve high inference speed through lightweight operator design and feature redundancy mining, reaching 155.72 FPS and 88.67 FPS, respectively. However, when dealing with fine-grained style features with high similarity, it is still difficult to fully represent multi-scale structural and texture information in painting styles by relying only on efficient convolution structures, which results in relatively restricted classification accuracy.

This study further analyzes high-performance mainstream models oriented to fine-grained tasks. ConvNeXt-T adopts many design choices from Vision Transformers while remaining a purely convolutional architecture. Its accuracy of 86.54 ± 0.57% indicates that pure convolutional architectures still have limitations in extracting highly similar global style paradigms of woodblock New Year paintings. The hybrid architecture MobileViT-S is designed to integrate the local perception capability of convolution operators with the global modeling advantages of Transformers, achieving efficient inference at 97.56 FPS. However, its feature representation is still limited when capturing extremely fine-grained line carving traces and local decorative differences, with an accuracy of 87.32 ± 0.92%. In contrast, ViT-Lite, with its global modeling capabilities, increases the accuracy to 90.12 ± 1.18%, which supports the important role of long-range dependency capture in style discrimination. Meanwhile, its comparatively large standard deviation of ±1.18% suggests that, without local structure guidance, this architecture lacks sufficient robustness when coping with the dense and complex textures of New Year paintings. Furthermore, the computational load of 4.6 G FLOPs and the inference frame rate of 68.97 FPS indicate that the acquisition of such global context is accompanied by a relatively high computational cost.

As shown in Figure 11 and the related tables, among all the comparison models, the method proposed in this study achieves the best performance across all evaluation metrics. The overall classification accuracy reaches 97.68 ± 0.44%, while the precision, recall, and F1-score reach 93.06 ± 0.48%, 93.07 ± 0.49%, and 93.06 ± 0.44%, respectively. Notably, the error bars in Figure 11 indicate that the proposed model maintains excellent stability, effectively overcoming the significant performance fluctuations observed in global architectures like ViT-Lite when processing complex fine-grained textures. From the perspective of computational efficiency, the proposed model achieves a practical inference latency of 12.44 ms and a stable processing speed of 80.38 FPS while maintaining a reasonable number of parameters (25.35 M) and computational cost (4.48 G FLOPs).

Compared with the aforementioned mainstream architectures, this method demonstrates significant advantages in feature discriminability, robustness, and efficiency balance. The results indicate that by introducing the SA-CBAM structure-aware attention mechanism, the network effectively compensates for the lack of local structure guidance in pure Transformer models and significantly enhances the perception of key local details such as dense carving traces. Simultaneously, combining single-stage 2D-DWT for frequency-domain feature separation and utilizing the Gram matrix to characterize global channel correlations systematically overcomes the limitations of traditional CNNs in modeling highly similar global style paradigms. This multi-dimensional feature synergy mechanism effectively balances the modeling of global style patterns and the representation of local fine textures, thereby achieving superior overall performance in the fine-grained style classification task of woodblock New Year paintings.

4. Discussion

From the comprehensive analysis of the confusion matrix and the performance metrics, it can be seen that the improved ResNeXt-50 model proposed in this study exhibits strong discrimination stability and structural sensitivity in the woodblock New Year painting style classification task. During prediction, the model focuses on regions that are highly discriminative for style, such as facial contours, line drawing and carving traces, costume patterns, and local decorative structures, while responding weakly to blank background areas, border elements, and noise regions. This shows that SA-CBAM plays a positive role in guiding the network to focus on key spatial structural features. Meanwhile, the introduction of 2D-DWT for frequency-domain feature modeling enables the model to separate low-frequency global composition information from high-frequency local details, suppressing redundant noise and enhancing the stable representation of texture and line drawing features. In addition, the style statistical modeling based on the Gram matrix further characterizes the feature distribution pattern from the perspective of channel correlation, giving the network a more global and consistent basis for discrimination at the level of overall style. The synergistic effect of these feature modeling strategies enables the model to achieve consistent and reliable discrimination performance in the six-category woodblock New Year painting style classification task.

Although the experimental results verify the effectiveness and robustness of the proposed method, certain limitations remain. First, the dataset used in the experiments is limited in scale and drawn from a relatively narrow range of sources, and no cross-dataset validation has been performed. Because high-quality, fine-grained public databases of woodblock New Year paintings are scarce both domestically and internationally, and cross-institutional data collection and expert annotation are costly, no external test set has yet been introduced for cross-source validation. Second, SA-CBAM mainly strengthens structural and contour features through edge-enhanced spatial attention and channel attention. It represents character modeling, line engraving marks, and composition boundaries well, but its ability to model style differences dominated by color gradients, pigment layers, or overall tone changes is still limited. Similarly, although 2D-DWT has clear advantages in structural texture modeling, it is less expressive for style differences dominated by color changes such as color gradients and rendering levels. Finally, the Gram matrix focuses on overall statistical correlation. When capturing extremely subtle local semantic differences, it may still need to be complemented by spatial structural features, and the fusion strategy needs to be further optimized.

In light of these limitations, future research can be advanced in several key directions. First, efforts should focus on expanding the scale and diversity of the dataset by incorporating independent catalogs and previously unsorted folk collections to establish a cross-source validation set, thereby providing a more robust evaluation of model generalization. Second, building upon the current structural and frequency-domain frameworks, more refined mechanisms for color perception or cross-channel relationship modeling could be integrated to enhance the representation of multi-dimensional style features. Furthermore, on the premise of maintaining high classification performance, lightweight optimization of the network structure should be conducted utilizing technologies such as model pruning, knowledge distillation, and inference acceleration. Given the model’s current stable processing speed of 80.38 FPS, these optimizations will further enhance its application potential and deployment feasibility in practical digital protection and intelligent analysis scenarios.

5. Conclusions

This study proposes a robust and high-precision deep learning method for woodblock New Year painting style classification based on the improved ResNeXt-50 network. The proposed method systematically integrates the SA-CBAM at the middle- and high-level feature stages. By adopting the Sobel operator to coordinate channel attention and edge-enhanced spatial attention, the model effectively strengthens the perception and modeling capability for key structural features such as character contours, line drawings, carving traces and decorative patterns. Meanwhile, single-stage 2D-DWT is introduced to separate global low-frequency structural information from local high-frequency detail features, thereby effectively suppressing redundant noise while preserving discriminative style characteristics. In addition, this study introduces a global parameter-free Gram matrix to conduct style statistical modeling on spatial–frequency fused features. It characterizes the overall style distribution from the perspective of feature channel correlation and significantly enhances the model’s capability to represent complex global style patterns.

The proposed model achieves outstanding performance on the six-category woodblock New Year painting dataset. The overall classification accuracy reaches 97.68%, and the precision, recall and F1-score reach 93.06%, 93.07% and 93.06%, respectively. Comprehensive comparative experiments show that the proposed method significantly outperforms a variety of mainstream network architectures including traditional CNNs such as VGG19 and MobileNetV2, lightweight networks such as GhostNetV2 and EfficientNetV2-B0, and advanced vision models such as ViT-Lite, ConvNeXt-T and MobileViT-S. It exhibits stronger stability and discriminative capability, especially when dealing with categories with subtle style differences and high structural similarity. Ablation experiments further verify the complementary effects of SA-CBAM, 2D-DWT and Gram matrix style statistical modeling at different feature levels. The results prove that the combination of the above modules can greatly reduce the misclassification rate among easily confused categories.

Importantly, the network maintains reasonable complexity while achieving excellent classification performance. The model parameter quantity is 25.35 M, with a stable inference speed of 80.38 FPS and a single-image latency of only 12.44 milliseconds. In summary, the method proposed in this paper effectively improves the representation capability of woodblock New Year painting style features while maintaining reasonable network complexity. It provides a feasible, robust and efficient technical solution for the digital classification and intelligent analysis of intangible cultural heritage images. The method has important application and promotion value, and lays a solid foundation for cross-dataset verification and model lightweight deployment in future research.

Author Contributions

Conceptualization, H.W., J.D., L.W., B.S., X.C. and L.Y.; Methodology, H.W., Z.D., J.D. and L.Y.; Software, Z.D. and J.D.; Validation, X.C.; Formal analysis, B.S.; Resources, Z.D.; Data curation, J.D.; Writing–original draft, H.W.; Supervision, H.W., Z.D. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Henan Provincial Department of Education (Nos. 2022BYS045, 2021SJGLX112Y and 2025-JCZD-16u.cn). The APC was funded by grant No. 2022BYS045.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wang, S.-F.; Chen, C.-C. Revitalizing Intangible cultural heritage via derivative design: A focus on chinese woodblock printing. PLoS ONE 2025, 20, e0318807.
Lei, H. The digital protection and inheritance of Chinese woodblock new year prints. In Proceedings of the 2014 IEEE 3rd International Conference on Cloud Computing and Intelligence Systems, Shenzhen, China, 27–29 November 2014; pp. 277–281.
Lu, C. Study on Yangjiabu Village Woodblock Spring Festival Paintings-Culture Tourism Under Vision of Intangible Cultural Heritage. In Proceedings of the 2017 International Conference on Education, Economics and Management Research (ICEEMR 2017), Singapore, 29–31 May 2017; pp. 315–317.
Obeso, A.M.; Benois-Pineau, J.; Vázquez, M.G.; Acosta, A.R. Saliency-based selection of visual content for deep convolutional neural networks: Application to architectural style classification. Multimed. Tools Appl. 2019, 78, 9553–9576.
Sun, M.; Zhang, F.; Duarte, F.; Ratti, C. Understanding architecture age and style through deep learning. Cities 2022, 128, 103787.
Chen, S.; Bao, L. Research on classification of woodblock New Year pictures based on ResNet improved model. In Proceedings of the 2024 5th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 19–21 April 2024; pp. 603–607.
Nazir, A.; Ashraf, R.; Hamdani, T.; Ali, N. Content based image retrieval system by using HSV color histogram, discrete wavelet transform and edge histogram descriptor. In Proceedings of the 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, Pakistan, 3–4 March 2018; pp. 1–6.
Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260.
Chandra, M.A.; Bedi, S.S. Survey on SVM and their application in image classification. Int. J. Inf. Technol. 2021, 13, 1–11.
Song, Y.Y.; Lu, Y. Decision tree methods: Applications for classification and prediction. Shanghai Arch. Psychiatry 2015, 27, 130.
Fang, Y.; Jiang, Z.; Zhang, R.; Gu, L. Construction and Design Application of Taohuawu New Year Woodblock Prints Color Database. In Cross-Cultural Design, Proceedings of the International Conference on Human-Computer Interaction, Copenhagen, Denmark, 23–28 July 2023; Springer Nature Switzerland: Cham, Switzerland, 2023; pp. 38–51.
Affonso, C.; Rossi, A.L.D.; Vieira, F.H.A.; de Leon Ferreira, A.C.P. Deep learning for biological image classification. Expert Syst. Appl. 2017, 85, 114–122.
Obaid, K.B.; Zeebaree, S.; Ahmed, O.M. Deep learning models based on image classification: A review. Int. J. Sci. Bus. 2020, 4, 75–81.
Wang, Y.; Lin, X. A study of the classification, protection, and inheritance of woodblock New Year painting using a deep learning image recognition model. In Proceedings of the Fourth International Conference on Electronics Technology and Artificial Intelligence (ETAI 2025), Harbin, China, 21–23 February 2025; pp. 902–909.
Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53.
Chauhan, R.; Ghanshala, K.K.; Joshi, R.C. Convolutional neural network (CNN) for image detection and recognition. In Proceedings of the 1st International Conference on Secure Cyber Computing and Communications (ICSCCC), Jalandhar, India, 15–17 December 2018; pp. 278–282.
Hossain, M.A.; Sajib, M.S.A. Classification of image using convolutional neural network (CNN). Glob. J. Comput. Sci. Technol. 2019, 19, 13–14.
Wei, X.; Li, W.; Zhang, M.; Li, Q. Medical hyperspectral image classification based on end-to-end fusion deep neural network. IEEE Trans. Instrum. Meas. 2019, 68, 4481–4492.
Schwartz, E.; Giryes, R.; Bronstein, A.M. DeepISP: Toward learning an end-to-end image processing pipeline. IEEE Trans. Image Process. 2018, 28, 912–923.
Gordo, A.; Almazan, J.; Revaud, J.; Larlus, D. End-to-end learning of deep visual representations for image retrieval. Int. J. Comput. Vis. 2017, 124, 237–254.
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Han, L.; Tian, Y.; Qi, Q. Research on edge detection algorithm based on improved sobel operator. In MATEC Web of Conferences; EDP Sciences: Les Ulis, France, 2020; Volume 309, p. 03031.
Ayyavoo, T.; John Suseela, J. Illumination pre-processing method for face recognition using 2D DWT and CLAHE. IET Biom. 2018, 7, 380–390.
Sastry, C.S.; Oore, S. Detecting out-of-distribution examples with gram matrices. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020; pp. 8491–8501. Available online: https://proceedings.mlr.press/v119/sastry20a.html (accessed on 10 May 2026).
Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 783–792. [Google Scholar] [CrossRef]
Juefei-Xu, F.; Boddeti, V.N.; Savvides, M. Local binary convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 19–28. [Google Scholar] [CrossRef]
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetV2: Enhance cheap operation with long-range attention. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; pp. 9969–9982. [Google Scholar]
Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 10096–10106. Available online: https://proceedings.mlr.press/v139/tan21a.html (accessed on 10 May 2026).
Hassani, A.; Walton, S.; Shah, N.; Abuduweili, A.; Li, J.; Shi, H. Escaping the big data paradigm with compact transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Virtual, 19–25 June 2021. [Google Scholar]
Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986. [Google Scholar] [CrossRef]
Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]

Figure 1. Examples of the six woodblock New Year painting styles.

Figure 2. ResNeXt-50 module.

Figure 3. SA-CBAM network structure.

Figure 4. CAM network structure.

Figure 5. EESAM network structure.

Figure 6. Visual examples of 2D-DWT decomposition for six categories of woodblock New Year paintings.

Figure 7. Illustration of Gram matrix-based style modeling.

Figure 8. Woodblock New Year painting style classification network.

Figure 9. Performance indicators for the six categories of woodblock New Year paintings.

Figure 10. Confusion matrix for the six categories of woodblock New Year paintings.

Figure 11. Performance comparison of different algorithms (numbers ① to ⑧ represent GhostNetV2, VGG19, ViT-Lite, MobileNetV2, EfficientNetV2-B0, ConvNeXt-T, MobileViT-S, and the proposed model, respectively).

Table 1. Detailed optimization hyperparameter settings for different evaluated models.

Model	Optimizer	Initial Learning Rate	Weight Decay
GhostNetV2	SGD	0.002	1 × 10⁻⁴
VGG19	SGD	0.002	1 × 10⁻⁴
ViT-Lite	AdamW	1 × 10⁻⁴	0.05
MobileNetV2	SGD	0.002	1 × 10⁻⁴
EfficientNetV2-B0	SGD	0.002	1 × 10⁻⁴
ConvNeXt-T	AdamW	1 × 10⁻⁴	0.05
MobileViT-S	AdamW	1 × 10⁻⁴	0.05
Ours	SGD	0.002	1 × 10⁻⁴

Table 2. Results of the ablation experiment.

Model	Accuracy/%	Precision/%	Recall/%	F1/%
ResNeXt-50	88.72 ± 0.55	85.93 ± 0.71	83.76 ± 0.76	84.83 ± 0.75
+SA-CBAM	96.74 ± 0.60	90.22 ± 0.52	90.21 ± 0.58	90.21 ± 0.53
+2D-DWT	90.97 ± 0.64	87.85 ± 0.62	86.79 ± 0.68	87.32 ± 0.60
+Gram Matrix	89.69 ± 0.57	86.98 ± 0.62	85.34 ± 0.61	86.15 ± 0.56
+CBAM	92.76 ± 0.75	88.95 ± 0.73	87.33 ± 0.77	88.13 ± 0.83
+2D-DCT	85.76 ± 0.82	82.84 ± 0.79	84.19 ± 0.83	83.51 ± 0.89
+Differentiable LBP	88.32 ± 0.68	85.74 ± 0.63	86.58 ± 0.66	86.16 ± 0.72
+Concat	86.39 ± 0.78	83.17 ± 0.73	81.54 ± 0.75	82.35 ± 0.82
+GAP	82.80 ± 0.64	77.60 ± 0.87	78.90 ± 0.85	78.20 ± 0.84
+SA-CBAM + 2D-DWT	96.12 ± 0.45	90.79 ± 0.51	90.48 ± 0.49	90.63 ± 0.48
+SA-CBAM + Gram Matrix	94.12 ± 0.48	89.74 ± 0.55	89.91 ± 0.53	89.82 ± 0.52
+2D-DWT + Gram Matrix	93.10 ± 0.52	88.65 ± 0.57	86.54 ± 0.59	87.58 ± 0.54
+SA-CBAM + 2D-DWT + Gram Matrix	97.68 ± 0.44	93.06 ± 0.48	93.07 ± 0.49	93.06 ± 0.44
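
The F1 column in Table 2 can be cross-checked against the precision and recall columns. The sketch below assumes that F1 is the harmonic mean of the reported mean precision and recall, which is the usual definition; the table itself does not state how the values were averaged across classes and runs, so small rounding differences are possible. The function name `f1_from_pr` and the selection of rows are illustrative choices, not part of the original method.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from three rows of Table 2.
rows = {
    "ResNeXt-50": (85.93, 83.76, 84.83),
    "+SA-CBAM": (90.22, 90.21, 90.21),
    "+SA-CBAM + 2D-DWT + Gram Matrix": (93.06, 93.07, 93.06),
}

for name, (p, r, f1_reported) in rows.items():
    f1 = f1_from_pr(p, r)
    # Compare the recomputed value with the value printed in the table.
    print(f"{name}: computed {f1:.2f}, reported {f1_reported:.2f}")
```

For these three rows the recomputed values agree with the reported F1 to within 0.01 percentage points.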

Table 3. Computational efficiency of the ablation experiment.

Model	Params (M)	FLOPs (G)	GPU Mem (MB)	Latency (ms)	FPS (img/s)
ResNeXt-50	25.3	4.24	1285	11.75	85.11
+SA-CBAM	25.4	4.27	1291	12.05	82.99
+2D-DWT	25.35	4.27	1290	11.95	83.68
+Gram Matrix	25.35	4.46	1288	11.94	83.75
+CBAM	25.34	4.26	1289	11.98	83.47
+2D-DCT	25.3	4.26	1293	12.15	82.3
+Differentiable LBP	25.3	4.28	1305	12.35	80.97
+Concat	25.3	4.24	1286	11.8	84.75
+GAP	25.3	4.24	1285	11.76	85.03
+SA-CBAM + 2D-DWT	25.4	4.28	1296	12.25	81.63
+SA-CBAM + Gram Matrix	25.4	4.47	1294	12.24	81.7
+2D-DWT + Gram Matrix	25.35	4.47	1293	12.14	82.37
+SA-CBAM + 2D-DWT + Gram Matrix	25.4	4.48	1299	12.44	80.38
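
The latency and FPS columns in Table 3 appear to be related by FPS ≈ 1000 / latency (ms). The sketch below checks this under the assumption that both were measured per image at batch size 1, which the table does not state explicitly. It also computes the relative latency overhead of the full model over the ResNeXt-50 baseline. The helper name `fps_from_latency` is hypothetical.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Throughput in images per second for a given per-image latency in ms."""
    return 1000.0 / latency_ms


# (latency in ms, reported FPS) copied from three rows of Table 3.
rows = {
    "ResNeXt-50": (11.75, 85.11),
    "+SA-CBAM": (12.05, 82.99),
    "+SA-CBAM + 2D-DWT + Gram Matrix": (12.44, 80.38),
}

for name, (latency_ms, fps_reported) in rows.items():
    fps = fps_from_latency(latency_ms)
    print(f"{name}: computed {fps:.2f} img/s, reported {fps_reported:.2f} img/s")

# Relative latency overhead of the full model compared with the baseline.
baseline_ms = rows["ResNeXt-50"][0]
full_ms = rows["+SA-CBAM + 2D-DWT + Gram Matrix"][0]
overhead_pct = 100.0 * (full_ms - baseline_ms) / baseline_ms
print(f"Latency overhead of the full model: {overhead_pct:.1f}%")
```

Under this assumption, the three added modules together increase per-image latency by roughly 6% relative to the baseline.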

Table 4. Results of the comparative experiment.

Model	Accuracy/%	Precision/%	Recall/%	F1/%
GhostNetV2	87.28 ± 0.75	85.49 ± 0.93	84.67 ± 0.90	85.08 ± 0.82
VGG19	85.19 ± 0.52	83.36 ± 0.55	82.48 ± 0.59	82.91 ± 0.58
ViT-Lite	90.12 ± 1.18	87.41 ± 1.20	86.03 ± 1.41	86.71 ± 1.25
MobileNetV2	82.91 ± 0.84	80.22 ± 0.85	80.73 ± 0.74	80.47 ± 0.91
EfficientNetV2-B0	84.78 ± 0.63	82.54 ± 0.64	83.12 ± 0.67	82.83 ± 0.67
ConvNeXt-T	86.54 ± 0.57	83.47 ± 0.56	84.72 ± 0.82	84.09 ± 0.72
MobileViT-S	87.32 ± 0.92	84.21 ± 1.00	85.64 ± 1.09	84.92 ± 1.08
Ours	97.68 ± 0.44	93.06 ± 0.48	93.07 ± 0.49	93.06 ± 0.44
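
To make the size of the gap in Table 4 concrete, the following sketch finds the strongest baseline by mean accuracy and computes the margin of the proposed model over it. The accuracies are the mean values copied from the table; the variable names are illustrative, and the standard deviations are ignored here, so this is a descriptive comparison rather than a significance test.

```python
# Mean accuracy (%) of each comparison model, copied from Table 4.
baseline_accuracy = {
    "GhostNetV2": 87.28,
    "VGG19": 85.19,
    "ViT-Lite": 90.12,
    "MobileNetV2": 82.91,
    "EfficientNetV2-B0": 84.78,
    "ConvNeXt-T": 86.54,
    "MobileViT-S": 87.32,
}
ours_accuracy = 97.68

# Strongest baseline by mean accuracy.
best_name = max(baseline_accuracy, key=baseline_accuracy.get)
best_accuracy = baseline_accuracy[best_name]

# Margin of the proposed model, in percentage points.
margin = round(ours_accuracy - best_accuracy, 2)
print(f"Best baseline: {best_name} ({best_accuracy:.2f}%)")
print(f"Margin of the proposed model: {margin:.2f} percentage points")
```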

Table 5. Computational efficiency of the comparative experiment.

Model	Params (M)	FLOPs (G)	GPU Mem (MB)	Latency (ms)	FPS (img/s)
GhostNetV2	5.2	0.17	650	6.42	155.72
VGG19	143.6	19.6	3850	32.15	31.1
ViT-Lite	22	4.6	1420	14.5	68.97
MobileNetV2	3.5	0.3	610	6.25	160.08
EfficientNetV2-B0	7.1	0.72	800	11.28	88.67
ConvNeXt-T	28.6	4.5	1380	13.8	72.46
MobileViT-S	5.6	2	1150	10.25	97.56
Ours	25.35	4.48	1299	12.44	80.38
