Electronics
  • Article
  • Open Access

Published: 19 October 2023

EDKSANet: An Efficient Dual-Kernel Split Attention Neural Network for the Classification of Tibetan Medicinal Materials

1 School of Information Science and Technology, Tibet University, Lhasa 850000, China
2 National Experimental Teaching Demonstration Center for Information Technology, Tibet University, Lhasa 850000, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Applications of Artificial Intelligence in Computer Vision

Abstract

Tibetan medicine has received wide acclaim for its unique diagnosis and treatment methods. The identification of Tibetan medicinal materials, which are a vital component of Tibetan medicine, is a key research area in this field. However, traditional deep learning-based visual neural networks face significant challenges in efficiently and accurately identifying Tibetan medicinal materials due to their large number, complex morphology, and the scarcity of public visual datasets. To address this issue, we constructed a computer vision dataset with 300 Tibetan medicinal materials and proposed a lightweight and efficient cross-dimensional attention mechanism, the Dual-Kernel Split Attention (DKSA) module, which can adaptively share kernel parameters in both the spatial and channel dimensions. Based on the DKSA module, we achieve an efficient unification of convolution and self-attention under the CNN architecture and develop a new lightweight backbone architecture, EDKSANet, to provide enhanced performance for various computer vision tasks. Compared to RedNet, top-1 accuracy is improved by 1.2% on the ImageNet dataset, and larger margins of +1.5 box AP for object detection and +1.3 mask AP for instance segmentation are obtained on the MS-COCO dataset. Moreover, EDKSANet achieved excellent classification performance on the Tibetan medicinal materials dataset, with an accuracy of up to 96.85%.

1. Introduction

Tibetan medicine is not only a treasure of the traditional culture of the Chinese nation but also a shining pearl in the treasure house of world medicine. Tibetan medicinal materials are the main carrier of clinical treatment in Tibetan medicine and have unique curative effects on many diseases. Since the authenticity and quality of Tibetan medicinal materials are directly related to the safety and effectiveness of clinical medication, objective and scientific identification of Tibetan medicinal materials is essential. Traditional identification methods usually rely on technologies such as infrared spectroscopy, fingerprinting, and chemical pattern recognition [,], but these methods have significant drawbacks: they require expert experience for manual feature design or extraction, and the early-stage workload is very large. The resulting models achieve high accuracy on the training set but generalize poorly to the test set. Furthermore, these methods depend on specific technologies, requiring substantial specialized knowledge and equipment, which may limit their practical application.
In contrast, neural networks [] have a strong nonlinear mapping capability, which can effectively address the shortcomings of traditional methods, particularly in modeling the relationships between features. They are also better able to automatically extract complex image features, reducing the burden of manual feature extraction. However, the informatization of Tibetan medicinal materials currently lags behind, so public standard datasets for image recognition are lacking and data scarcity has become a significant challenge. Therefore, there is an urgent need to build a large-scale computer vision dataset of Tibetan medicinal materials to support the training and evaluation of deep learning models.
According to the Chinese Tibetan Medicinal Materials Compendium, Tibetan medicinal materials include 928 species of plants, 174 species of animals, and 140 species of minerals, totaling more than 1200 species. Collecting and labeling data for so many types of medicinal materials requires significant time and labor. Additionally, each type of medicinal material has a unique morphology, color, texture, and set of characteristics, which makes the task complex. At the same time, traditional deep learning networks may become bulky when processing large-scale data and require substantial hardware resources. In addition, in practical applications, images of Tibetan medicinal materials may be subject to various limitations, such as shooting conditions, angle, and resolution, which may affect the reliability of the recognition task. To overcome these challenges, we need to adopt innovative computer vision techniques, especially lightweight and efficient network structures, to improve the accuracy and robustness of Tibetan medicinal material recognition and achieve good performance in practical applications.
Convolutional neural networks (CNNs) [] set off a wave of visual deep learning and are widely used in research fields such as image classification, target detection, instance segmentation, and semantic segmentation [,,,,,,,], which fully demonstrates the effectiveness of CNNs in computer vision. A CNN has two main properties: spatial invariance and channel specificity. Spatial invariance means that the CNN uses a shared convolution kernel to process different positions in the image, thereby reducing the number of parameters and ensuring the translation equivariance of visual features. Channel specificity means that the convolution kernels in different output channels capture different semantic information, improving the learning ability of the model. However, convolution cannot model long-distance relationships between different visual elements in the spatial dimension, and there is information redundancy between different output channels. Involution [], the inverse technique, was therefore proposed. These two techniques focus on different aspects: convolution emphasizes the interaction between channels, while involution focuses on modeling information within the spatial range. The former implements parameter sharing in the spatial dimension, while the latter implements parameter sharing in the channel dimension; both are extreme forms of kernel parameter sharing. Achieving adaptive kernel parameter sharing across the spatial and channel dimensions to balance speed and accuracy is therefore a meaningful research direction.
In previous studies, the Bottleneck Attention Module (BAM) [] and the Convolutional Block Attention Module (CBAM) [] effectively combined spatial and channel attention, enhanced the calibration ability of these two dimensions, and realized adaptive feature optimization. However, two important challenges remain. On the one hand, although these modules can effectively focus on local information, they still need to efficiently aggregate contextual semantic information at different scales to overcome the difficulty of capturing long-range dependencies. On the other hand, although using the same convolution kernel within a batch reduces the computational cost, it sacrifices the performance and feature expression ability of the model. Efficient, low-cost dynamic kernels have therefore been proposed to address this problem. EPSANet [], DMSANet [], and Res2Net [] are based on multi-scale representation and the capture of long-distance features, while CondConvNet [], Dynamic ConvNet [], and Deformable ConvNet [] use dynamic kernels for feature extraction. However, the network architectures described above focus on designing complex attention modules to perform specific functions, inevitably leading to a significant increase in model complexity. To overcome these shortcomings, this paper proposes a high-performance and low-cost Dual-Kernel Split Attention (DKSA) module. We replaced the 3 × 3 convolution kernel in the ResNet [] bottleneck block with the DKSA module to obtain an Efficient Dual-Kernel Split Attention (EDKSA) block, and stacked EDKSA blocks in ResNet style to form a new lightweight backbone, EDKSANet. As shown in Figure 1, EDKSANet achieves better top-1 accuracy and is more efficient in parameter usage. The main contributions of this work are summarized as follows:
Figure 1. Comparing the accuracy of different methods. The circle area reflects the network parameters and FLOPs of different models.
  • A total of 300 common Tibetan medicinal materials were photographed, and with the help of data enhancement strategies, a computer vision dataset of Tibetan medicinal materials was successfully constructed. This provides a new opportunity for the further use of visual deep learning to scientifically identify Tibetan medicinal materials and promote the development of the Tibetan medicinal material industry.
  • A novel Efficient Dual-Kernel Split Attention (EDKSA) block is proposed, which constructs a dynamic kernel that extracts features using reciprocal filters at different scales, effectively fuses symmetric feature information at multiple scales, and generates long-range dependencies in channel interactions and spatially adaptive relationships, thereby increasing the richness and accuracy of feature representation. The block is also highly flexible and scalable and can be directly embedded into network architectures for various computer vision tasks.
  • A novel lightweight backbone architecture, EDKSANet, is proposed. It achieves an efficient unification of convolution and self-attention under the CNN architecture. EDKSANet can learn richer symmetric feature maps and dynamically calibrate the modeling between symmetric dimensions according to task requirements, thereby improving the learning and generalization capabilities of the model. Extensive experiments verify that EDKSANet not only significantly improves performance on image classification, target detection, and instance segmentation tasks on the ImageNet and MS COCO datasets but also achieves excellent results on the classification of Tibetan medicinal materials.

3. Method

In this section, we will provide a comprehensive overview of the Dual-Kernel Split Attention (DKSA) module, offering an in-depth explanation of its construction steps.

3.1. DKSA Module

The motivation of this work is to build a more efficient and effective cross-dimensional interactive attention mechanism for complementary features. To this end, we propose a novel Dual-Kernel Split Attention (DKSA) module, as shown in Figure 2. The DKSA module is realized in the following five steps. First, the proposed Split Transform Concat (STC) module produces multi-scale symmetric information feature maps. Second, a multilayer perceptron (MLP) [] is used to extract the attention vectors of the symmetric feature maps, yielding the channel attention vectors of the complementary kernels. Next, the channel attention vectors are recalibrated with a cross-dimensional attention mechanism to obtain the recalibrated weights of the symmetric feature maps. The calibrated weights are then multiplied element-wise by the corresponding feature maps. Finally, the interaction of symmetric information between groups is completed through a channel shuffle.
Figure 2. The structure of the proposed Dual-Kernel Split Attention (DKSA) module.
The basic module used within the DKSA module to extract multi-scale symmetric feature information is the STC module. Its structure is shown in Figure 3, and it is realized by three operators: Split, Transform, and Concat.
Figure 3. A detailed illustration of the proposed Split Transform Concat (STC) module.
Split: Any feature map $X \in \mathbb{R}^{C \times H \times W}$ is divided into 2 groups along the channel dimension, denoted as $[X_0, X_1]$. Each group has $C' = C/2$ channels, where $C$ must be divisible by 2; the feature map groups are then $X_i \in \mathbb{R}^{C' \times H \times W}, i = 0, 1$.
Transform: Based on the split, we process the input tensors in parallel using complementary filters of different scales. However, as the kernel size increases, the number of parameters also increases significantly, so we adopt group convolution to perform two transformations: $\tilde{F}: F_0 = \mathrm{Conv}|\mathrm{Invo}(K_0 \times K_0, G_0)(X_0)$ and $\hat{F}: F_1 = \mathrm{Invo}|\mathrm{Conv}(K_1 \times K_1, G_1)(X_1)$. A symmetric feature map generated by the complementary filters is thus obtained. Here, $\mathrm{Conv}$ and $\mathrm{Invo}$ denote the convolution kernel and the involution kernel, respectively, $K$ denotes the receptive field (kernel size), and $G$ denotes the number of groups. The symbol $|$ indicates a selection: if $\tilde{F}$ performs $\mathrm{Conv}$, then $\hat{F}$ performs $\mathrm{Invo}$, and vice versa. The relationship between $K$ and $G$ is given in Formula (1).
$$G = \begin{cases} 4 \times (K + 1), & \mathrm{Conv} \\ \dfrac{C}{4 \times (K + 1)}, & \mathrm{Invo} \end{cases} \qquad (1)$$
The generation method of the 3 × 3 $\mathrm{Invo}$ kernel follows RedNet [], while the 7 × 7 $\mathrm{Invo}$ kernel is realized as a 3 × 3 kernel with a dilation rate of 2 for the nonlinear transformation rather than focusing on a single pixel, which better captures distant pixel information and low-level features in the image and enhances the model's perception. By default, $\tilde{F}: F_0 = \mathrm{Conv}(3 \times 3, 32)(X_0)$ and $\hat{F}: F_1 = \mathrm{Invo}(7 \times 7, C/16)(X_1)$ are performed, where the 7 × 7 kernel is likewise realized as a 3 × 3 convolution with a dilation rate of 2.
Concat: Concatenating $F_0$ and $F_1$ along the channel dimension yields $F \in \mathbb{R}^{C \times H \times W}$, a multi-scale symmetric pre-processed feature map, where $C$, $H$, and $W$ denote the channel, height, and width, respectively.
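To make the Split, Transform, and Concat steps concrete, the following PyTorch sketch shows one possible realization of the STC module under the default setting ($F_0 = \mathrm{Conv}(3 \times 3, 32)$, $F_1 = \mathrm{Invo}(7 \times 7, \cdot)$, with the 7 × 7 receptive field realized as a dilated 3 × 3 kernel). The `Involution2d` class follows the general RedNet formulation cited above; all class names, argument names, and default values are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Involution2d(nn.Module):
    """Minimal involution: the kernel is generated from the input itself and is
    shared across a group of channels (our reading of the RedNet formulation)."""

    def __init__(self, channels, kernel_size=3, dilation=1, groups=None, reduction=4):
        super().__init__()
        self.k, self.dilation = kernel_size, dilation
        self.groups = groups or max(channels // 16, 1)
        self.pad = dilation * (kernel_size - 1) // 2
        # Kernel-generation branch: 1x1 reduce -> 1x1 span to K*K weights per group.
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * self.groups, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # One K*K spatial kernel per group and per position: (B, G, K*K, H, W).
        kernel = self.span(self.reduce(x)).view(b, self.groups, self.k * self.k, h, w)
        # Neighbourhoods of the input: (B, G, C//G, K*K, H, W).
        patches = F.unfold(x, self.k, dilation=self.dilation, padding=self.pad)
        patches = patches.view(b, self.groups, c // self.groups, self.k * self.k, h, w)
        # Position-wise weighted sum over each K*K neighbourhood.
        return (kernel.unsqueeze(2) * patches).sum(dim=3).view(b, c, h, w)


class STC(nn.Module):
    """Split-Transform-Concat: split the channels into two halves, apply the two
    complementary filters, and concatenate the results."""

    def __init__(self, channels, conv_groups=32):
        super().__init__()
        half = channels // 2                       # C' = C / 2
        self.conv_branch = nn.Conv2d(half, half, 3, padding=1,
                                     groups=conv_groups, bias=False)
        # 7x7 receptive field realized as a 3x3 involution with dilation 2;
        # applying the paper's C/16 group rule to the half channels is our assumption.
        self.invo_branch = Involution2d(half, kernel_size=3, dilation=2,
                                        groups=max(half // 16, 1))

    def forward(self, x):
        x0, x1 = torch.chunk(x, 2, dim=1)                     # Split
        f0, f1 = self.conv_branch(x0), self.invo_branch(x1)   # Transform
        return torch.cat([f0, f1], dim=1)                     # Concat


if __name__ == "__main__":
    y = STC(256)(torch.randn(2, 256, 56, 56))
    print(y.shape)  # torch.Size([2, 256, 56, 56])
```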
The MLP consists of two parts: a squeeze layer and an excitation layer. The former encodes global information, and the latter adaptively recalibrates channel relations. To incorporate global spatial information into the channel descriptors, we use the simplest form of global average pooling (GAP), as given in Equation (2). The attention weight $W_c$ of the $c$-th channel in the MLP is given in Equation (3).
$$G_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j) \qquad (2)$$
$$W_c = \sigma\left(W_1\,\delta\left(W_0(G_c)\right)\right) \qquad (3)$$
The Rectified Linear Unit (ReLU) [] operator is represented by $\delta$, the weight matrices of the MLP layers are denoted by $W_0 \in \mathbb{R}^{C \times \frac{C}{r}}$ and $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, and the sigmoid activation function is denoted by $\sigma$. The MLP can adaptively assign channel attention weights to $F \in \mathbb{R}^{C \times H \times W}$, perform cross-channel interaction, and fuse context information. $Z_i \in \mathbb{R}^{C' \times 1 \times 1}$ is the attention-weight vector of each of the two complementary filters, as given in Equation (4), where the attention values $Z_i$ are derived from the feature map $F_i$.
$$Z_i = \mathrm{MLP}(F_i), \quad i = 0, 1 \qquad (4)$$
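Equations (2)–(4) correspond to a squeeze-and-excitation style MLP applied to each branch feature map $F_i$. A minimal sketch, with the reduction ratio `r` as an assumed hyperparameter and our own naming:

```python
import torch
import torch.nn as nn


class ChannelMLP(nn.Module):
    """Squeeze (global average pooling, Eq. (2)) followed by excitation
    (two fully connected layers with ReLU and sigmoid, Eq. (3))."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # G_c in Eq. (2)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r),      # W_0
            nn.ReLU(inplace=True),                   # delta
            nn.Linear(channels // r, channels),      # W_1
            nn.Sigmoid(),                            # sigma
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        g = self.squeeze(f).view(b, c)
        return self.excite(g).view(b, c, 1, 1)       # Z_i of Eq. (4)


# Usage: apply the same MLP form to each branch to obtain Z_0 and Z_1.
mlp = ChannelMLP(128)
z = mlp(torch.randn(2, 128, 56, 56))                 # shape (2, 128, 1, 1)
```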
Next, a cross-dimensional soft attention mechanism is used to recalibrate the weights of the two types of complementary filters, adaptively adjusting the modeling of the relationships between the different dimensions. In Equations (5) and (6), Softmax is used to obtain the recalibrated channel attention weights $att_i, i = 0, 1$. This effectively enables the interaction of local and global attention between the different filters, forming long-range cross-dimensional channel dependencies.
$$att_0 = \mathrm{Softmax}(Z_0) = \frac{\exp(Z_0)}{\exp(Z_0) + \exp(Z_1)} \qquad (5)$$
$$att_1 = \mathrm{Softmax}(Z_1) = \frac{\exp(Z_1)}{\exp(Z_0) + \exp(Z_1)} \qquad (6)$$
Next, the filter channel attention weights for the feature rescaling are cascaded and fused using the ⊕ operator to obtain the entire channel attention vector, att, as shown in Equation (7).
$$att = att_0 \oplus att_1 \qquad (7)$$
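Equations (5)–(7) normalize the two branch attention vectors against each other, channel by channel. A sketch, assuming `z0` and `z1` are the MLP outputs from Equation (4):

```python
import torch


def recalibrate(z0, z1):
    """Cross-branch softmax of Eqs. (5)-(6) and concatenation of Eq. (7):
    for every channel, att0 + att1 = 1."""
    z = torch.stack([z0, z1], dim=0)        # (2, B, C', 1, 1) -- branch axis first
    att = torch.softmax(z, dim=0)           # softmax over the two branches
    att0, att1 = att[0], att[1]
    return att0, att1, torch.cat([att0, att1], dim=1)   # att = att_0 (+) att_1


att0, att1, att = recalibrate(torch.randn(2, 128, 1, 1), torch.randn(2, 128, 1, 1))
```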
We then multiply the recalibrated filter attention weights by the corresponding feature maps $F_i$ using the ⊙ operator to obtain the feature maps $Y_i$ weighted by the complementary filter channel attention, as expressed in Equation (8).
$$Y_i = F_i \odot att_i, \quad i = 0, 1 \qquad (8)$$
Finally, to maintain the integrity of the feature representation and avoid corrupting the information of the original feature maps, the concatenation operator Cat, which is more efficient here than the summation operator, is used. In addition, a channel shuffle Ⓢ is applied to enable feature interaction between the complementary filters. The final enriched output feature map can be expressed as in Equation (9).
$$output = \mathrm{Cat}\left(\left[Y_0, Y_1\right]\right) \qquad (9)$$
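The reweighting, concatenation, and channel shuffle of Equations (8) and (9) can be sketched as follows; the shuffle is our ShuffleNet-style reading of the Ⓢ operator, interleaving the channels of the two branches:

```python
import torch


def channel_shuffle(x, groups=2):
    """Interleave channels across groups so the two branches exchange information."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(b, c, h, w))


def fuse(f0, f1, att0, att1):
    """Eq. (8): Y_i = F_i * att_i; Eq. (9): concatenation followed by a shuffle."""
    y0, y1 = f0 * att0, f1 * att1
    return channel_shuffle(torch.cat([y0, y1], dim=1), groups=2)


out = fuse(torch.randn(2, 128, 56, 56), torch.randn(2, 128, 56, 56),
           torch.rand(2, 128, 1, 1), torch.rand(2, 128, 1, 1))  # (2, 256, 56, 56)
```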
Based on the above design, the DKSA module effectively implements the adaptive parameter sharing of a kernel in terms of spatial extent and channels, fuses the characteristics of complementary kernels at different scales, and outputs a richer feature information map. In addition, the module integrates cross-dimensional attention into each feature group block, enhancing the association between local features and global information and improving the representational capability of the model.

3.2. Network Design

As shown in Figure 4, the EDKSA block is obtained by replacing the 3 × 3 convolutional kernel in the ResNet bottleneck block with the DKSA module. The EDKSA block can adaptively model the key information in the feature map, enabling better fusion and expression of the complementary features extracted by the dynamic kernel and thus increasing the richness of feature expression. A new lightweight backbone architecture, EDKSANet, is designed in this paper by stacking EDKSA blocks in ResNet style. EDKSANet can dynamically calibrate the modeling of symmetric inter-dimensional relationships to the needs of different visual tasks, improving the synergy and complementarity of feature information and capturing long-range dependencies. EDKSANet-50 has two configurations, EDKSANet-50(1) and EDKSANet-50(2), as shown in Table 1.
Figure 4. Block description and comparison of ResNet, RedNet, and EDKSANet, where the two structures of DKSA Module, DKSA Module (1), and DKSA Module (2) are described in Table 1.
Table 1. Design of the proposed EDKSANet.
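Structurally, the EDKSA block is a standard ResNet bottleneck whose middle 3 × 3 convolution is swapped for the DKSA module. The sketch below passes the mixer in as a factory, so that `nn.Conv2d` reproduces the plain bottleneck and a DKSA module assembled from the sketches in Section 3.1 would reproduce the EDKSA block; the layer names and the factory interface are our own, not the authors' code.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 reduce -> mixer -> 1x1 expand, with residual."""
    expansion = 4

    def __init__(self, in_ch, width, mixer_factory, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, width, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)
        # In EDKSANet, this mixer is the DKSA module instead of a 3x3 convolution.
        self.mixer = mixer_factory(width, stride)
        self.bn2 = nn.BatchNorm2d(width)
        self.conv3 = nn.Conv2d(width, width * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(width * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.mixer(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)


# Plain ResNet bottleneck; swapping the lambda for a DKSA module yields an EDKSA block.
conv3x3 = lambda w, s: nn.Conv2d(w, w, 3, stride=s, padding=1, bias=False)
block = Bottleneck(256, 64, conv3x3)
y = block(torch.randn(2, 256, 56, 56))  # (2, 256, 56, 56)
```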

4. Experiments

To validate the performance of EDKSANet, image classification, target detection, and instance segmentation were first tested on the publicly available datasets ImageNet [] and MS-COCO []. Next, experiments on the computer vision dataset of Tibetan medicinal materials were conducted on the classification task. Finally, to gain a deeper understanding of the network and to provide a reference and basis for further optimization of the design model, several sets of experiments were conducted to explore the effects of different kernels, kernel size, and group size.

4.1. Implementation Details

For the image classification task, this paper used ResNet [] as the backbone and experimented on the ImageNet [] dataset. To further improve the accuracy of the model, a variety of data augmentation strategies were used, including random flipping, random rotation, random scaling, and normalization. The input tensor was finally cropped to 224 × 224. The training configuration was set with reference to EPSANet [] and SKNet [] and optimized with stochastic gradient descent (SGD), with momentum set to 0.9, a weight decay of 1 × 10−4, and a batch size of 128. The initial learning rate was set to 0.1 and decreased by a factor of 10 every 30 epochs over a total of 120 epochs.
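For reference, this ImageNet recipe maps onto a standard PyTorch optimizer/scheduler setup roughly as follows (a sketch; the model is a stand-in and the data loop is omitted):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for an EDKSANet classifier

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 every 30 epochs, for 120 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(120):
    # ... one pass over the training set with batch size 128 would go here ...
    scheduler.step()
```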
For the Tibetan medicinal material classification task, the Adam optimizer was used when training the model. The learning rate was set to 1 × 10−3, the batch size was set to 64, the number of epochs was set to 100, the decay coefficient for each period was 0.96, β1 and β2 were set to 0.9 and 0.999, respectively, and epsilon was set to 1 × 10−8 for numerical stability.
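The Adam configuration for the Tibetan medicinal material experiments can likewise be sketched as follows; expressing the per-period decay coefficient of 0.96 with an exponential scheduler is our interpretation:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the classifier

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
# Multiply the learning rate by 0.96 after every epoch, for 100 epochs in total.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)

for epoch in range(100):
    # ... one pass over the training set with batch size 64 would go here ...
    scheduler.step()
```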
For the target detection task, on the MS-COCO [] dataset, this study used ResNet under the CNN architecture as the backbone and Faster R-CNN [] as the detector. The default configuration resized the short edge of the input image to 800. The weight decay of SGD was set to 1 × 10−4, the momentum was set to 0.9, and the batch size per GPU was 4 within 12 epochs. The learning rate was set to 0.2 and scheduled with cosine annealing decay over 120 epochs.
For the instance segmentation task, the Mask R-CNN [] detector was used, and the rest of the configuration was similar to that of the target detection task. In addition, all of the above detectors were implemented with PaddleDetection.
The above models were all trained on the Baidu AI Studio platform, and their specific configurations are shown in Table 2.
Table 2. AI Studio Model Training Environment Configuration.

4.2. Image Classification on ImageNet

For 50-layer and 101-layer networks under the CNN architecture, Table 3 shows the results of EDKSANet compared to other networks, with the DKSA module achieving very competitive performance at a low computational cost. Furthermore, compared to models under the Trans + CNN architecture, EDKSANet demonstrates higher cost-effectiveness and practicality, striking a better balance between accuracy, parameter count, and computational complexity.
Table 3. Comparison of various methods on ImageNet in terms of network parameters (in millions), floating-point operations (FLOPs), and top-1 validation accuracy (%).
As shown in Figure 5, under the CNN architecture, for top-1 accuracy, EDKSANet-50(2) and EDKSANet-101(2) have 27% and 30% fewer parameters and are 24% and 29% less computationally expensive, respectively, when compared to ResNet; these models achieve 3.70% and 3.03% higher accuracy than ResNet. For the same number of parameters and computations, EDKSANet-50(1) and EDKSANet-101(1) achieved a 0.84% and 0.77% improvement, respectively, compared to ResNet.
Figure 5. Comparison of network parameters (in millions), floating-point operations (FLOPs), and top-1 validation accuracy (%) of ResNet, RedNet, and EDKSANet at different depths.
Among the 50-layer networks under the CNN architecture, as shown in Figure 6, EDKSANet(2) showed the best accuracy compared to the other attention models, achieving a substantial improvement. Specifically, EDKSANet-50(2) outperformed SKNet, GCNet, DANet, and CBAM in top-1 accuracy by margins ranging from 0.9% to 1.26%.
Figure 6. Comparison among various attention networks for top-1 validation accuracy (%) under the CNN architecture.
When comparing the top-1 accuracy and efficiency of the different models under the CNN architecture, EDKSANet-50(1) and ECANet-101 have similar top-1 accuracies, but the former can reduce the number of parameters by 58% and computational resources by 60%. EDKSANet-101(1) performs similarly to EPSANet-101(Large) in terms of top-1 accuracy but saves 37% of the number of parameters and 38% of the computing resources. In addition, compared to SENet-101 and BAM-101, EDKSANet-101 offers a significant 1.9% improvement in top-1 accuracy while reducing the number of parameters by approximately 33% and computational resources by around 29%. This is shown in Figure 7.
Figure 7. Comparison of network parameters (in millions), floating-point operations (FLOPs), and top-1 validation accuracy (%) for EDKSANet, ECANet, EPSANet, SENet, and BAM at different depths under the CNN architecture.
Like models built on the Trans + CNN architecture, our proposed CNN-based model achieves an efficient unification of the self-attention mechanism and convolution. In terms of performance, as depicted in Figure 8, UniFormer-B excels in top-1 accuracy at 82.5%. However, this comes at the expense of a large number of model parameters and increased computational complexity. In contrast, PVTv2-B1 has the lowest parameter count and computational complexity, with a correspondingly lower top-1 accuracy. Our proposed EDKSANet, by comparison, demonstrates higher cost-effectiveness in terms of performance and practicality.
Figure 8. Comparison of network parameters (in millions), floating-point operations (FLOPs), and top-1 validation accuracy (%) for EDKSANet, PVTv2-B1, and UniFormer-B.

4.3. Object Detection and Instance Segmentation on MS COCO

In Table 4, under the CNN architecture, our proposed EDKSANet achieves the best performance in both the target detection and instance segmentation tasks.
Table 4. Comparison of object detection and instance segmentation results on COCO val2017.
In the target detection experiments under the CNN architecture, as shown in Figure 9, EDKSANet-50 has fewer parameters and lower computational costs than ResNet-50. EDKSANet-50(2) improves the AP accuracy of the Faster R-CNN detector by approximately 4.4%. Furthermore, from a complexity point of view, EDKSANet-50(1) improves AP performance by 1.1% on the Faster R-CNN detector with almost the same computational complexity as RedNet-50. It can be seen that EDKSANet-50 is a more efficient model.
Figure 9. Comparison of 50-layer ResNet, EDKSANet, and RedNet in terms of network parameters (in millions), giga floating-point operations (GFLOPs), and top-1 validation accuracy (%).
Under the CNN architecture, EDKSANet-50(2) showed more outstanding performance than the other attention networks, achieving the best results in all AP metrics. According to the data shown in Figure 9, its AP performance is 0.2% better than that of EPSANet-50 (Large), currently the best-performing attention method, while it uses only 78% of the parameters and is 28% cheaper to compute. Also striking is that the most significant performance improvement occurs for medium-sized objects (APM), at 24.6%.
In the instance segmentation experiments, the DKSA module showed a significant advantage over other attention methods under the CNN architecture. According to the data shown in Figure 10, EDKSANet-50(2) outperforms EPSANet-50 (Large), thus demonstrating the best performance among existing methods, with improvements of approximately 0.4%, 0.5%, 0.2%, 0.6%, 0.5%, and 0.3% on AP, AP50, AP75, APS, APM, and APL, respectively.
Figure 10. Comparison of 50-layer SENet, ECANet, FCANet, EPSANet, and EDKSANet in terms of AP and APS.
In the experiments on object detection and instance segmentation, we compared our model EDKSANet, designed under the CNN architecture, with PVTv2-B1 and UniFormer-B, which are built under the Trans + CNN architecture. UniFormer-B excelled in both tasks, achieving AP scores of 43.1% and 44.0%, respectively. In comparison, our EDKSANet is 2.3% and 2.5% lower on these metrics. However, the most noteworthy advantage lies in the parameter count, with our model offering reductions of 94.0% and 73.5%, respectively.
The above experimental results all validate the effectiveness of the DKSA module, and that EDKSANet has better learning and generalization capabilities and can be easily and efficiently applied to other downstream tasks. To fully consider the model’s parameter count and computational complexity, we designed the simplest version of EDKSANet, which still achieves state-of-the-art performance under the CNN architecture. UniFormer [] is the most advanced model under the Trans + CNN architecture, which achieves efficient unification of convolution and self-attention, fully utilizes the advantages of both, solves the problem of local redundancy and global dependency, and achieves efficient feature learning. Although our proposed EDKSANet may not be as accurate as UniFormer, it is worth mentioning that EDKSANet is the first network that realizes the efficient unification of convolution and self-attention under the CNN architecture, which is similar to the design philosophy of current state-of-the-art models. In the future, building upon the simplest version of EDKSANet, we can continue to construct a pyramid structure network, which is expected to further enhance the model’s performance and unleash greater potential.

4.4. Image Classification on Tibetan Medicinal Materials Dataset

The computer vision dataset of Tibetan medicinal materials includes 300 categories of common medicinal materials, including plants, animals, minerals, and jewelry. The pictures were taken at the Tibetan Medicine Factory of the Tibet Autonomous Region, the Tibetan Medicine University of the Tibet Autonomous Region, and the Tibetan Herbal Medicine Company of Tibet. On this basis, this study applies enhancement strategies to the images, including random horizontal and vertical flips, rotation, zooming, and cropping to a target size of 224 × 224. Our computer vision dataset for Tibetan medicinal materials contains 150,000 images, 500 for each category. The dataset was divided into a training set and a test set at a ratio of 8:2. Figure 11 shows a selection of sample images from the dataset.
Figure 11. Sample data of some Tibetan medicinal material.
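As an illustration, the augmentation and the 8:2 split described above could be implemented with torchvision roughly as follows; the folder path, augmentation parameters, and random seed are placeholders rather than the authors' exact pipeline:

```python
import torch
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(30),                         # assumed rotation range
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),   # zoom + crop to 224 x 224
    transforms.ToTensor(),
])

# Hypothetical layout: one sub-directory per medicinal-material category.
full_set = datasets.ImageFolder("tibetan_medicinal_materials/", transform=augment)

# 8:2 split into training and test subsets.
n_train = int(0.8 * len(full_set))
train_set, test_set = torch.utils.data.random_split(
    full_set, [n_train, len(full_set) - n_train],
    generator=torch.Generator().manual_seed(0))
```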
Taking the Tibetan medicinal material Rhodiola rosea as an example, the key features extracted for its classification and recognition using ResNet, RedNet, and EDKSANet under the CNN architecture differed significantly, as shown in Figure 12. ResNet extracts features with most of the information concentrated on the background, with less focus on the texture features of key parts of Rhodiola rosea. The distribution of feature information extracted by RedNet is relatively concentrated, with most of the key texture feature areas covered in red. This indicates that the self-attentive mechanism can effectively focus on key texture feature areas, thus making the network more accurate than ResNet in the recognition of Tibetan herbal images. The distribution of feature information extracted by EDKSANet is more concentrated than the two extreme networks with shared filter parameters described above, with key texture feature sites covered in red. This shows that the use of the DKSA module can effectively focus and extract features from the input images and improve the expressive power of the network, thus making EDKSANet’s classification and recognition of Tibetan medicinal materials images more accurate.
Figure 12. Heat maps of the feature information of Rhodiola rosea in the last convolutional layer of the intermediate modules of ResNet, RedNet, and EDKSANet []; the transition of the heat map color from blue to red indicates the increasing importance of the feature information for classification recognition.
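The heat maps in Figure 12 are Grad-CAM-style visualizations []. A minimal hook-based sketch of how such a map can be produced from the last convolutional stage of a backbone; the backbone, target layer, and input image are placeholders:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()        # stand-in for ResNet/RedNet/EDKSANet
target_layer = model.layer4[-1]              # last bottleneck block

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(x=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(x=go[0]))

image = torch.randn(1, 3, 224, 224)          # placeholder for a Rhodiola rosea image
logits = model(image)
logits[0, logits.argmax()].backward()        # backpropagate the top class score

# Grad-CAM: weight each channel by its spatially averaged gradient, then sum.
weights = grads["x"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * feats["x"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # heat map in [0, 1]
```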
Taking both model parameters and performance into account, we aim to lower actual deployment costs. In this study, we conducted classification experiments on the Tibetan medicinal materials dataset using EDKSANet alongside ResNet, RedNet, and PVTv2-B1, evaluating them in terms of parameters, FLOPs, and top-1 accuracy (%). The experimental results show that our proposed network achieves the best performance, with an accuracy of 96.85%, while keeping the number of parameters and the computational complexity under control. The specific comparison results are shown in Table 5.
Table 5. Comparison of parameters, FLOPs, and top-1 validation accuracy (%) of ResNet, RedNet, and EDKSANet on the Tibetan medicinal materials dataset.

4.5. Ablation Study

As shown in Table 6, this study used ResNet-50 under the CNN architecture as the backbone and adjusted the kernel, kernel size, and group size, respectively, to verify the effectiveness of our network on ImageNet []. First, this study compares the use of Conv, Invo, and (Conv, Invo). The experimental data reveal that adaptive parameter sharing between complementary filters allows better modeling of input features and effectively enhances the representation ability of the model. Then, to verify the impact of complementary filters at different scales on the model, we apply separate grouped dilated convolutions to feature maps at different scales, which can effectively extract multi-scale refined features and form long-range dependencies to improve the performance of the model without adding extra computation. In conclusion, the above analysis reveals that EDKSANet achieves a good balance between performance and model complexity.
Table 6. Effect of Different Kernels, Kernel Sizes, and Group Size on the Accuracy of Top-1 Validation.

5. Conclusions

In this paper, a computer vision dataset containing 300 species of Tibetan medicinal materials was constructed. Additionally, to support the scientific identification of Tibetan medicinal materials, we proposed a plug-and-play DKSA module, which increases the richness of feature expression by constructing a dynamic kernel that adaptively fuses the features of two types of complementary kernels through a cross-dimensional attention mechanism. Based on the DKSA module, we have achieved an efficient integration of convolution and self-attention under the CNN architecture. Furthermore, our proposed lightweight EDKSANet can adaptively model symmetric cross-dimensional relationships, effectively fuse local and global information, better understand contextual feature information, form long-range dependencies, and significantly improve the model's performance. Through extensive qualitative and quantitative experiments, we have demonstrated that the proposed EDKSANet outperforms other networks in tasks such as image classification, object detection, and instance segmentation. On the Tibetan medicinal materials dataset, we achieved an outstanding classification accuracy of 96.85%.
However, we should acknowledge the limitations of our research. The computer vision dataset for Tibetan medicinal materials is still limited in scale, and therefore, it is necessary to further expand the variety of medicinal materials. In the future, we will establish a multimodal dataset for Tibetan medicinal herbs and employ self-supervised learning methods to reduce reliance on labeled data, addressing the issue of scarce data for certain Tibetan medicinal herbs. Additionally, we will continue to explore innovative approaches for parameter sharing in pyramid-structured convolutional kernels to achieve a more efficient and unified network structure for convolution and self-attention.

Author Contributions

Conceptualization, J.Q. and B.W.; methodology, J.Q.; validation, J.Y. and Y.Z.; formal analysis, B.W.; investigation, Y.Z.; resources, J.J.; data curation, J.Y.; writing—original draft preparation, J.Q.; writing—review and editing, B.W.; visualization, J.Q.; supervision, B.W.; funding acquisition, J.J. All authors have read and agreed to the published version of the manuscript.

Funding

The research received financial backing and support from the National Natural Science Foundation of China, grant number 62261051.

Data Availability Statement

In this study, our team has meticulously curated a computer vision dataset comprising 300 different Tibetan medicinal materials. Currently, the dataset is being used for ongoing research and has not been made publicly available. Researchers and institutions interested in collaborating on relevant projects or conducting further analysis may request access to the dataset. To request access, please contact the corresponding author, Bianba Wangdui, at banwangg@utibet.edu.cn. Our research team will review the request, and if deemed appropriate, we will provide access to the dataset according to the terms and conditions set by our institution.

Conflicts of Interest

The authors declare no competing interests regarding the publication of this research paper, “EDKSANet: An Efficient Dual-Kernel Split Attention Neural Network for the Classification of Tibetan Medicinal Materials”, in the journal Electronics.

References

  1. Sakhteman, A.; Keshavarz, R.; Mohagheghzadeh, A. ATR-IR fingerprinting as a powerful method for identification of traditional medicine samples: A report of 20 herbal patterns. Res. J. Pharmacogn. 2015, 2, 1–8. [Google Scholar]
  2. Rohman, A.; Windarsih, A.; Hossain, M.A.M.; Johan, M.R.; Ali, M.E.; Fadzilah, N.A. Application of near-and mid-infrared spectroscopy combined with chemometrics for discrimination and authentication of herbal products: A review. J. Appl. Pharm. Sci. 2019, 9, 137–147. [Google Scholar]
  3. Lecun, Y.; Bottou, L. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  4. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  6. Girshick, R. Fast R-CNN. Comput. Sci. 2015, 2015, 1440–1448. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  9. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  10. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  11. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  12. Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12321–12330. [Google Scholar]
  13. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck Attention Module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  14. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  15. Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An Efficient Pyramid Squeeze Attention Block on Convolutional Neural Network. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–2 December 2021. [Google Scholar]
  16. Sagar, A. DMSANet: Dual Multi Scale Attention Network. In Proceedings of the International Conference on Image Analysis and Processing, Lecce, Italy, 23–27 May 2022. [Google Scholar]
  17. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
  18. Yang, B.; Bender, G.; Ngiam, J.; Le, Q.V. CondConv: Conditionally Parameterized Convolutions for Efficient Inference. arXiv 2019, arXiv:1904.04971v3. [Google Scholar]
  19. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039. [Google Scholar]
  20. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. arXiv 2017, arXiv:1703.06211. [Google Scholar]
  21. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 12. [Google Scholar] [CrossRef]
  22. Zhang, T.; Qi, G.J.; Xiao, B.; Wang, J. Interleaved group convolutions. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4373–4382. [Google Scholar]
  23. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  24. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  25. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  26. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  27. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  28. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 June 2016. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  30. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  31. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 783–792. [Google Scholar]
  32. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  33. Yang, J.; Wang, W.; Li, X.; Hu, X. Selective Kernel Networks. arXiv 2019, arXiv:1903.06586. [Google Scholar]
  34. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.; Lin, H.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R. ResNeSt: Split-Attention Networks. arXiv 2020, arXiv:2004.08955. [Google Scholar]
  35. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  36. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  37. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  38. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  39. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization. arXiv 2023, arXiv:2303.14189. [Google Scholar]
  40. Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. Mpvit: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7287–7296. [Google Scholar]
  41. Li, K.; Wang, Y.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unified transformer for efficient spatiotemporal representation learning. arXiv 2022, arXiv:2201.04676. [Google Scholar]
  42. Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef] [PubMed]
  43. Rumelhart, D.E. Learning Internal Representations by Error Propagation, Parallel Distributed Processing. In Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, USA, 1986. [Google Scholar]
  44. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the International Conference on International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010. [Google Scholar]
  45. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  46. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  47. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
