Electronics
  • Article
  • Open Access

Published: 19 October 2023

EDKSANet: An Efficient Dual-Kernel Split Attention Neural Network for the Classification of Tibetan Medicinal Materials

1 School of Information Science and Technology, Tibet University, Lhasa 850000, China
2 National Experimental Teaching Demonstration Center for Information Technology, Tibet University, Lhasa 850000, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Applications of Artificial Intelligence in Computer Vision

Abstract

Tibetan medicine has received wide acclaim for its unique diagnosis and treatment methods. The identification of Tibetan medicinal materials, which are a vital component of Tibetan medicine, is a key research area in this field. However, traditional deep learning-based visual neural networks face significant challenges in efficiently and accurately identifying Tibetan medicinal materials due to their large number, complex morphology, and the scarcity of public visual datasets. To address this issue, we constructed a computer vision dataset with 300 Tibetan medicinal materials and proposed a lightweight and efficient cross-dimensional attention mechanism, the Dual-Kernel Split Attention (DKSA) module, which can adaptively share kernel parameters in both the spatial and channel dimensions. Based on the DKSA module, we achieve an efficient unification of convolution and self-attention under the CNN architecture and develop a new lightweight backbone architecture, EDKSANet, to provide enhanced performance for various computer vision tasks. Compared to RedNet, top-1 accuracy is improved by 1.2% on the ImageNet dataset, and larger margins of +1.5 box AP for object detection and +1.3 mask AP for instance segmentation are obtained on the MS-COCO dataset. Moreover, EDKSANet achieved excellent classification performance on the Tibetan medicinal materials dataset, with an accuracy of up to 96.85%.

1. Introduction

Tibetan medicine is not only a treasure of the traditional culture of the Chinese nation but also a shining pearl in the treasure house of world medicine. Tibetan medicinal materials are the main carrier of clinical treatment in Tibetan medicine and have unique curative effects on many diseases. Since the authenticity and quality of Tibetan medicinal materials are directly related to the safety and effectiveness of clinical medication, objective and scientific identification of Tibetan medicinal materials is essential. Traditional identification methods usually rely on technologies such as infrared spectroscopy, fingerprinting, and chemical pattern recognition [,], but these methods have significant drawbacks: they require expert experience for manual feature design or extraction, and the early-stage workload is very large. The resulting models achieve high accuracy on the training set but generalize poorly to the test set. Furthermore, these methods depend on specific technologies, requiring substantial specialized knowledge and equipment, which may limit their practical application.
In contrast, neural networks [] have a strong nonlinear mapping capability, which can effectively address the shortcomings of traditional methods, particularly in modeling the relationships between features. They are also better able to automatically extract complex image features, reducing the burden of manual feature extraction. However, the informatization of Tibetan medicinal materials currently lags behind, so public standard datasets for image recognition are lacking and data scarcity has become a significant challenge. Therefore, there is an urgent need to build a large-scale computer vision dataset of Tibetan medicinal materials to support the training and evaluation of deep learning models.
According to the Chinese Tibetan Medicinal Materials Compendium, Tibetan medicinal materials include 928 species of plants, 174 species of animals, and 140 species of minerals, totaling more than 1200 species. Collecting and labeling data for so many types of medicinal materials requires significant time and labor. Additionally, each type of medicinal material has a unique morphology, color, texture, and set of characteristics, which makes the task complex. At the same time, traditional deep learning networks may become bulky when processing large-scale data and require substantial hardware resources. In addition, in practical applications, images of Tibetan medicinal materials may be subject to various limitations, such as shooting conditions, angle, and resolution, which may affect the reliability of the recognition task. To overcome these challenges, we need to adopt innovative computer vision techniques, especially lightweight and efficient network structures, to improve the accuracy and robustness of Tibetan medicinal material recognition and achieve good performance in practical applications.
Convolutional neural networks (CNNs) [] set off a wave of visual deep learning and are widely used in research fields such as image classification, target detection, instance segmentation, and semantic segmentation [,,,,,,,], which fully demonstrates the effectiveness of CNNs in computer vision. A CNN has two main properties: spatial invariance and channel specificity. Spatial invariance means that the CNN uses a shared convolution kernel to process different positions in the image, thereby reducing the number of parameters and ensuring the translation equivariance of visual features. Channel specificity means that the convolution kernels in different output channels capture different semantic information, improving the learning ability of the model. However, convolution cannot model long-distance relationships between different visual elements in the spatial dimension, and there is information redundancy between different output channels. Involution [], the inverse technique, was therefore proposed. These two techniques focus on different aspects: convolution emphasizes the interaction between channels, while involution focuses on modeling information within the spatial range. The former implements parameter sharing in the spatial dimension, while the latter implements parameter sharing in the channel dimension; both are extreme forms of kernel parameter sharing. Achieving adaptive kernel parameter sharing across the spatial and channel dimensions to balance speed and accuracy is therefore a meaningful research direction.
In previous studies, the Bottleneck Attention Module (BAM) [] and the Convolutional Block Attention Module (CBAM) [] effectively combined spatial and channel attention, enhanced the calibration ability of these two dimensions, and realized adaptive feature optimization. However, two important challenges remain. On the one hand, although these modules can effectively focus on local information, they still need to efficiently aggregate contextual semantic information at different scales to overcome the difficulty of capturing long-range dependencies. On the other hand, although using the same convolution kernel within a batch reduces the computational cost, it sacrifices the performance and feature expression ability of the model. Efficient, low-cost dynamic kernels have therefore been proposed to address this problem. EPSANet [], DMSANet [], and Res2Net [] are based on multi-scale representation and the capture of long-distance features, while CondConvNet [], Dynamic ConvNet [], and Deformable ConvNet [] use dynamic kernels for feature extraction. However, the network architectures described above focus on designing complex attention modules to perform specific functions, inevitably leading to a significant increase in model complexity. To overcome these shortcomings, this paper proposes a high-performance and low-cost Dual-Kernel Split Attention (DKSA) module. We replaced the 3 × 3 convolution kernel in the ResNet [] bottleneck block with the DKSA module to obtain an Efficient Dual-Kernel Split Attention (EDKSA) block, and stacked EDKSA blocks in ResNet style to form a new lightweight backbone, EDKSANet. As shown in Figure 1, EDKSANet achieves better top-1 accuracy and is more efficient in parameter usage. The main contributions of this work are summarized as follows:
Figure 1. Comparing the accuracy of different methods. The circle area reflects the network parameters and FLOPs of different models.
  • A total of 300 common Tibetan medicinal materials were photographed, and with the help of data enhancement strategies, a computer vision dataset of Tibetan medicinal materials was successfully constructed. This provides a new opportunity for the further use of visual deep learning to scientifically identify Tibetan medicinal materials and promote the development of the Tibetan medicinal material industry.
  • A novel Efficient Dual-Kernel Split Attention (EDKSA) block is proposed, which constructs a dynamic kernel that extracts features using reciprocal filters at different scales, effectively fuses symmetric feature information at multiple scales, and generates long-range dependencies in channel interactions and spatially adaptive relationships, thereby increasing the richness and accuracy of feature representation. The block is also highly flexible and scalable and can be directly embedded into network architectures for various computer vision tasks.
  • A novel lightweight backbone architecture, EDKSANet, is proposed. It achieves an efficient unification of convolution and self-attention under the CNN architecture. EDKSANet can learn richer symmetric feature maps and dynamically calibrate the modeling between symmetric dimensions according to task requirements, thereby improving the learning and generalization capabilities of the model. Extensive experiments verify that EDKSANet not only significantly improves performance on image classification, target detection, and instance segmentation tasks on the ImageNet and MS COCO datasets but also achieves excellent results on the classification of Tibetan medicinal materials.

3. Method

In this section, we will provide a comprehensive overview of the Dual-Kernel Split Attention (DKSA) module, offering an in-depth explanation of its construction steps.

3.1. DKSA Module

The motivation of this work is to build a more efficient and effective cross-dimensional interactive attention mechanism for complementary features. To this end, we propose a novel Dual-Kernel Split Attention (DKSA) module, as shown in Figure 2. The DKSA module is realized in the following five steps. First, the proposed Split Transform Concat (STC) module produces multi-scale symmetric information feature maps. Second, a multilayer perceptron (MLP) [] is used to extract the attention vectors of the symmetric feature maps, yielding the channel attention vectors of the complementary kernels. Next, the channel attention vectors are recalibrated with a cross-dimensional attention mechanism to obtain the recalibrated weights of the symmetric feature maps. The calibrated weights are then multiplied element-wise by the corresponding feature maps. Finally, the interaction of symmetric information between groups is completed through a channel shuffle.
Figure 2. The structure of the proposed Dual-Kernel Split Attention (DKSA) module.
The basic module used within the DKSA module to extract multi-scale symmetric feature information is the STC module. Its structure is shown in Figure 3, and it is realized by three operators: Split, Transform, and Concat.
Figure 3. A detailed illustration of the proposed Split Transform Concat (STC) module.
Split: Any feature map $X \in \mathbb{R}^{C \times H \times W}$ is divided into 2 groups along the channel dimension, denoted as $[X_0, X_1]$. Each group has $C' = C/2$ channels, where $C$ must be divisible by 2; the feature map groups are then $X_i \in \mathbb{R}^{C' \times H \times W}, i = 0, 1$.
Transform: Based on the split, we process the input tensors in parallel using complementary filters of different scales. However, as the kernel size increases, the number of parameters also increases significantly, so we adopt group convolution to perform two transformations: $\tilde{F}: F_0 = \mathrm{Conv}|\mathrm{Invo}(K_0 \times K_0, G_0)(X_0)$ and $\hat{F}: F_1 = \mathrm{Invo}|\mathrm{Conv}(K_1 \times K_1, G_1)(X_1)$. A symmetric feature map generated by the complementary filters is thus obtained. Here, $\mathrm{Conv}$ and $\mathrm{Invo}$ denote the convolution kernel and the involution kernel, respectively, $K$ denotes the receptive field (kernel size), and $G$ denotes the number of groups. The symbol $|$ indicates a selection: if $\tilde{F}$ performs $\mathrm{Conv}$, then $\hat{F}$ performs $\mathrm{Invo}$, and vice versa. The relationship between $K$ and $G$ is given in Formula (1).
$$G = \begin{cases} 4 \times (K + 1), & \mathrm{Conv} \\ \dfrac{C}{4 \times (K + 1)}, & \mathrm{Invo} \end{cases} \qquad (1)$$
The generation method of the 3 × 3 $\mathrm{Invo}$ kernel follows RedNet [], while the 7 × 7 $\mathrm{Invo}$ kernel is realized as a 3 × 3 kernel with a dilation rate of 2 for the nonlinear transformation rather than focusing on a single pixel, which better captures distant pixel information and low-level features in the image and enhances the model's perception. By default, $\tilde{F}: F_0 = \mathrm{Conv}(3 \times 3, 32)(X_0)$ and $\hat{F}: F_1 = \mathrm{Invo}(7 \times 7, C/16)(X_1)$ are performed, where the 7 × 7 kernel is likewise realized as a 3 × 3 convolution with a dilation rate of 2.
Concat: Concatenating $F_0$ and $F_1$ along the channel dimension yields $F \in \mathbb{R}^{C \times H \times W}$, a multi-scale symmetric pre-processed feature map, where $C$, $H$, and $W$ denote the channel, height, and width, respectively.
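To make the Split, Transform, and Concat steps concrete, the following PyTorch sketch shows one possible realization of the STC module under the default setting ($F_0 = \mathrm{Conv}(3 \times 3, 32)$, $F_1 = \mathrm{Invo}(7 \times 7, \cdot)$, with the 7 × 7 receptive field realized as a dilated 3 × 3 kernel). The `Involution2d` class follows the general RedNet formulation cited above; all class names, argument names, and default values are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Involution2d(nn.Module):
    """Minimal involution: the kernel is generated from the input itself and is
    shared across a group of channels (our reading of the RedNet formulation)."""

    def __init__(self, channels, kernel_size=3, dilation=1, groups=None, reduction=4):
        super().__init__()
        self.k, self.dilation = kernel_size, dilation
        self.groups = groups or max(channels // 16, 1)
        self.pad = dilation * (kernel_size - 1) // 2
        # Kernel-generation branch: 1x1 reduce -> 1x1 span to K*K weights per group.
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size * kernel_size * self.groups, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # One K*K spatial kernel per group and per position: (B, G, K*K, H, W).
        kernel = self.span(self.reduce(x)).view(b, self.groups, self.k * self.k, h, w)
        # Neighbourhoods of the input: (B, G, C//G, K*K, H, W).
        patches = F.unfold(x, self.k, dilation=self.dilation, padding=self.pad)
        patches = patches.view(b, self.groups, c // self.groups, self.k * self.k, h, w)
        # Position-wise weighted sum over each K*K neighbourhood.
        return (kernel.unsqueeze(2) * patches).sum(dim=3).view(b, c, h, w)


class STC(nn.Module):
    """Split-Transform-Concat: split the channels into two halves, apply the two
    complementary filters, and concatenate the results."""

    def __init__(self, channels, conv_groups=32):
        super().__init__()
        half = channels // 2                       # C' = C / 2
        self.conv_branch = nn.Conv2d(half, half, 3, padding=1,
                                     groups=conv_groups, bias=False)
        # 7x7 receptive field realized as a 3x3 involution with dilation 2;
        # applying the paper's C/16 group rule to the half channels is our assumption.
        self.invo_branch = Involution2d(half, kernel_size=3, dilation=2,
                                        groups=max(half // 16, 1))

    def forward(self, x):
        x0, x1 = torch.chunk(x, 2, dim=1)                     # Split
        f0, f1 = self.conv_branch(x0), self.invo_branch(x1)   # Transform
        return torch.cat([f0, f1], dim=1)                     # Concat


if __name__ == "__main__":
    y = STC(256)(torch.randn(2, 256, 56, 56))
    print(y.shape)  # torch.Size([2, 256, 56, 56])
```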
The MLP consists of two parts: a squeeze layer and an excitation layer. The former encodes global information, and the latter adaptively recalibrates channel relations. To incorporate global spatial information into the channel descriptors, we use the simplest form of global average pooling (GAP), as given in Equation (2). The attention weight $W_c$ of the $c$-th channel in the MLP is given in Equation (3).
$$G_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j) \qquad (2)$$
$$W_c = \sigma\left(W_1\,\delta\left(W_0(G_c)\right)\right) \qquad (3)$$
The Rectified Linear Unit (ReLU) [] operator is represented by $\delta$, the weight matrices of the MLP layers are denoted by $W_0 \in \mathbb{R}^{C \times \frac{C}{r}}$ and $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, and the sigmoid activation function is denoted by $\sigma$. The MLP can adaptively assign channel attention weights to $F \in \mathbb{R}^{C \times H \times W}$, perform cross-channel interaction, and fuse context information. $Z_i \in \mathbb{R}^{C' \times 1 \times 1}$ is the attention-weight vector of each of the two complementary filters, as given in Equation (4), where the attention values $Z_i$ are derived from the feature map $F_i$.
$$Z_i = \mathrm{MLP}(F_i), \quad i = 0, 1 \qquad (4)$$
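Equations (2)–(4) correspond to a squeeze-and-excitation style MLP applied to each branch feature map $F_i$. A minimal sketch, with the reduction ratio `r` as an assumed hyperparameter and our own naming:

```python
import torch
import torch.nn as nn


class ChannelMLP(nn.Module):
    """Squeeze (global average pooling, Eq. (2)) followed by excitation
    (two fully connected layers with ReLU and sigmoid, Eq. (3))."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # G_c in Eq. (2)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r),      # W_0
            nn.ReLU(inplace=True),                   # delta
            nn.Linear(channels // r, channels),      # W_1
            nn.Sigmoid(),                            # sigma
        )

    def forward(self, f):
        b, c, _, _ = f.shape
        g = self.squeeze(f).view(b, c)
        return self.excite(g).view(b, c, 1, 1)       # Z_i of Eq. (4)


# Usage: apply the same MLP form to each branch to obtain Z_0 and Z_1.
mlp = ChannelMLP(128)
z = mlp(torch.randn(2, 128, 56, 56))                 # shape (2, 128, 1, 1)
```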
Next, a cross-dimensional soft attention mechanism is used to recalibrate the weights of the two types of complementary filters, adaptively adjusting the modeling of the relationships between the different dimensions. In Equations (5) and (6), Softmax is used to obtain the recalibrated channel attention weights $att_i, i = 0, 1$. This effectively enables the interaction of local and global attention between the different filters, forming long-range cross-dimensional channel dependencies.
$$att_0 = \mathrm{Softmax}(Z_0) = \frac{\exp(Z_0)}{\exp(Z_0) + \exp(Z_1)} \qquad (5)$$
$$att_1 = \mathrm{Softmax}(Z_1) = \frac{\exp(Z_1)}{\exp(Z_0) + \exp(Z_1)} \qquad (6)$$
Next, the filter channel attention weights for the feature rescaling are cascaded and fused using the ⊕ operator to obtain the entire channel attention vector, att, as shown in Equation (7).
$$att = att_0 \oplus att_1 \qquad (7)$$
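Equations (5)–(7) normalize the two branch attention vectors against each other, channel by channel. A sketch, assuming `z0` and `z1` are the MLP outputs from Equation (4):

```python
import torch


def recalibrate(z0, z1):
    """Cross-branch softmax of Eqs. (5)-(6) and concatenation of Eq. (7):
    for every channel, att0 + att1 = 1."""
    z = torch.stack([z0, z1], dim=0)        # (2, B, C', 1, 1) -- branch axis first
    att = torch.softmax(z, dim=0)           # softmax over the two branches
    att0, att1 = att[0], att[1]
    return att0, att1, torch.cat([att0, att1], dim=1)   # att = att_0 (+) att_1


att0, att1, att = recalibrate(torch.randn(2, 128, 1, 1), torch.randn(2, 128, 1, 1))
```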
We then multiply the recalibrated filter attention weights by the corresponding feature maps $F_i$ using the ⊙ operator to obtain the feature maps $Y_i$ weighted by the complementary filter channel attention, as expressed in Equation (8).
$$Y_i = F_i \odot att_i, \quad i = 0, 1 \qquad (8)$$
Finally, to maintain the integrity of the feature representation and avoid corrupting the information of the original feature maps, the concatenation operator Cat, which is more efficient here than the summation operator, is used. In addition, a channel shuffle Ⓢ is applied to enable feature interaction between the complementary filters. The final enriched output feature map can be expressed as in Equation (9).
$$output = \mathrm{Cat}\left(\left[Y_0, Y_1\right]\right) \qquad (9)$$
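The reweighting, concatenation, and channel shuffle of Equations (8) and (9) can be sketched as follows; the shuffle is our ShuffleNet-style reading of the Ⓢ operator, interleaving the channels of the two branches:

```python
import torch


def channel_shuffle(x, groups=2):
    """Interleave channels across groups so the two branches exchange information."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(b, c, h, w))


def fuse(f0, f1, att0, att1):
    """Eq. (8): Y_i = F_i * att_i; Eq. (9): concatenation followed by a shuffle."""
    y0, y1 = f0 * att0, f1 * att1
    return channel_shuffle(torch.cat([y0, y1], dim=1), groups=2)


out = fuse(torch.randn(2, 128, 56, 56), torch.randn(2, 128, 56, 56),
           torch.rand(2, 128, 1, 1), torch.rand(2, 128, 1, 1))  # (2, 256, 56, 56)
```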
Based on the above design, the DKSA module effectively implements the adaptive parameter sharing of a kernel in terms of spatial extent and channels, fuses the characteristics of complementary kernels at different scales, and outputs a richer feature information map. In addition, the module integrates cross-dimensional attention into each feature group block, enhancing the association between local features and global information and improving the representational capability of the model.

3.2. Network Design

As shown in Figure 4, the EDKSA block is obtained by replacing the 3 × 3 convolutional kernel in the ResNet bottleneck block with the DKSA module. The EDKSA block can adaptively model the key information in the feature map, enabling better fusion and expression of the complementary features extracted by the dynamic kernel and thus increasing the richness of feature expression. A new lightweight backbone architecture, EDKSANet, is designed in this paper by stacking EDKSA blocks in ResNet style. EDKSANet can dynamically calibrate the modeling of symmetric inter-dimensional relationships to the needs of different visual tasks, improving the synergy and complementarity of feature information and capturing long-range dependencies. EDKSANet-50 has two configurations, EDKSANet-50(1) and EDKSANet-50(2), as shown in Table 1.
Figure 4. Block description and comparison of ResNet, RedNet, and EDKSANet, where the two structures of DKSA Module, DKSA Module (1), and DKSA Module (2) are described in Table 1.
Table 1. Design of the proposed EDKSANet.
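Structurally, the EDKSA block is a standard ResNet bottleneck whose middle 3 × 3 convolution is swapped for the DKSA module. The sketch below passes the mixer in as a factory, so that `nn.Conv2d` reproduces the plain bottleneck and a DKSA module assembled from the sketches in Section 3.1 would reproduce the EDKSA block; the layer names and the factory interface are our own, not the authors' code.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """ResNet-style bottleneck: 1x1 reduce -> mixer -> 1x1 expand, with residual."""
    expansion = 4

    def __init__(self, in_ch, width, mixer_factory, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, width, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)
        # In EDKSANet, this mixer is the DKSA module instead of a 3x3 convolution.
        self.mixer = mixer_factory(width, stride)
        self.bn2 = nn.BatchNorm2d(width)
        self.conv3 = nn.Conv2d(width, width * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(width * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.mixer(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)


# Plain ResNet bottleneck; swapping the lambda for a DKSA module yields an EDKSA block.
conv3x3 = lambda w, s: nn.Conv2d(w, w, 3, stride=s, padding=1, bias=False)
block = Bottleneck(256, 64, conv3x3)
y = block(torch.randn(2, 256, 56, 56))  # (2, 256, 56, 56)
```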

4. Experiments

To validate the performance of EDKSANet, image classification, target detection, and instance segmentation were first tested on the publicly available datasets ImageNet [] and MS-COCO []. Next, experiments on the computer vision dataset of Tibetan medicinal materials were conducted on the classification task. Finally, to gain a deeper understanding of the network and to provide a reference and basis for further optimization of the design model, several sets of experiments were conducted to explore the effects of different kernels, kernel size, and group size.

4.1. Implementation Details

For the image classification task, this paper used ResNet [] as the backbone and experimented on the ImageNet [] dataset. To further improve the accuracy of the model, a variety of data augmentation strategies were used, including random flipping, random rotation, random scaling, and normalization. The input tensor was finally cropped to 224 × 224. The training configuration was set with reference to EPSANet [] and SKNet [] and optimized with stochastic gradient descent (SGD), with momentum set to 0.9, a weight decay of 1 × 10−4, and a batch size of 128. The initial learning rate was set to 0.1 and decreased by a factor of 10 every 30 epochs over a total of 120 epochs.
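For reference, this ImageNet recipe maps onto a standard PyTorch optimizer/scheduler setup roughly as follows (a sketch; the model is a stand-in and the data loop is omitted):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for an EDKSANet classifier

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 every 30 epochs, for 120 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(120):
    # ... one pass over the training set with batch size 128 would go here ...
    scheduler.step()
```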
For the Tibetan medicinal material classification task, the Adam optimizer was used when training the model. The learning rate was set to 1 × 10−3, the batch size was set to 64, the number of epochs was set to 100, the decay coefficient for each period was 0.96, β1 and β2 were set to 0.9 and 0.999, respectively, and epsilon was set to 1 × 10−8 for numerical stability.
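The Adam configuration for the Tibetan medicinal material experiments can likewise be sketched as follows; expressing the per-period decay coefficient of 0.96 with an exponential scheduler is our interpretation:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the classifier

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
# Multiply the learning rate by 0.96 after every epoch, for 100 epochs in total.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)

for epoch in range(100):
    # ... one pass over the training set with batch size 64 would go here ...
    scheduler.step()
```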
For the target detection task, on the MS-COCO [] dataset, this study used ResNet under the CNN architecture as the backbone and Faster R-CNN [] as the detector. The default configuration resized the short edge of the input image to 800. The weight decay of SGD was set to 1 × 10−4, the momentum was set to 0.9, and the batch size per GPU was 4 within 12 epochs. The learning rate was set to 0.2 and scheduled with cosine annealing decay over 120 epochs.
For the instance segmentation task, the Mask R-CNN [] detector was used, and the rest of the configuration was similar to that of the target detection task. In addition, all of the above detectors were implemented with PaddleDetection.
The above models were all trained on the Baidu AI Studio platform, and their specific configurations are shown in Table 2.
Table 2. AI Studio Model Training Environment Configuration.

4.2. Image Classification on ImageNet

For 50-layer and 101-layer networks under the CNN architecture, Table 3 shows the results of EDKSANet compared to other networks, with the DKSA module achieving very competitive performance at a low computational cost. Furthermore, compared to models under the Trans + CNN architecture, EDKSANet demonstrates higher cost-effectiveness and practicality, striking a better balance between accuracy, parameter count, and computational complexity.
Table 3. Comparison of various methods on ImageNet in terms of network parameters (in millions), floating-point operations (FLOPs), and top-1 validation accuracy (%).
As shown in Figure 5, under the CNN architecture, for top-1 accuracy, EDKSANet-50(2) and EDKSANet-101(2) have 27% and 30% fewer parameters and are 24% and 29% less computationally expensive, respectively, when compared to ResNet; these models achieve 3.70% and 3.03% higher accuracy than ResNet. For the same number of parameters and computations, EDKSANet-50(1) and EDKSANet-101(1) achieved a 0.84% and 0.77% improvement, respectively, compared to ResNet.
Figure 5. Comparison of network parameters (in millions), floating-point operations (FLOPs), and top-1 validation accuracy (%) of ResNet, RedNet, and EDKSANet at different depths.
Among the 50-layer networks under the CNN architecture, as shown in Figure 6, EDKSANet(2) showed the best accuracy compared to the other attention models, achieving a substantial improvement. Specifically, EDKSANet-50(2) outperformed SKNet, GCNet, DANet, and CBAM in top-1 accuracy by margins ranging from 0.9% to 1.26%.
Figure 6. Comparison among various attention networks for top-1 validation accuracy (%) under the CNN architecture.
When comparing the top-1 accuracy and efficiency of the different models under the CNN architecture, EDKSANet-50(1) and ECANet-101 have similar top-1 accuracies, but the former can reduce the number of parameters by 58% and computational resources by 60%. EDKSANet-101(1) performs similarly to EPSANet-101(Large) in terms of top-1 accuracy but saves 37% of the number of parameters and 38% of the computing resources. In addition, compared to SENet-101 and BAM-101, EDKSANet-101 offers a significant 1.9% improvement in top-1 accuracy while reducing the number of parameters by approximately 33% and computational resources by around 29%. This is shown in Figure 7.
Figure 7. Comparison of network parameters (in millions), floating-point operations (FLOPs), and top-1 validation accuracy (%) for EDKSANet, ECANet, EPSANet, SENet, and BAM at different depths under the CNN architecture.
Like models built on the Trans + CNN architecture, our proposed CNN-based model achieves an efficient unification of the self-attention mechanism and convolution. In terms of performance, as depicted in Figure 8, UniFormer-B excels in top-1 accuracy at 82.5%. However, this comes at the expense of a large number of model parameters and increased computational complexity. In contrast, PVTv2-B1 has the lowest parameter count and computational complexity, with a correspondingly lower top-1 accuracy. Our proposed EDKSANet, by comparison, demonstrates higher cost-effectiveness in terms of performance and practicality.
Figure 8. Comparison of network parameters (in millions), floating-point operations (FLOPs), and top-1 validation accuracy (%) for EDKSANet, PVTv2-B1, and UniFormer-B.

4.3. Object Detection and Instance Segmentation on MS COCO

In Table 4, under the CNN architecture, our proposed EDKSANet achieves the best performance in both the target detection and instance segmentation tasks.
Table 4. Comparison of object detection and instance segmentation results on COCO val2017.
In the target detection experiments under the CNN architecture, as shown in Figure 9, EDKSANet-50 has fewer parameters and lower computational costs than ResNet-50. EDKSANet-50(2) improves the AP accuracy of the Faster R-CNN detector by approximately 4.4%. Furthermore, from a complexity point of view, EDKSANet-50(1) improves AP performance by 1.1% on the Faster R-CNN detector with almost the same computational complexity as RedNet-50. It can be seen that EDKSANet-50 is a more efficient model.
Figure 9. Comparison of 50-layer ResNet, EDKSANet, and RedNet in terms of network parameters (in millions), giga floating-point operations (GFLOPs), and top-1 validation accuracy (%).
Under the CNN architecture, EDKSANet-50(2) showed more outstanding performance than the other attention networks, achieving the best results in all AP metrics. According to the data shown in Figure 9, its AP performance is 0.2% better than that of EPSANet-50 (Large), currently the best-performing attention method, while it uses only 78% of the parameters and is 28% cheaper to compute. Also striking is that the most significant performance improvement occurs for medium-sized objects (APM), at 24.6%.
In the instance segmentation experiments, the DKSA module showed a significant advantage over other attention methods under the CNN architecture. According to the data shown in Figure 10, EDKSANet-50(2) outperforms EPSANet-50 (Large), thus demonstrating the best performance among existing methods, with improvements of approximately 0.4%, 0.5%, 0.2%, 0.6%, 0.5%, and 0.3% on AP, AP50, AP75, APS, APM, and APL, respectively.
Figure 10. Comparison of 50-layer SENet, ECANet, FCANet, EPSANet, and EDKSANet in terms of AP and APS.
In the experiments on object detection and instance segmentation, we compared our model EDKSANet, designed under the CNN architecture, with PVTv2-B1 and UniFormer-B, which are built under the Trans + CNN architecture. UniFormer-B excelled in both tasks, achieving AP scores of 43.1% and 44.0%, respectively. In comparison, our EDKSANet is 2.3% and 2.5% lower on these metrics. However, the most noteworthy advantage lies in the parameter count, with our model offering reductions of 94.0% and 73.5%, respectively.
The above experimental results all validate the effectiveness of the DKSA module, and that EDKSANet has better learning and generalization capabilities and can be easily and efficiently applied to other downstream tasks. To fully consider the model’s parameter count and computational complexity, we designed the simplest version of EDKSANet, which still achieves state-of-the-art performance under the CNN architecture. UniFormer [] is the most advanced model under the Trans + CNN architecture, which achieves efficient unification of convolution and self-attention, fully utilizes the advantages of both, solves the problem of local redundancy and global dependency, and achieves efficient feature learning. Although our proposed EDKSANet may not be as accurate as UniFormer, it is worth mentioning that EDKSANet is the first network that realizes the efficient unification of convolution and self-attention under the CNN architecture, which is similar to the design philosophy of current state-of-the-art models. In the future, building upon the simplest version of EDKSANet, we can continue to construct a pyramid structure network, which is expected to further enhance the model’s performance and unleash greater potential.

4.4. Image Classification on Tibetan Medicinal Materials Dataset

The computer vision dataset of Tibetan medicinal materials includes 300 categories of common medicinal materials, including plants, animals, minerals, and jewelry. The pictures were taken at the Tibetan Medicine Factory of the Tibet Autonomous Region, the Tibetan Medicine University of the Tibet Autonomous Region, and the Tibetan Herbal Medicine Company of Tibet. On this basis, this study applies enhancement strategies to the images, including random horizontal and vertical flips, rotation, zooming, and cropping to a target size of 224 × 224. Our computer vision dataset for Tibetan medicinal materials contains 150,000 images, 500 for each category. The dataset was divided into a training set and a test set at a ratio of 8:2. Figure 11 shows a selection of sample images from the dataset.
Figure 11. Sample data of some Tibetan medicinal material.
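As an illustration, the augmentation and the 8:2 split described above could be implemented with torchvision roughly as follows; the folder path, augmentation parameters, and random seed are placeholders rather than the authors' exact pipeline:

```python
import torch
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(30),                         # assumed rotation range
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),   # zoom + crop to 224 x 224
    transforms.ToTensor(),
])

# Hypothetical layout: one sub-directory per medicinal-material category.
full_set = datasets.ImageFolder("tibetan_medicinal_materials/", transform=augment)

# 8:2 split into training and test subsets.
n_train = int(0.8 * len(full_set))
train_set, test_set = torch.utils.data.random_split(
    full_set, [n_train, len(full_set) - n_train],
    generator=torch.Generator().manual_seed(0))
```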
Taking the Tibetan medicinal material Rhodiola rosea as an example, the key features extracted for its classification and recognition using ResNet, RedNet, and EDKSANet under the CNN architecture differed significantly, as shown in Figure 12. ResNet extracts features with most of the information concentrated on the background, with less focus on the texture features of key parts of Rhodiola rosea. The distribution of feature information extracted by RedNet is relatively concentrated, with most of the key texture feature areas covered in red. This indicates that the self-attentive mechanism can effectively focus on key texture feature areas, thus making the network more accurate than ResNet in the recognition of Tibetan herbal images. The distribution of feature information extracted by EDKSANet is more concentrated than the two extreme networks with shared filter parameters described above, with key texture feature sites covered in red. This shows that the use of the DKSA module can effectively focus and extract features from the input images and improve the expressive power of the network, thus making EDKSANet’s classification and recognition of Tibetan medicinal materials images more accurate.
Figure 12. Heat maps of the feature information of Rhodiola rosea in the last convolutional layer of the intermediate modules of ResNet, RedNet, and EDKSANet []; the transition of the heat map color from blue to red indicates the increasing importance of the feature information for classification recognition.
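The heat maps in Figure 12 are Grad-CAM-style visualizations []. A minimal hook-based sketch of how such a map can be produced from the last convolutional stage of a backbone; the backbone, target layer, and input image are placeholders:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()        # stand-in for ResNet/RedNet/EDKSANet
target_layer = model.layer4[-1]              # last bottleneck block

feats, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: feats.update(x=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(x=go[0]))

image = torch.randn(1, 3, 224, 224)          # placeholder for a Rhodiola rosea image
logits = model(image)
logits[0, logits.argmax()].backward()        # backpropagate the top class score

# Grad-CAM: weight each channel by its spatially averaged gradient, then sum.
weights = grads["x"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * feats["x"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # heat map in [0, 1]
```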
Taking both model parameters and performance into account, we aim to lower actual deployment costs. In this study, we conducted classification experiments on the Tibetan medicinal materials dataset using EDKSANet alongside ResNet, RedNet, and PVTv2-B1, evaluating them in terms of parameters, FLOPs, and top-1 accuracy (%). The experimental results show that our proposed network achieves the best performance, with an accuracy of 96.85%, while keeping the number of parameters and the computational complexity under control. The specific comparison results are shown in Table 5.
Table 5. Comparison of parameters, FLOPs, and top-1 validation accuracy (%) of ResNet, RedNet, and EDKSANet on the Tibetan medicinal materials dataset.

4.5. Ablation Study

As shown in Table 6, this study used ResNet-50 under the CNN architecture as the backbone and adjusted the kernel, kernel size, and group size, respectively, to verify the effectiveness of our network on ImageNet []. First, this study compares the use of Conv, Invo, and (Conv, Invo). The experimental data reveal that adaptive parameter sharing between complementary filters allows better modeling of input features and effectively enhances the representation ability of the model. Then, to verify the impact of complementary filters at different scales on the model, we apply separate grouped dilated convolutions to feature maps at different scales, which can effectively extract multi-scale refined features and form long-range dependencies to improve the performance of the model without adding extra computation. In conclusion, the above analysis reveals that EDKSANet achieves a good balance between performance and model complexity.
Table 6. Effect of Different Kernels, Kernel Sizes, and Group Size on the Accuracy of Top-1 Validation.

5. Conclusions

In this paper, a computer vision dataset containing 300 species of Tibetan medicinal materials was constructed. Additionally, to support the scientific identification of Tibetan medicinal materials, we proposed a plug-and-play DKSA module, which increases the richness of feature expression by constructing a dynamic kernel that adaptively fuses the features of two types of complementary kernels through a cross-dimensional attention mechanism. Based on the DKSA module, we have achieved an efficient integration of convolution and self-attention under the CNN architecture. Furthermore, our proposed lightweight EDKSANet can adaptively model symmetric cross-dimensional relationships, effectively fuse local and global information, better understand contextual feature information, form long-range dependencies, and significantly improve the model's performance. Through extensive qualitative and quantitative experiments, we have demonstrated that the proposed EDKSANet outperforms other networks in tasks such as image classification, object detection, and instance segmentation. On the Tibetan medicinal materials dataset, we achieved an outstanding classification accuracy of 96.85%.
However, we should acknowledge the limitations of our research. The computer vision dataset for Tibetan medicinal materials is still limited in scale, and therefore, it is necessary to further expand the variety of medicinal materials. In the future, we will establish a multimodal dataset for Tibetan medicinal herbs and employ self-supervised learning methods to reduce reliance on labeled data, addressing the issue of scarce data for certain Tibetan medicinal herbs. Additionally, we will continue to explore innovative approaches for parameter sharing in pyramid-structured convolutional kernels to achieve a more efficient and unified network structure for convolution and self-attention.

Author Contributions

Conceptualization, J.Q. and B.W.; methodology, J.Q.; validation, J.Y. and Y.Z.; formal analysis, B.W.; investigation, Y.Z.; resources, J.J.; data curation, J.Y.; writing—original draft preparation, J.Q.; writing—review and editing, B.W.; visualization, J.Q.; supervision, B.W.; funding acquisition, J.J. All authors have read and agreed to the published version of the manuscript.

Funding

The research received financial backing and support from the National Natural Science Foundation of China, grant number 62261051.

Data Availability Statement

In this study, our team has meticulously curated a computer vision dataset comprising 300 different Tibetan medicinal materials. Currently, the dataset is being used for ongoing research and has not been made publicly available. Researchers and institutions interested in collaborating on relevant projects or conducting further analysis may request access to the dataset. To request access, please contact the corresponding author, Bianba Wangdui, at banwangg@utibet.edu.cn. Our research team will review the request, and if deemed appropriate, we will provide access to the dataset according to the terms and conditions set by our institution.

Conflicts of Interest

The authors declare no competing interests regarding the publication of this research paper, “EDKSANet: An Efficient Dual-Kernel Split Attention Neural Network for the Classification of Tibetan Medicinal Materials”, in the journal Electronics.

References

  1. Sakhteman, A.; Keshavarz, R.; Mohagheghzadeh, A. ATR-IR fingerprinting as a powerful method for identification of traditional medicine samples: A report of 20 herbal patterns. Res. J. Pharmacogn. 2015, 2, 1–8. [Google Scholar]
  2. Rohman, A.; Windarsih, A.; Hossain, M.A.M.; Johan, M.R.; Ali, M.E.; Fadzilah, N.A. Application of near-and mid-infrared spectroscopy combined with chemometrics for discrimination and authentication of herbal products: A review. J. Appl. Pharm. Sci. 2019, 9, 137–147. [Google Scholar]
  3. Lecun, Y.; Bottou, L. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  4. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  6. Girshick, R. Fast R-CNN. Comput. Sci. 2015, 2015, 1440–1448. [Google Scholar]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  9. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  10. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  11. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  12. Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12321–12330. [Google Scholar]
  13. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck Attention Module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  14. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  15. Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An Efficient Pyramid Squeeze Attention Block on Convolutional Neural Network. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–2 December 2021. [Google Scholar]
  16. Sagar, A. DMSANet: Dual Multi Scale Attention Network. In Proceedings of the International Conference on Image Analysis and Processing, Lecce, Italy, 23–27 May 2022. [Google Scholar]
  17. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
  18. Yang, B.; Bender, G.; Ngiam, J.; Le, Q.V. CondConv: Conditionally Parameterized Convolutions for Efficient Inference. arXiv 2019, arXiv:1904.04971v3. [Google Scholar]
  19. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039. [Google Scholar]
  20. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. arXiv 2017, arXiv:1703.06211. [Google Scholar]
  21. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 12. [Google Scholar] [CrossRef]
  22. Zhang, T.; Qi, G.J.; Xiao, B.; Wang, J. Interleaved group convolutions. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4373–4382. [Google Scholar]
  23. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  24. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  25. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  26. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  27. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  28. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 June 2016. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  30. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  31. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 783–792. [Google Scholar]
  32. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  33. Yang, J.; Wang, W.; Li, X.; Hu, X. Selective Kernel Networks. arXiv 2019, arXiv:1903.06586. [Google Scholar]
  34. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.; Lin, H.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R. ResNeSt: Split-Attention Networks. arXiv 2020, arXiv:2004.08955. [Google Scholar]
  35. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  36. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  37. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  38. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  39. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization. arXiv 2023, arXiv:2303.14189. [Google Scholar]
  40. Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. Mpvit: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7287–7296. [Google Scholar]
  41. Li, K.; Wang, Y.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unified transformer for efficient spatiotemporal representation learning. arXiv 2022, arXiv:2201.04676. [Google Scholar]
  42. Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef] [PubMed]
  43. Rumelhart, D.E. Learning Internal Representations by Error Propagation, Parallel Distributed Processing. In Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, USA, 1986. [Google Scholar]
  44. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the International Conference on International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010. [Google Scholar]
  45. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  46. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  47. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
